It is certainly possible to avoid the recursion and improve this.
As you mentioned, the schema is known in advance.
Pull requests are welcome if you want to take a stab at it.
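
To make the idea concrete, here is a rough sketch of one way it could look. It is not the actual Parquet code: Node and CountingWriter are hypothetical stand-ins for ColumnIO and the column writers, and cacheLeaves is a made-up helper. The assumption is that the primitive descendants of every group can be collected once, when the schema is set up, so that the per-record path becomes a flat loop over cached column ids.

```java
import java.util.ArrayList;
import java.util.List;

final class NullWriteSketch {
    /** Hypothetical stand-in for a schema node (ColumnIO in the quoted code). */
    static final class Node {
        final int leafId;                 // column id for a primitive leaf, -1 for a group
        final List<Node> children = new ArrayList<>();
        int[] leafIds;                    // ids of all primitive descendants, filled by cacheLeaves

        Node(int leafId) { this.leafId = leafId; }
        boolean isPrimitive() { return leafId >= 0; }
        Node add(Node child) { children.add(child); return this; }
    }

    /** Hypothetical stand-in for a column writer; it only counts the nulls it receives. */
    static final class CountingWriter {
        int nulls;
        void writeNull(int r, int d) { nulls++; }
    }

    /** One-time walk over the schema: cache the leaf column ids under every node. */
    static int[] cacheLeaves(Node node) {
        if (node.isPrimitive()) {
            node.leafIds = new int[] { node.leafId };
        } else {
            List<Integer> ids = new ArrayList<>();
            for (Node child : node.children) {
                for (int id : cacheLeaves(child)) {
                    ids.add(id);
                }
            }
            node.leafIds = ids.stream().mapToInt(Integer::intValue).toArray();
        }
        return node.leafIds;
    }

    /** Per-record path: a flat loop over the cached ids, with no recursion or casts. */
    static void writeNull(Node undefinedField, CountingWriter[] writers, int r, int d) {
        for (int id : undefinedField.leafIds) {
            writers[id].writeNull(r, d);
        }
    }

    public static void main(String[] args) {
        // Schema: root { leaf 0, group { leaf 1, leaf 2 } }
        Node root = new Node(-1)
            .add(new Node(0))
            .add(new Node(-1).add(new Node(1)).add(new Node(2)));
        cacheLeaves(root);

        CountingWriter[] writers = {
            new CountingWriter(), new CountingWriter(), new CountingWriter()
        };
        writeNull(root, writers, 0, 0);
        for (CountingWriter w : writers) {
            System.out.println(w.nulls);
        }
    }
}
```

Note that this sketch still writes one null per leaf column, so it only removes the repeated tree walk, not the per-column work itself.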

On Tue, Oct 21, 2014 at 9:43 AM, Yan Zhou.sc <[email protected]> wrote:

> Hi,
>
> We have a Parquet file with more than 1000 columns of nested types, and
> the columns are sparse: in each row, most columns are null.
> Writing the Parquet file is very slow, and the time is spent on CPU. The
> profiler shows that MessageColumnIORecordConsumer.writeNull is called
> recursively, and each level of recursion sees approximately 35X more
> invocations than the previous one.
>
> The following code in MessageColumnIO.java shows where the problem could
> be:
>
>     private void writeNull(ColumnIO undefinedField, int r, int d) {
>       if (undefinedField.getType().isPrimitive()) {
>         columnWriter[((PrimitiveColumnIO)undefinedField).getId()].writeNull(r, d);
>       } else {
>         GroupColumnIO groupColumnIO = (GroupColumnIO)undefinedField;
>         int childrenCount = groupColumnIO.getChildrenCount();
>         for (int i = 0; i < childrenCount; i++) {
>           writeNull(groupColumnIO.getChild(i), r, d);
>         }
>       }
>     }
>
> The recursive call to writeNull inside the loop seems to cause the
> explosion in the number of invocations.
>
> My question is: since writeNull is only called for a field that is missing
> at some level, and all of its descendants are therefore known to be missing,
> with their count known from the schema, is there a more efficient way to
> store this information than writing a missing indicator for every
> descendant?
>
> Or is there a workaround to avoid this "trap" for now?
>
>
> Thanks for your help!
>
