Hi,
We have a Parquet file with more than 1,000 columns of nested types, and the
data is sparse: in any given row, most columns are null.
When writing the Parquet file, performance is very slow and CPU-bound. A
profiler shows that MessageColumnIORecordConsumer.writeNull is called
recursively, and the number of invocations grows by roughly 35X at each level
of the recursion.
The following code in MessageColumnIO.java shows where the problem could be:
private void writeNull(ColumnIO undefinedField, int r, int d) {
  if (undefinedField.getType().isPrimitive()) {
    columnWriter[((PrimitiveColumnIO) undefinedField).getId()].writeNull(r, d);
  } else {
    GroupColumnIO groupColumnIO = (GroupColumnIO) undefinedField;
    int childrenCount = groupColumnIO.getChildrenCount();
    for (int i = 0; i < childrenCount; i++) {
      writeNull(groupColumnIO.getChild(i), r, d);
    }
  }
}
The recursion inside the loop appears to be what causes the explosion in the
number of invocations.
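To make the growth concrete, here is a small standalone sketch (not Parquet
code) that mimics the shape of that recursion; the fan-out of 35 and depth of 3
are only assumptions picked to mirror the profiler numbers:

public class NullFanOut {
  static long invocations = 0;

  // Same shape as MessageColumnIO.writeNull: a null group recurses into
  // every child until it reaches the primitive leaves.
  static void writeNull(int depth, int fanOut) {
    invocations++;
    if (depth == 0) {
      return; // a primitive leaf: one ColumnWriter.writeNull(r, d) in real code
    }
    for (int i = 0; i < fanOut; i++) {
      writeNull(depth - 1, fanOut);
    }
  }

  public static void main(String[] args) {
    writeNull(3, 35);
    // 1 + 35 + 35^2 + 35^3 = 44136 invocations for a single null group
    System.out.println("writeNull invocations for one null group: " + invocations);
  }
}

So a single missing top-level group already fans out into tens of thousands of
calls per row with a schema of this shape.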
My question is: since writeNull is only called for a field that is missing at a
given level, and all of its descendants are therefore known to be missing (with
their count known from the schema), could this information be stored more
efficiently than by writing a null indicator into every descendant column?
Or is there a workaround to avoid this "trap" for now?
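To illustrate the kind of change I am imagining, here is a rough, untested
sketch against the identifiers in the snippet above. getLeafIds() is
hypothetical and does not exist in Parquet today; it would return the
precomputed column ids of all primitive descendants of a group, built once
from the schema:

private void writeNull(ColumnIO undefinedField, int r, int d) {
  if (undefinedField.getType().isPrimitive()) {
    columnWriter[((PrimitiveColumnIO) undefinedField).getId()].writeNull(r, d);
  } else {
    // Flat loop over precomputed primitive descendants instead of a
    // recursive walk of the group subtree on every null.
    for (int leafId : ((GroupColumnIO) undefinedField).getLeafIds()) {
      columnWriter[leafId].writeNull(r, d);
    }
  }
}

This would only remove the repeated tree walk per null, though; it still
records a null entry in every descendant leaf, and avoiding those per-leaf
writes when the whole subtree is known to be missing is really what I am
asking about.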
Thanks for any help!