Hello Parquet devs,

I have been working more on the Drill implementation of Parquet to bring us
up to full read compatibility as well as to implement write support. We are
using the RecordConsumer interface to write data into Parquet files, and it
seems that we have hit a bug when writing repeated data.
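
For context, this is roughly how we are driving the RecordConsumer for a
record with a single repeated int64 column at the root (the field name
here is just for illustration):

    // schema: message record { repeated int64 values; }
    void writeRecord(RecordConsumer consumer, long[] values) {
      consumer.startMessage();
      if (values.length > 0) {
        consumer.startField("values", 0);
        for (long v : values) {
          consumer.addLong(v);
        }
        consumer.endField("values", 0);
      }
      consumer.endMessage();
    }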

I am currently running a simple test with a repeated field at the root of
the schema, writing data pulled from a JSON file in which each record
contains one repeated Long column with seven items. The problem appears
when we hit one of the page thresholds: the ColumnWriterImpl writes out
only the values from a list that fit in the current page, not the entire
list, so the 'value' within that column is split across two pages.
I took a look at the source, and it does not appear that the
ColumnWriterImpl actually ensures a list has ended before cutting off the
page. Given how the repetition levels are implemented, I believe the end
of a list can only be detected by reading the first value of the next
list (when the repetition level returns to 0). It seems the actual writes
to the page data should be cached inside the writer until it determines
that the entire list of values will fit in the page.
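
Roughly, I am imagining something like the sketch below. The buffers and
helpers are hypothetical, not actual ColumnWriterImpl code; the idea is
just to hold values back until repetition level 0 comes around again, and
only check the page threshold on that record boundary:

    // Rough sketch only; pendingValues/pendingRLevels/pendingDLevels
    // are hypothetical buffers, not actual ColumnWriterImpl fields.
    private final List<Long> pendingValues = new ArrayList<>();
    private final List<Integer> pendingRLevels = new ArrayList<>();
    private final List<Integer> pendingDLevels = new ArrayList<>();

    void write(long value, int repetitionLevel, int definitionLevel) {
      if (repetitionLevel == 0 && !pendingValues.isEmpty()) {
        // Seeing r=0 means the previous record's list is complete,
        // so it is now safe to flush it and check the page threshold.
        flushPendingToPage();     // hypothetical helper
        if (pageIsFull()) {       // hypothetical helper
          writePage();
        }
      }
      pendingValues.add(value);
      pendingRLevels.add(repetitionLevel);
      pendingDLevels.add(definitionLevel);
    }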

Is there something that I am missing? I did not find any open issues for
this, but I will work on a patch to see if I can get it working for me.

Jason Altekruse
MapR - Software Engineer
Apache Drill Team
