Hello Parquet devs,

I have been working on the Drill implementation of Parquet to bring us up to full read compatibility and to implement write support. We are using the RecordConsumer interface to write data into Parquet files, and it seems we have hit a bug when writing repeated data.
I am currently doing a simple test with a repeated field at the root of the schema. I am writing data pulled from a JSON file, where each record contains one repeated Long column with seven items. The problem appears when we hit one of the page size thresholds: the ColumnWriterImpl writes only the values from the current list that fit in the current page, not the entire list. Thus a single 'value' within that column is split across two pages.

I took a look at the source, and it does not appear that ColumnWriterImpl actually ensures that a list ends before cutting off the page. Given how repetition levels are encoded, the end of a list can only be detected by reading one value from the next list (when the repetition level returns to 0). It seems the writes to the page data should be buffered inside the writer until it is determined that the entire list of values will fit in the page.

Is there something I am missing? I did not find any open issues for this, but I will work on a patch to see if I can get it working for me.

Jason Altekruse
MapR - Software Engineer
Apache Drill Team
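To make the proposed fix concrete, here is a toy Python simulation of the buffering idea. This is not the real parquet-mr code; the class name, value-count threshold, and method names are all invented for illustration. The point is only that a page boundary is allowed to be cut exactly when a value arrives with repetition level 0, i.e. at a record boundary, so a repeated list is never split across pages.

```python
# Toy simulation (NOT the actual parquet-mr API) of buffering values
# until a record boundary before deciding whether to cut a page.

PAGE_VALUE_LIMIT = 10  # stand-in for the real size-in-bytes threshold

class BufferingColumnWriter:
    def __init__(self):
        self.pages = []          # finished pages (lists of values)
        self.current_page = []   # values committed to the open page
        self.record_buffer = []  # values of the record still in progress

    def write(self, repetition_level, value):
        # Repetition level 0 marks the first value of a new record, so
        # the previously buffered record is complete and safe to commit.
        if repetition_level == 0 and self.record_buffer:
            self._commit_record()
        self.record_buffer.append(value)

    def _commit_record(self):
        # Cut the page *before* committing the buffered record if the
        # threshold was reached; whole records stay within one page.
        if len(self.current_page) >= PAGE_VALUE_LIMIT:
            self.pages.append(self.current_page)
            self.current_page = []
        self.current_page.extend(self.record_buffer)
        self.record_buffer = []

    def close(self):
        if self.record_buffer:
            self._commit_record()
        if self.current_page:
            self.pages.append(self.current_page)
            self.current_page = []

writer = BufferingColumnWriter()
# Three records, each a repeated column with seven values; only the
# first value of each record carries repetition level 0.
for record in range(3):
    for i in range(7):
        writer.write(0 if i == 0 else 1, record * 100 + i)
writer.close()

print([len(p) for p in writer.pages])  # -> [14, 7]: no record is split
```

With this buffering, the second page starts only once the threshold has been crossed at a record boundary, which is why the first page slightly overshoots the limit (14 values) rather than splitting the third list.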
