Hi Jason,

You are correct: the current implementation will start a new page whenever it reaches the size threshold, irrespective of row boundaries. The column reader abstracts pages away from the assembly algorithm. As you said, it would be beneficial for things like predicate push down to have pages cut on row boundaries and to have the page header carry the row count. We had planned to implement this with the new page format (https://github.com/Parquet/parquet-format/pull/64), but this is not done yet.

I'd be happy to help if you're willing to implement the change (it is not too big), or if you want to make an incremental change in that direction. The main thing would be to be able to tell from the format whether pages end on row boundaries (for backward compatibility); for example, adding an optional row_count field to the page header would be a good indicator.
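For illustration, here is a minimal, self-contained sketch of that direction. The class, field, and method names are hypothetical, not the actual parquet-mr or parquet-format API: the writer cuts a page only when the size threshold has been reached at a record boundary (repetition level 0), and it records a row count alongside each finished page.

// Minimal sketch only -- not the actual parquet-mr ColumnWriterImpl or the
// parquet-format Thrift definition. It illustrates cutting pages on row
// boundaries and recording a (proposed, optional) row count per page.

import java.util.ArrayList;
import java.util.List;

public class RowAlignedPageWriter {

    /** Hypothetical stand-in for a finished page plus its proposed metadata. */
    public static final class Page {
        public final long[] values;
        public final int rowCount;   // the proposed optional row_count field
        Page(long[] values, int rowCount) {
            this.values = values;
            this.rowCount = rowCount;
        }
    }

    private final int pageSizeThreshold;          // in values, for simplicity
    private final List<Long> currentPage = new ArrayList<>();
    private final List<Page> finishedPages = new ArrayList<>();
    private int rowsInCurrentPage = 0;

    public RowAlignedPageWriter(int pageSizeThreshold) {
        this.pageSizeThreshold = pageSizeThreshold;
    }

    /** A repetition level of 0 marks the start of a new record (row). */
    public void write(long value, int repetitionLevel) {
        if (repetitionLevel == 0) {
            // At a row boundary: the only safe place to cut a page.
            if (currentPage.size() >= pageSizeThreshold) {
                flushPage();
            }
            rowsInCurrentPage++;
        }
        currentPage.add(value);
    }

    private void flushPage() {
        long[] values = currentPage.stream().mapToLong(Long::longValue).toArray();
        finishedPages.add(new Page(values, rowsInCurrentPage));
        currentPage.clear();
        rowsInCurrentPage = 0;
    }

    public List<Page> close() {
        if (!currentPage.isEmpty()) {
            flushPage();
        }
        return finishedPages;
    }

    public static void main(String[] args) {
        // Seven-value repeated column per record, as in your test case.
        RowAlignedPageWriter writer = new RowAlignedPageWriter(10);
        for (int record = 0; record < 4; record++) {
            for (int i = 0; i < 7; i++) {
                writer.write(record * 100 + i, i == 0 ? 0 : 1);
            }
        }
        for (Page p : writer.close()) {
            System.out.println("page: " + p.values.length + " values, " + p.rowCount + " rows");
        }
    }
}

With a threshold of 10 values and 7-value records, the sketch emits pages of 14 values covering exactly 2 rows each, rather than splitting a record across pages.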
Julien

> On Jun 30, 2014, at 15:00, Jason Altekruse <[email protected]> wrote:
>
> Hi Dmitriy,
>
> Currently the implementation supports this behavior in both the reader
> and the writer, so it is technically designed to handle it. However, I
> was under the impression that part of the motivation of the page
> abstraction was defining a maximum amount of data needed to be
> decompressed if you wanted to do a point query into a parquet file
> (assuming you had some kind of index which told you which pages to look
> in to find a particular range of values for each column).
>
> We are working on adding support for repeated columns that exhibit this
> behavior, as it has been part of parquet since the initial release.
> However, it seemed like it might have been a design oversight. I don't
> think there needs to be any additional overhead to support the change:
> if we change the model a little bit to process repetition and definition
> levels up to the point where a record completes, and then at that time
> copy all of the data in a tight loop, the performance should be nearly
> identical to the current implementation, if not a little better, because
> all of the copies are a little closer together, which may help the CPU
> memory cache. If this is the case, it might be worth saving up several
> of the lists from successive records, as the page data buffer will be
> able to stay in the cache longer.
>
> -Jason
>
>
>> On Mon, Jun 30, 2014 at 12:17 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>> Not sure I have all the details straight, but it seems like this
>> caching can be problematic for very large lists. Is there a way to
>> design this so it can span pages? Or does this not make sense, since a
>> single record has to fit in a page?
>>
>>
>> On Mon, Jun 30, 2014 at 9:10 AM, Jason Altekruse <[email protected]> wrote:
>>
>>> Hello Parquet devs,
>>>
>>> I have been working more on the Drill implementation of parquet to
>>> bring us up to full read compatibility as well as implement write
>>> support. We are using the RecordConsumer interface to write data into
>>> parquet files, and it seems that we have hit a bug when writing
>>> repeated data.
>>>
>>> I am currently just doing a simple test with a repeated field at the
>>> root of the schema. I am writing in data I am pulling from a JSON
>>> file, where each record contains one repeated Long column with seven
>>> items. The problem appears when we hit one of the page thresholds: the
>>> ColumnWriterImpl writes only the values from one of the lists that fit
>>> in the current page, not the entire list. Thus the 'value' within that
>>> column is being split across two pages. I took a look at the source,
>>> and it does not look like the ColumnWriterImpl actually ensures that a
>>> list ends before cutting off the page. With the implementation of the
>>> repetition levels, I believe this can only be detected by actually
>>> reading one value from the next list (when the repetition level hits
>>> 0). It seems like the actual writes to the page data should be cached
>>> inside of the writer until it is determined that the entire list of
>>> values will fit in the page.
>>>
>>> Is there something that I am missing? I did not find any open issues
>>> for this, but I will work on a patch to see if I can get it working
>>> for me.
>>>
>>> Jason Altekruse
>>> MapR - Software Engineer
>>> Apache Drill Team
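Below is a minimal, self-contained sketch of the record-buffering approach described in the quoted message. The names are hypothetical, not the actual ColumnWriterImpl API: values and their repetition and definition levels are staged in a per-record buffer and only copied into the page, in one tight loop, once a repetition level of 0 signals that the previous record is complete; the page-size check then happens between records, so a record is never split across pages.

// Sketch only -- hypothetical names, not the actual parquet-mr ColumnWriterImpl.
// Each record is staged in a small buffer and committed to the page as a whole,
// so pages can only be cut on record boundaries.

import java.util.ArrayList;
import java.util.List;

public class BufferingColumnWriter {

    private static final class Entry {
        final int repetitionLevel, definitionLevel;
        final long value;
        Entry(int r, int d, long v) { repetitionLevel = r; definitionLevel = d; value = v; }
    }

    private final int pageSizeThreshold;                        // in values, for simplicity
    private final List<Entry> recordBuffer = new ArrayList<>(); // current, incomplete record
    private final List<Entry> page = new ArrayList<>();         // values committed to the page
    private int pagesWritten = 0;

    public BufferingColumnWriter(int pageSizeThreshold) {
        this.pageSizeThreshold = pageSizeThreshold;
    }

    public void write(long value, int repetitionLevel, int definitionLevel) {
        if (repetitionLevel == 0 && !recordBuffer.isEmpty()) {
            // A repetition level of 0 means the previous record is complete:
            // commit it, then decide whether the page should be cut.
            commitRecord();
        }
        recordBuffer.add(new Entry(repetitionLevel, definitionLevel, value));
    }

    private void commitRecord() {
        // The tight copy loop: the whole record goes into the page at once.
        page.addAll(recordBuffer);
        recordBuffer.clear();
        if (page.size() >= pageSizeThreshold) {
            System.out.println("writing page " + (++pagesWritten) + " with " + page.size() + " values");
            page.clear();
        }
    }

    public void close() {
        if (!recordBuffer.isEmpty()) {
            commitRecord();
        }
        if (!page.isEmpty()) {
            System.out.println("writing final page with " + page.size() + " values");
        }
    }

    public static void main(String[] args) {
        BufferingColumnWriter writer = new BufferingColumnWriter(10);
        // Three records, each a repeated column with seven values.
        for (int record = 0; record < 3; record++) {
            for (int i = 0; i < 7; i++) {
                writer.write(record * 10 + i, i == 0 ? 0 : 1, 1);
            }
        }
        writer.close();
    }
}

With a threshold of 10 values and three 7-value records, the sketch writes one 14-value page and a final 7-value page, both ending on record boundaries.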
