Hi Dmitriy,

Currently the implementation supports this behavior in both the reader and the writer, so it is technically designed to handle it. However, I was under the impression that part of the motivation for the page abstraction was to define a maximum amount of data that needs to be decompressed for a point query into a parquet file (assuming you had some kind of index that told you which pages to look in to find a particular range of values for each column).
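To make that concrete, the access pattern I have in mind would look something like the sketch below. The index and the types here are purely hypothetical, nothing in the format today provides them; the point is only that a lookup never has to decompress more than one page.

    // Purely hypothetical: PageLocator and readAndDecompressPageFor() are
    // made-up names, not parquet APIs. The sketch only illustrates why a
    // bounded page size matters for point queries.
    interface PageLocator {
        // Finds, reads, and decompresses the single page expected to hold the key.
        long[] readAndDecompressPageFor(long key);
    }

    class PointQueryExample {
        static boolean contains(PageLocator index, long key) {
            // The work done here is bounded by the size of one page, which is
            // exactly the property the page abstraction is supposed to provide.
            for (long value : index.readAndDecompressPageFor(key)) {
                if (value == key) {
                    return true;
                }
            }
            return false;
        }
    }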
We are working on adding support for repeated columns that exhibit this behavior, as it has been part of parquet since the initial release. However, it seemed like it might have been a design oversight. I don't think there needs to be any additional overhead to support the change. If we change the model a little bit to process repetition and definition levels up to the point where a record completes, and only then copy all of the data into the page in a tight loop, the performance should be nearly identical to the current implementation, if not a little better, because the copies happen closer together, which may help the CPU memory cache. If that is the case, it might even be worth saving up the lists from several successive records, since the page data buffer will be able to stay in the cache longer.
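For concreteness, here is a rough sketch of the buffering I have in mind. This is not the actual ColumnWriterImpl code; the class and the abstract page-level methods are placeholders I made up for illustration. The point is only that the page-size check happens on a record boundary (repetition level 0), so a list can never be split across two pages.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: buffer levels and values until the next record starts,
    // then copy the whole record into the page and decide whether to end
    // the page. The abstract methods stand in for the real page machinery.
    abstract class RecordBufferingColumnWriter {

        private final List<Integer> repLevels = new ArrayList<>();
        private final List<Integer> defLevels = new ArrayList<>();
        private final List<Long> values = new ArrayList<>();

        void write(long value, int repetitionLevel, int definitionLevel) {
            // Repetition level 0 marks the start of a new record, so the
            // previously buffered record is complete and can be copied out.
            if (repetitionLevel == 0 && !values.isEmpty()) {
                flushBufferedRecord();
                if (currentPageSize() >= pageSizeThreshold()) {
                    finishPage();   // only cut the page between records
                }
            }
            repLevels.add(repetitionLevel);
            defLevels.add(definitionLevel);
            values.add(value);
        }

        private void flushBufferedRecord() {
            // Copy the buffered levels and values into the page in one tight
            // loop, keeping the copies close together in time.
            for (int i = 0; i < values.size(); i++) {
                writeToPage(repLevels.get(i), defLevels.get(i), values.get(i));
            }
            repLevels.clear();
            defLevels.clear();
            values.clear();
        }

        // A real writer would also flush the last buffered record when the
        // column is closed. The placeholders below stand in for the actual
        // page-writing machinery.
        protected abstract void writeToPage(int repLevel, int defLevel, long value);
        protected abstract long currentPageSize();
        protected abstract long pageSizeThreshold();
        protected abstract void finishPage();
    }

The extra buffering is bounded by the size of one record, since the lists are cleared at every record boundary, so the memory overhead should be small; saving up several records would just mean deferring flushBufferedRecord() until a few boundaries have passed.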
-Jason

On Mon, Jun 30, 2014 at 12:17 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Not sure I have all the details straight, but it seems like this caching
> can be problematic for very large lists. Is there a way to design this so
> it can span pages? Or does this not make sense, since a single record has
> to fit in a page?
>
>
> On Mon, Jun 30, 2014 at 9:10 AM, Jason Altekruse <[email protected]>
> wrote:
>
> > Hello Parquet devs,
> >
> > I have been working more on the Drill implementation of parquet to bring
> > us up to full read compatibility as well as implement write support. We
> > are using the RecordConsumer interface to write data into parquet files,
> > and it seems that we have hit a bug when writing repeated data.
> >
> > I am currently just doing a simple test with a repeated field at the root
> > of the schema. I am writing in data pulled from a json file, where each
> > record contains one repeated Long column with seven items. The problem
> > appears when we hit one of the page thresholds: the ColumnWriterImpl
> > writes only the values from the list that fit in the current page, not
> > the entire list, so the 'value' within that column is split across two
> > pages. I took a look at the source and it does not look like the
> > ColumnWriterImpl actually ensures that a list ends before cutting off
> > the page. With the implementation of the repetition levels, I believe
> > this can only be indicated by actually reading one value from the next
> > list (when the repetition level hits 0). It seems like the actual writes
> > to the page data should be cached inside of the writer until it is
> > determined that the entire list of values will fit in the page.
> >
> > Is there something that I am missing? I did not find any open issues for
> > this, but I will work on a patch to see if I can get it working for me.
> >
> > Jason Altekruse
> > MapR - Software Engineer
> > Apache Drill Team