Not sure I have all the details straight, but it seems like this caching
can be problematic for very large lists. Is there a way to design this so
it can span pages? Or does this not make sense since a single record has to
fit in a page?
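
For what it's worth, here is a rough sketch of the buffering Jason
describes below. The class and method names are made up for illustration
(this is not the real parquet-mr interface): values are held per record
and only committed to the page once the next repetition level 0 arrives,
so the page-size check can never split a list.

import java.util.ArrayList;
import java.util.List;

class BufferingColumnWriter {
  private final int pageSizeThreshold;                      // bytes per page
  private final List<Long> currentRecord = new ArrayList<>();
  private final List<Long> currentPage = new ArrayList<>();
  private boolean firstValue = true;

  BufferingColumnWriter(int pageSizeThreshold) {
    this.pageSizeThreshold = pageSizeThreshold;
  }

  // repetitionLevel == 0 marks the start of a new record, so the
  // previously buffered record is complete and safe to commit.
  void write(long value, int repetitionLevel) {
    if (repetitionLevel == 0 && !firstValue) {
      commitRecord();
    }
    firstValue = false;
    currentRecord.add(value);
  }

  private void commitRecord() {
    currentPage.addAll(currentRecord);
    currentRecord.clear();
    // Check the threshold only after a whole record has landed in the
    // page, so a list is never split across a page boundary.
    if (currentPage.size() * Long.BYTES >= pageSizeThreshold) {
      flushPage();
    }
  }

  private void flushPage() {
    // The real writer would also encode the buffered repetition and
    // definition levels here before emitting the page.
    System.out.println("flushing page with " + currentPage.size() + " values");
    currentPage.clear();
  }

  void close() {                                            // flush the tail
    if (!currentRecord.isEmpty()) {
      commitRecord();
    }
    if (!currentPage.isEmpty()) {
      flushPage();
    }
  }
}

The memory cost of the buffering is bounded by the largest single record,
which is where my question above comes from: if a record can be bigger
than a page, this scheme would need some way to spill.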

On Mon, Jun 30, 2014 at 9:10 AM, Jason Altekruse <[email protected]>
wrote:

> Hello Parquet devs,
>
> I have been working more on the Drill implementation of parquet to bring us
> up to full read compatibility as well as implement write support. We are
> using the RecordConsumer interface to write data into parquet files, and it
> seems that we have hit a bug when writing repeated data.
>
> I am currently just doing a simple test with a repeated field at the root
> of the schema. I am writing data pulled in from a JSON file, where each
> record contains one repeated Long column with seven items. The problem
> appears when we hit one of the page thresholds: the ColumnWriterImpl
> writes only the values from a list that fit in the current page, not the
> entire list, so the 'value' within that column is split across two pages.
> I took a look at the source, and it does not look like the
> ColumnWriterImpl actually ensures that a list ends before cutting off the
> page. With the way repetition levels are implemented, I believe the end of
> a list can only be detected by actually reading one value from the next
> list (when the repetition level returns to 0). It seems like the actual
> writes to the page data should be cached inside the writer until it
> determines that the entire list of values will fit in the page.
>
> Is there something that I am missing? I did not find any open issues for
> this, but I will work on a patch to see if I can get it working for me.
>
> Jason Altekruse
> MapR - Software Engineer
> Apache Drill Team
>
