Re: parquet-format parquet.thrift struct ColumnMetaData problem

Nong Li Sat, 23 Jan 2016 16:45:46 -0800

Inline.

On Sat, Jan 23, 2016 at 8:48 AM, Tenghuan He <tenghua...@gmail.com> wrote:


> Hi everyone,
>
> In parquet.thrift the definition of struct ColumnMetaData
>
> 1. The field "path_in_schema" is a list of strings. Shouldn't there be
>    only one path in the schema for a given column? In parquet-hadoop, the
>    corresponding class "ColumnChunkMetaData" has the field "ColumnPath
>    path", which is not a list.
>
The list holds the components of the path. For example,
struct1.struct2.field1 would be a three-element list. This is typically how
the consumer wants to use the path, and it avoids issues such as how to
escape dots.

Each column has a unique path.
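
To illustrate the escaping issue, here is a minimal Python sketch. It is not
Parquet code: join_path is a hypothetical helper, and the field name
containing a dot is an assumed example. It shows that two different paths
collapse to the same string once the components are joined with dots, while
the list form keeps them distinct.

```python
def join_path(parts):
    # Naive flattening: join the components with dots, with no escaping.
    return ".".join(parts)

# Three nested levels, as in the example above.
nested = ["struct1", "struct2", "field1"]

# Hypothetical schema where one field name itself contains a dot.
dotted_name = ["struct1.struct2", "field1"]

# The flattened strings are identical, so the string form is ambiguous.
flat_nested = join_path(nested)
flat_dotted = join_path(dotted_name)

# The list form still tells the two columns apart.
lists_differ = nested != dotted_name
```

Storing the components as a list means no escaping convention has to be
agreed on between writers and readers.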


> 2. The field "codec" represents the compression codec of the column.
>    Why is it not a list? Must all pages in the same column use the same
>    compression codec?
>
> Can anyone explain this?
>
>
Yes, all pages need the same compression. This would be easy to change
(each page can already have a different encoding), but we'd need good
evidence that it helps in practice. We already don't explore all the ways to
use the encodings, and IMO we should move away from general-purpose
compression and just rely on the encodings.
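
As a rough model of that layout, here is a small Python sketch. The class
names Page and ColumnChunk, the COMPRESSORS table, and the sample data are
all assumptions made for illustration, not the parquet-format or
parquet-hadoop API. The point is only that the codec lives once on the
column chunk while each page carries its own encoding.

```python
import gzip
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-ins for two CompressionCodec values.
COMPRESSORS = {
    "UNCOMPRESSED": lambda data: data,
    "GZIP": gzip.compress,
}

@dataclass
class Page:
    encoding: str  # chosen per page, e.g. "PLAIN" or "PLAIN_DICTIONARY"
    data: bytes    # page bytes after encoding, before compression

@dataclass
class ColumnChunk:
    path_in_schema: List[str]
    codec: str     # a single codec shared by every page in this chunk
    pages: List[Page] = field(default_factory=list)

    def compressed_pages(self) -> List[bytes]:
        # Every page goes through the same codec, whatever its encoding.
        compress = COMPRESSORS[self.codec]
        return [compress(page.data) for page in self.pages]

chunk = ColumnChunk(
    path_in_schema=["struct1", "struct2", "field1"],
    codec="GZIP",
    pages=[
        Page("PLAIN", b"a" * 100),
        Page("PLAIN_DICTIONARY", b"b" * 100),
    ],
)
compressed = chunk.compressed_pages()
```

Making the codec per page would mean moving the codec field from the chunk
to the page in a model like this one, which is why the change itself is
small; the open question is whether it pays off in practice.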


>
> Below is the definition snippet of ColumnMetaData in parquet.thrift.
>
> struct ColumnMetaData {
>   ...
>   3: required list<string> path_in_schema
>
>   4: required CompressionCodec codec
>   ...
> }
>
> Thanks & Best Regards
>
> ——————————
>
> Tenghuan He
>
