Inline.

On Sat, Jan 23, 2016 at 8:48 AM, Tenghuan He <tenghua...@gmail.com> wrote:

> Hi everyone,
>
> In parquet.thrift the definition of struct ColumnMetaData
>
> 1. The field "path_in_schema" is a string list, should not there be only
>    one path in the schema for a specified column? And in parquet-hadoop the
>    corresponding class "ColumnChunkMetaData" there is the field "ColumnPath
>    path", which is not a list.

The list holds the pieces of the path. For example, struct1.struct2.field1
would be a three-element list. This is typically how the consumer wants to
use the path, and it avoids issues like how to escape dots and whatnot.
Each column has a unique path.

> 2. The field "codec" which represents the compression codec of the column,
>    why is it not a list? Must all pages in the same column use the same
>    compression codec?
>
> Can anyone explain this?

Yes, all pages need the same compression. This would be easy to change
(each page can already have a different encoding), but we'd need some good
evidence that this helps in practice. We already don't explore all the ways
to use the encodings and, IMO, we should move away from general-purpose
compression and just rely on the encodings.

> Below is the definition snippet of ColumnMetaData in parquet.thrift.
>
> struct ColumnMetaData {
>   ...
>   3: required list<string> path_in_schema
>
>   4: required CompressionCodec codec
>   ...
> }
>
> Thanks & Best Regards
>
> ——————————
>
> Tenghuan He
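
To illustrate the point about path pieces and escaping, here is a minimal Python sketch. It is not the parquet-mr API: the helper `to_dotted` and the two example columns are hypothetical, and it assumes a schema in which a field name may itself contain a dot. It only shows that joining pieces into a dotted string can make two distinct columns collide, while the list form keeps them apart.

```python
# Hypothetical illustration only; none of these names come from Parquet itself.

def to_dotted(path_parts):
    """Join path pieces into a single dotted string (lossy if a piece contains a dot)."""
    return ".".join(path_parts)

# Assumed example: a top-level field literally named "a.b" ...
flat_field = ["a.b"]
# ... and a field "b" nested inside a struct "a".
nested_field = ["a", "b"]

# As dotted strings the two columns are indistinguishable.
same_as_string = to_dotted(flat_field) == to_dotted(nested_field)

# As lists of pieces they stay distinct, with no escaping rules needed.
same_as_list = flat_field == nested_field

# Splitting the dotted string cannot recover the single-piece path.
recovered = to_dotted(flat_field).split(".")
```

With the list form, a consumer can walk the schema one piece at a time without having to parse or unescape anything.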