Hi,
I've been doing some work to get the Parquet read path going for the
Python Iceberg <https://github.com/apache/incubator-iceberg> library. I
have two questions that I couldn't figure out, and was hoping to get
some guidance from the list here.
First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
appears that only limited information is available in the ColumnSchema
objects passed back to the Python client[2]:
<ParquetColumnSchema>
  name: key
  path: m.map.key
  max_definition_level: 2
  max_repetition_level: 1
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: key
  path: m.map.value.map.key
  max_definition_level: 4
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: value
  path: m.map.value.map.value
  max_definition_level: 5
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8
where physical_type and logical_type are both plain strings[1]. The Arrow
schema I can get from to_arrow_schema looks to be more expressive (although
maybe I just don't understand the Parquet format well enough):
m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
  child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
    child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
      child 0, key: string
      child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
        child 0, map: list<map: struct<key: string, value: string> not null>
          child 0, map: struct<key: string, value: string>
            child 0, key: string
            child 1, value: string
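For reference, both dumps above come from something like the following
(the file name is just a stand-in):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("nested_map.parquet")  # stand-in path

    # Flat view: one ColumnSchema per leaf column, as printed above.
    for i in range(pf.metadata.num_columns):
        print(pf.schema.column(i))

    # Nested view: the Parquet schema converted to an Arrow schema.
    print(pf.schema.to_arrow_schema())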
It seems like I can infer the structure from the name/path fields, but is
there a more direct way to get at the detailed Parquet schema information?
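To make that concrete, the inference I have in mind is roughly this sketch
(naive: splitting on '.' would mishandle field names that themselves
contain dots):

    def leaf_info(pf):
        # Yield (path segments, physical_type, logical_type) per leaf column.
        for i in range(pf.metadata.num_columns):
            col = pf.schema.column(i)
            # e.g. 'm.map.value.map.key' -> ['m', 'map', 'value', 'map', 'key']
            yield col.path.split("."), col.physical_type, col.logical_type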
Second question: is there a way to push record-level filtering into the
Parquet reader, so that the reader only materializes values that match a
given predicate expression? The predicates would be simple
field-to-literal comparisons (>, >=, ==, <=, <, is null, is not null, !=)
connected with logical operators (AND, OR, NOT).
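The closest built-in hook I've spotted is the filters argument on
ParquetDataset, but as far as I can tell it prunes at partition/row-group
granularity from metadata rather than filtering individual records (the
path and column name below are made up):

    import pyarrow.parquet as pq

    # DNF-style filters: a list of (column, op, literal) tuples.
    dataset = pq.ParquetDataset(
        "/path/to/dataset",                 # made-up path
        filters=[("k", "=", "some_value")], # made-up column/value
    )
    table = dataset.read()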
I've seen that after reading the data in I can use the filtering language
in Gandiva[3] to get filtered record batches, but I was looking for
something lower in the stack if possible.
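For completeness, the Gandiva route I mean looks roughly like this,
adapted from the test in [3]:

    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    table = pa.Table.from_arrays([pa.array([1.0, 2.0, 3000.0])], names=["a"])

    builder = gandiva.TreeExprBuilder()
    node_a = builder.make_field(table.schema.field("a"))
    thousand = builder.make_literal(1000.0, pa.float64())
    cond = builder.make_function("less_than", [node_a, thousand], pa.bool_())
    condition = builder.make_condition(cond)

    # Evaluates per record batch and returns the selected row indices;
    # note this runs only after the batches have already been read in.
    filt = gandiva.make_filter(table.schema, condition)
    selected = filt.evaluate(table.to_batches()[0], pa.default_memory_pool())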
[1]
https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
[2] Spark/Hive table DDL for this Parquet file:
    CREATE TABLE `iceberg`.`nested_map` (
      m map<string,map<string,string>>)
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100