Jim Pivarski created ARROW-16348:
------------------------------------

             Summary: ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back
                 Key: ARROW-16348
                 URL: https://issues.apache.org/jira/browse/ARROW-16348
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
         Environment: pyarrow 7.0.0 installed via pip.
            Reporter: Jim Pivarski
I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when {{use_compliant_nested_type=True}}. Consider this writer.py:

{code:python}
import json

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")

    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)

    @property
    def num_buffers(self):
        return self.storage_type.num_buffers

    @property
    def num_fields(self):
        return self.storage_type.num_fields


pa.register_extension_type(AnnotatedType(pa.null(), None))

array = pa.Array.from_buffers(
    AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
    3,
    [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
    children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
)

table = pa.table({"": array})
print(table)

pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
{code}

And this reader.py:

{code:python}
import json

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")

    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)

    @property
    def num_buffers(self):
        return self.storage_type.num_buffers

    @property
    def num_fields(self):
        return self.storage_type.num_fields


pa.register_extension_type(AnnotatedType(pa.null(), None))

table = pq.read_table("tmp.parquet")
print(table)
{code}

(The AnnotatedType is the same; I wrote it twice for explicitness.)

When writer.py has {{use_compliant_nested_type=False}}, the output is

{code:none}
% python writer.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
{code}

In other words, the AnnotatedType is preserved. When {{use_compliant_nested_type=True}}, however,

{code:none}
% rm tmp.parquet
rm: remove regular file 'tmp.parquet'? y
% python writer.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py
pyarrow.Table
: list<element: double>
  child 0, element: double
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
{code}

The issue doesn't seem to be in the writing, but in the reading: regardless of whether {{use_compliant_nested_type}} is {{True}} or {{False}}, I can see the extension metadata in the Parquet → Arrow converted schema. With {{use_compliant_nested_type=False}}:

{code:none}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<item: double>
  child 0, item: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
{code}

versus, with {{use_compliant_nested_type=True}}:

{code:none}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<element: double>
  child 0, element: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
{code}

Note that the first has "{{item: double}}" and the second has "{{element: double}}".

(I'm also rather surprised that {{use_compliant_nested_type=False}} is an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)