Wes and co.,
First off, great project! I was able to read the docs and get going in
under a day; the APIs are super easy to use. That being said, I'm a tad
stuck and, having exhausted my google-fu, am here to ask for assistance.
I want to use pyarrow to write a nested dataset to Parquet. The schema is
quite complex, and I'm having difficulty building arrays for nested data
structures. For example, a column in my schema looks like this:
In [7]: schema
Out[7]:
cstruct: struct<field1: double, field2: struct<field1: string>, field3:
list<item: int32>, field4: list<struct: struct<field1: int32>>>
child 0, field1: double
child 1, field2: struct<field1: string>
child 0, field1: string
child 2, field3: list<item: int32>
child 0, item: int32
child 3, field4: list<struct: struct<field1: int32>>
child 0, struct: struct<field1: int32>
child 0, field1: int32
How would I go about constructing a row with this type? I've been looking at
StructArray and ListArray. I've found the following links during my
research:
* https://github.com/apache/arrow/issues/1217
* https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python
* https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b
I've managed to wrangle everything but ListArrays, e.g.:
field1_data = pa.array([1.1], type=pa.float64())
field2_data = pa.StructArray.from_arrays(['field1'], [pa.array(['foo'],
type=pa.string())])
field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))
I'm having trouble with field4:
field4_struct = pa.StructArray.from_arrays(['field1'], [pa.array([1],
type=pa.int32())])
field4_data = pa.ListArray.from_arrays(??, field4_struct)
In particular, what does the offset value mean, and how do I populate it?
Thanks in advance for all the help.
-- Ishaan