Writing nested parquet data using pyarrow

Ishaan Joshi Thu, 01 Feb 2018 00:46:55 -0800

Wes and co.,

First off, great project! I was able to read the docs and get going in
under a day; the APIs are super easy to use. That being said, I'm a tad
stuck and, having exhausted my google-fu, am here to ask for assistance. I
want to use pyarrow to write a nested dataset in parquet. The schema is
quite complex, and I'm having difficulty constructing arrays for the nested
data structures. For example, a column in my schema looks like this:


In [7]: schema

Out[7]:
cstruct: struct<field1: double, field2: struct<field1: string>, field3: list<item: int32>, field4: list<struct: struct<field1: int32>>>
  child 0, field1: double
  child 1, field2: struct<field1: string>
      child 0, field1: string
  child 2, field3: list<item: int32>
      child 0, item: int32
  child 3, field4: list<struct: struct<field1: int32>>
      child 0, struct: struct<field1: int32>
          child 0, field1: int32

How would I go about constructing a row with this type? I've been looking at
StructArray and ListArray. I've found the following links during my
research:

* https://github.com/apache/arrow/issues/1217

* https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python

* https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b

I've managed to wrangle everything but ListArrays, e.g.:

field1_data = pa.array([1.1], type=pa.float64())

field2_data = pa.StructArray.from_arrays(['field1'], [pa.array(['foo'],
type=pa.string())])

field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))

I'm having trouble with field4:

field4_struct = pa.StructArray.from_arrays(['field1'], [pa.array([1],
type=pa.int32())])

field4_data = pa.ListArray.from_arrays(??, field4_struct)

In particular, what does the offset value mean, and how do I populate it?

Thanks in advance for all the help.

-- Ishaan
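
For readers with the same question about offsets, here is a minimal sketch
of how list offsets are generally laid out in Arrow. It is not taken from
this thread. The helper lists_from_offsets is hypothetical and exists only
to illustrate the slicing rule, and the sample values are made up. The
optional pyarrow part assumes a recent release in which
pa.StructArray.from_arrays takes the child arrays first and the field names
via names=, which differs from the argument order used in the message above.

```python
def lists_from_offsets(offsets, values):
    """Hypothetical helper: rebuild nested lists from a flat values
    sequence, using the slicing rule that Arrow list offsets describe."""
    # List i spans values[offsets[i]:offsets[i + 1]], so N lists need N + 1 offsets.
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Three structs stored flat (made-up sample data).
flat_values = [{"field1": 1}, {"field1": 2}, {"field1": 3}]

# Row 0 takes values[0:2], row 1 is empty (values[2:2]), row 2 takes values[2:3].
offsets = [0, 2, 2, 3]

rows = lists_from_offsets(offsets, flat_values)

# Optional cross-check against pyarrow, skipped when it is not installed.
try:
    import pyarrow as pa
except ImportError:
    pa = None

if pa is not None:
    # Assumption: recent pyarrow signature, from_arrays(arrays, names=...).
    field4_struct = pa.StructArray.from_arrays(
        [pa.array([1, 2, 3], type=pa.int32())],
        names=["field1"],
    )
    field4_data = pa.ListArray.from_arrays(
        pa.array(offsets, type=pa.int32()),
        field4_struct,
    )
    assert field4_data.to_pylist() == rows
```

Under this reading, the offsets array always starts at 0, never decreases,
and ends at the length of the flat values array; a repeated offset marks an
empty list.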
