hi Ishaan,

Full support for converting between Arrow's and Parquet's nested data
representation in
https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow is
not yet complete. I have no estimate on when the work will be completed
since I'm not sure who's going to do the work. I will eventually work on
it myself, but I have no urgent need, so if it's me, it may be sometime
later this year. I think it would be a really interesting project for
someone wishing to master both the Arrow format and C++ API and Parquet
nested data encoding.

Separate from that, we could definitely have much better documentation
about different ways to construct nested data in Python.

thanks
Wes

On Thu, Feb 1, 2018 at 3:46 AM, Ishaan Joshi <ish...@apache.org> wrote:
> Wes and co.,
>
> First off, great project! I was able to read the docs and get going in
> under a day; the APIs are super easy to use. That being said, I'm a tad
> stuck and, having exhausted my google-fu, am here to ask for assistance.
> I want to use pyarrow to write a nested dataset in Parquet. The schema is
> quite complex, and I'm having difficulty getting going with arrays for
> nested data structures. For example, a column in my schema looks like this:
>
> In [7]: schema
> Out[7]:
> cstruct: struct<field1: double, field2: struct<field1: string>, field3: list<item: int32>, field4: list<struct: struct<field1: int32>>>
>   child 0, field1: double
>   child 1, field2: struct<field1: string>
>     child 0, field1: string
>   child 2, field3: list<item: int32>
>     child 0, item: int32
>   child 3, field4: list<struct: struct<field1: int32>>
>     child 0, struct: struct<field1: int32>
>       child 0, field1: int32
>
> How would I go about constructing a row with this type? I've been
> looking at StructArray and ListArray.
> I've found the following links during my research:
>
> * https://github.com/apache/arrow/issues/1217
> * https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python
> * https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b
>
> I've managed to wrangle everything but ListArrays, e.g.:
>
> field1_data = pa.array([1.1], type=pa.float64())
> field2_data = pa.StructArray.from_arrays(['field1'], [pa.array(['foo'], type=pa.string())])
> field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))
>
> I'm having trouble with field4:
>
> field4_struct = pa.StructArray.from_arrays(['field1'], [pa.array([1], type=pa.int32())])
> field4_data = pa.ListArray.from_arrays(??, field4_struct)
>
> In particular, what does the offset value mean, and how do I populate it?
>
> Thanks in advance for all the help.
>
> -- Ishaan
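
On the offsets question in the quoted message: in Arrow's list layout, the
offsets array has one more entry than there are rows, and row i is the slice
values[offsets[i]:offsets[i+1]] of the flat child array. The sketch below
illustrates this with made-up rows for field4; the data, the helper
rows_from_offsets, and the variable names are hypothetical. The pyarrow part
assumes a recent release in which StructArray.from_arrays takes the arrays
first and the field names via names= (the argument order used in the quoted
message is different), and it is skipped if pyarrow is not installed.

```python
from itertools import accumulate

# Hypothetical rows for field4: each row is a list of structs.
rows = [
    [{"field1": 1}, {"field1": 2}],  # row 0 holds two structs
    [],                              # row 1 is an empty list
    [{"field1": 3}],                 # row 2 holds one struct
]

# Flatten the child values into one array across all rows.
flat_values = [item["field1"] for row in rows for item in row]

# Offsets are the running total of row lengths, starting at 0,
# so there is one more offset than there are rows.
offsets = [0] + list(accumulate(len(row) for row in rows))


def rows_from_offsets(offsets, values):
    """Rebuild the nested rows from offsets, to show what they encode."""
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]


try:
    import pyarrow as pa
except ImportError:  # keep the sketch runnable without pyarrow
    pa = None

if pa is not None:
    # Assumed modern signature: arrays first, field names as a keyword.
    field4_struct = pa.StructArray.from_arrays(
        [pa.array(flat_values, type=pa.int32())], names=["field1"]
    )
    field4_data = pa.ListArray.from_arrays(
        pa.array(offsets, type=pa.int32()), field4_struct
    )
```

Here offsets comes out as [0, 2, 2, 3]: the repeated 2 marks the empty list
in row 1, and the final entry equals the length of the flat child array.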