hi Ishaan,

Full support for converting between Arrow's and Parquet's nested data
representation in

https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow

is not yet complete. I have no estimate on when the work will be
completed since I'm not sure who's going to do the work. I will
eventually work on it myself, but I have no urgent need, so if it's
me, it may be sometime later this year. I think it would be a really
interesting project for someone wishing to master both the Arrow
format and C++ API and Parquet nested data encoding.

Separate from that, we could definitely have much better documentation
about different ways to construct nested data in Python.

thanks
Wes

On Thu, Feb 1, 2018 at 3:46 AM, Ishaan Joshi <ish...@apache.org> wrote:
> Wes and co.,
>
> First off, great project! I was able to read the docs and get going in
> under a day; the APIs are super easy to use. That being said, I'm a tad
> stuck and, having exhausted my google-fu, am here to ask for assistance.
> I want to use pyarrow to write a nested dataset in Parquet. The schema is
> quite complex, and I'm having difficulty getting going with arrays for
> nested data structures. For example, a column in my schema looks like this:
>
> In [7]: schema
> Out[7]:
> cstruct: struct<field1: double, field2: struct<field1: string>, field3: list<item: int32>, field4: list<struct: struct<field1: int32>>>
>   child 0, field1: double
>   child 1, field2: struct<field1: string>
>       child 0, field1: string
>   child 2, field3: list<item: int32>
>       child 0, item: int32
>   child 3, field4: list<struct: struct<field1: int32>>
>       child 0, struct: struct<field1: int32>
>           child 0, field1: int32
>
> How would I go about constructing a row with this type? I've been looking
> at StructArray and ListArray. I've found the following links during my
> research:
>
> * https://github.com/apache/arrow/issues/1217
> * https://stackoverflow.com/questions/45341182/nested-data-in-parquet-with-python
> * https://github.com/apache/arrow/commit/5c704bce42e3fa71ea4586368962d41173b3e17b
>
> I've managed to wrangle everything but ListArrays, e.g.:
>
> field1_data = pa.array([1.1], type=pa.float64())
> field2_data = pa.StructArray.from_arrays(['field1'], [pa.array(['foo'], type=pa.string())])
> field3_data = pa.array([[1], [2]], type=pa.list_(pa.int32()))
>
> I'm having trouble with field4:
>
> field4_struct = pa.StructArray.from_arrays(['field1'], [pa.array([1], type=pa.int32())])
> field4_data = pa.ListArray.from_arrays(??, field4_struct)
>
> In particular, what does the offset value mean, and how do I populate it?
>
> Thanks in advance for all the help.
>
> -- Ishaan
