I wonder. I am already working in a mixed Rust/Python project. Would it make sense to just use `from_pyarrow` and `to_pyarrow` [1] to handle this filtering on the Rust side? I imagine that may be the best approach?
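Concretely, I'm picturing something like this on the Python side. This is only a sketch of the call boundary: `groups_ext` and `filter_groups` are made-up names for a pyo3 extension module, not an existing package, and `group_table` is the table from the script below:

```py
import pyarrow as pa

import groups_ext  # hypothetical pyo3/maturin extension, for illustration only

# The Rust side would take the column via from_pyarrow, run the filter /
# regroup kernel on arrow-rs arrays, and hand the result back with
# to_pyarrow. Both conversions go through the Arrow C data interface, so
# they should not copy buffers at the boundary.
valid_id1 = pa.array([10, 13, 14, 18], type=pa.int64())
new_groups = groups_ext.filter_groups(group_table['groups'], valid_id1)
```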
[1]: https://docs.rs/arrow/latest/arrow/pyarrow/trait.PyArrowConvert.html

- db

On Sat, Mar 11, 2023 at 11:28 PM Data Bytes <[email protected]> wrote:

> Based on your gist, I came up with the following (which can be executed
> after the end of what you had written; I removed the `if __name__ ==
> '__main__':` guard and ran it with `python -i create_groups_table.py` for
> ease):
>
> ```py
> # As an example, we start with the following groups of id1 values:
> #   [[10], [11], [12]]
> #   [[13, 14, 15], [16, 17, 18]]
> #   [[19, 20]]
> #
> # And say we want to end with:
> #   [[10]]
> #   [[13, 14], [18]]
>
> flat_list = pyarrow.compute.list_flatten(group_table['groups'])
> id1_lists = pyarrow.compute.struct_field(flat_list, 0)
>
> # these indices tell us how to regroup id1 values into lists
> id1_parent_indices = pyarrow.compute.list_parent_indices(id1_lists)
>
> id1_arr = pyarrow.compute.list_flatten(id1_lists)
> id2_arr = pyarrow.compute.struct_field(flat_list, 1)
>
> valid_id1 = np.array([10, 13, 14, 18], dtype=np.int64)
> mask = pyarrow.compute.is_in(id1_arr, pyarrow.array(valid_id1))
> masked_id1_arr = pyarrow.compute.filter(id1_arr, mask)
> masked_id1_parent_indices = pyarrow.compute.filter(id1_parent_indices, mask)
>
> remaining_id1_parents, masked_id1_split_indices = np.unique(
>     masked_id1_parent_indices, return_index=True)
> new_id1_lists = np.split(masked_id1_arr, masked_id1_split_indices[1:])
>
> new_id2_arr = id2_arr.take(remaining_id1_parents)
>
> # these indices tell us how to assign id1 lists to groups
> groups_parent_indices = group_table['groups'].chunks[0].value_parent_indices()
>
> retained_group_indices = groups_parent_indices.take(remaining_id1_parents)
> unique_retained_group_indices, retained_group_indices_splits = np.unique(
>     retained_group_indices, return_index=True)
>
> new_group_id = group_table['group_id'].take(unique_retained_group_indices)
>
> new_groups_flat = pyarrow.compute.make_struct(new_id1_lists, new_id2_arr,
>                                               field_names=['id1', 'id2'])
> new_groups = CreateGroups(np.split(new_groups_flat,
>                                    retained_group_indices_splits[1:]))
>
> new_table = CreateTable(new_group_id, new_groups)
> ```
>
> I think this accomplishes what I want, but I don't have a good sense of
> how it could be made better. If you have ideas, let me know!
>
> - db
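One possible refinement of the regrouping step above, if it's of interest: rebuild the lists with `pa.ListArray.from_arrays` from computed offsets, instead of round-tripping through `np.split`, so everything stays in Arrow. A toy sketch using the same example values (this is just an idea, not code from the gist):

```py
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# A toy list<int64> column standing in for the id1 lists above.
id1_lists = pa.array([[10], [11], [12], [13, 14, 15], [16, 17, 18], [19, 20]],
                     type=pa.list_(pa.int64()))

id1_arr = pc.list_flatten(id1_lists)
id1_parent_indices = pc.list_parent_indices(id1_lists)

mask = pc.is_in(id1_arr, value_set=pa.array([10, 13, 14, 18], type=pa.int64()))
masked_id1_arr = pc.filter(id1_arr, mask)
masked_parents = pc.filter(id1_parent_indices, mask).to_numpy()

# Count surviving values per parent list, then turn the counts into int32
# offsets; lists that lose all of their values simply disappear.
remaining_parents, counts = np.unique(masked_parents, return_counts=True)
offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
new_id1_lists = pa.ListArray.from_arrays(pa.array(offsets), masked_id1_arr)

# remaining_parents plays the same role as remaining_id1_parents above,
# e.g. new_id2_arr = id2_arr.take(remaining_parents)
print(new_id1_lists)  # -> [[10], [13, 14], [18]]
```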
> On Fri, Mar 10, 2023 at 6:12 PM Aldrin <[email protected]> wrote:
>
>> The first solution that comes to mind is:
>> * don't flatten the structure, just walk the structure
>> * create a new version of each id1 list that contains the data you want
>>   to **keep**
>> * construct a new table containing all of the old data, but replace the
>>   id1 structs with their reduced versions
>> * after constructing the new table, you can delete the old id1 lists to
>>   free up memory
>>
>> I think using `take` on the id1 list will maintain the structure for you.
>> The tricky part is definitely keeping the structure of the groups list.
>> You could potentially use `value_parent_indices` ([1]) on the groups list
>> to gather each resulting item (struct<modified id1 list, id2 struct>) into
>> the correct ListArray structure. Essentially, you:
>> * use `value_parent_indices` to create a list of lists
>> * filter each item's id1 list as needed
>> * then gather the new items into the appropriate groups list (list of
>>   lists) based on `value_parent_indices`
>> * construct the table from the old `group_ids` and the new `groups`
>>
>> I'm not sure this is a good solution, but it seems vaguely like a
>> reasonable baseline. I think you rely on functions on the id1 list to
>> move data efficiently (or avoid moving it). I think reconstructing as I
>> mentioned above should be mostly zero-copy.
>>
>> This probably isn't easy to follow... so I decided to write some code in
>> a gist [2]. I can get around to doing the filtering part and creating
>> reduced id1 lists next week if it'd be helpful, but maybe the description
>> above and the excerpt in the gist are collectively helpful enough?
>>
>> Good luck!
>>
>> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.value_parent_indices
>> [2]: https://gist.github.com/drin/b8fe9c5bb239b27bece81d817c83f0f4
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Fri, Mar 10, 2023 at 2:44 PM Data Bytes <[email protected]> wrote:
>>
>>> I have data in the following format:
>>>
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.compute as pc
>>> import numpy as np
>>>
>>> t = pq.read_table('/tmp/example.parquet')
>>> t.schema
>>> group_id: double
>>> groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
>>>   child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
>>>     child 0, id1: list<item: int64>
>>>       child 0, item: int64
>>>     child 1, id2: struct<type: string, value: string>
>>>       child 0, type: string
>>>       child 1, value: string
>>>
>>> I want to efficiently (i.e., fast, with minimal copying) eliminate some
>>> of the values in the id1 lists, and eliminate entire group items if no
>>> values are left in the corresponding id1 list.
>>>
>>> So far I am able to do the following, which seems efficient:
>>>
>>> flat_list = pc.list_flatten(t['groups'])
>>> flat_list.type
>>> StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
>>>
>>> id1_lists = pc.struct_field(flat_list, 0)
>>> id1_lists.type
>>> ListType(list<item: int64>)
>>>
>>> parent_indices = pc.list_parent_indices(id1_lists)
>>> id1_arr = pc.list_flatten(id1_lists)
>>> id1_arr.type
>>> DataType(int64)
>>>
>>> From here I am able to mask id1_arr and parent_indices:
>>>
>>> # for example, I typically have 0.1% of the data survive this step
>>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
>>> masked_id1_arr = pc.filter(id1_arr, mask)
>>> masked_parent_indices = pc.filter(parent_indices, mask)
>>>
>>> len(id1_arr)
>>> 1309610
>>> len(masked_id1_arr)
>>> 1343
>>>
>>> But I am stuck on how to reconstruct the original schema efficiently.
>>> Any ideas are appreciated! An example data file can be found here:
>>> https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
>>>
>>> - db
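PS: the outer list<struct> level can be rebuilt the same way, from offsets over the surviving items, rather than through `np.split`. Another toy sketch; the id2 values here are made up, and `retained_group_indices` corresponds to the variable of the same name in my script above:

```py
import numpy as np
import pyarrow as pa

# Toy stand-ins for the surviving pieces: three struct<id1, id2> items
# whose original (outer) group indices were [0, 1, 1].
new_id1_lists = pa.array([[10], [13, 14], [18]], type=pa.list_(pa.int64()))
new_id2_arr = pa.array(
    [{'type': 't', 'value': 'a'},
     {'type': 't', 'value': 'b'},
     {'type': 't', 'value': 'c'}],
    type=pa.struct([('type', pa.string()), ('value', pa.string())]))
retained_group_indices = np.array([0, 1, 1])

# One struct<id1: list<int64>, id2: struct<...>> per surviving item.
new_items = pa.StructArray.from_arrays([new_id1_lists, new_id2_arr],
                                       names=['id1', 'id2'])

# Count items per surviving group and rebuild the outer ListArray; groups
# that lost all of their items drop out here as well.
surviving_groups, counts = np.unique(retained_group_indices, return_counts=True)
offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
new_groups = pa.ListArray.from_arrays(pa.array(offsets), new_items)

# surviving_groups then selects the matching group ids, e.g.:
# new_group_id = group_table['group_id'].take(pa.array(surviving_groups))
print(new_groups.type)  # list<item: struct<id1: list<item: int64>, id2: struct<...>>>
```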
