I wonder. I am already working in a mixed Rust/Python project. Would it make sense to just use `from_pyarrow` and `to_pyarrow` [1] to handle this filtering on the Rust side? I imagine that may be the best approach?
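Concretely, I'm picturing something like this on the Python side. This is only a sketch of the call boundary: `groups_ext` and `filter_groups` are made-up names for a pyo3 extension module, not an existing package, and `group_table` is the table from the script below:

```py
import pyarrow as pa

import groups_ext  # hypothetical pyo3/maturin extension, for illustration only

# The Rust side would take the column via from_pyarrow, run the filter /
# regroup kernel on arrow-rs arrays, and hand the result back with
# to_pyarrow. Both conversions go through the Arrow C data interface, so
# they should not copy buffers at the boundary.
valid_id1 = pa.array([10, 13, 14, 18], type=pa.int64())
new_groups = groups_ext.filter_groups(group_table['groups'], valid_id1)
```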
[1]: https://docs.rs/arrow/latest/arrow/pyarrow/trait.PyArrowConvert.html

- db

On Sat, Mar 11, 2023 at 11:28 PM Data Bytes <[email protected]> wrote:

> Based on your gist, I came up with the following (which can be executed
> after the end of what you had written; I removed the `if __name__ ==
> '__main__':` guard and ran it with `python -i create_groups_table.py` for
> ease):
>
> ```py
> # As an example, we start with the following groups of id1 values:
> #   [[10], [11], [12]]
> #   [[13, 14, 15], [16, 17, 18]]
> #   [[19, 20]]
> #
> # And say we want to end with:
> #   [[10]]
> #   [[13, 14], [18]]
>
> flat_list = pyarrow.compute.list_flatten(group_table['groups'])
> id1_lists = pyarrow.compute.struct_field(flat_list, 0)
>
> # these indices tell us how to regroup id1 values into lists
> id1_parent_indices = pyarrow.compute.list_parent_indices(id1_lists)
>
> id1_arr = pyarrow.compute.list_flatten(id1_lists)
> id2_arr = pyarrow.compute.struct_field(flat_list, 1)
>
> valid_id1 = np.array([10, 13, 14, 18], dtype=np.int64)
> mask = pyarrow.compute.is_in(id1_arr, pyarrow.array(valid_id1))
> masked_id1_arr = pyarrow.compute.filter(id1_arr, mask)
> masked_id1_parent_indices = pyarrow.compute.filter(id1_parent_indices, mask)
>
> remaining_id1_parents, masked_id1_split_indices = np.unique(
>     masked_id1_parent_indices, return_index=True)
> new_id1_lists = np.split(masked_id1_arr, masked_id1_split_indices[1:])
>
> new_id2_arr = id2_arr.take(remaining_id1_parents)
>
> # these indices tell us how to assign id1 lists to groups
> groups_parent_indices = group_table['groups'].chunks[0].value_parent_indices()
>
> retained_group_indices = groups_parent_indices.take(remaining_id1_parents)
> unique_retained_group_indices, retained_group_indices_splits = np.unique(
>     retained_group_indices, return_index=True)
>
> new_group_id = group_table['group_id'].take(unique_retained_group_indices)
>
> new_groups_flat = pyarrow.compute.make_struct(new_id1_lists, new_id2_arr,
>                                               field_names=['id1', 'id2'])
> new_groups = CreateGroups(np.split(new_groups_flat,
>                                    retained_group_indices_splits[1:]))
>
> new_table = CreateTable(new_group_id, new_groups)
> ```
>
> I think this accomplishes what I want, but I don't have a good sense of
> how it could be made better. If you have ideas, let me know!
>
> - db
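One possible refinement of the regrouping step above, if it's of interest: rebuild the lists with `pa.ListArray.from_arrays` from computed offsets, instead of round-tripping through `np.split`, so everything stays in Arrow. A toy sketch using the same example values (this is just an idea, not code from the gist):

```py
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# A toy list<int64> column standing in for the id1 lists above.
id1_lists = pa.array([[10], [11], [12], [13, 14, 15], [16, 17, 18], [19, 20]],
                     type=pa.list_(pa.int64()))

id1_arr = pc.list_flatten(id1_lists)
id1_parent_indices = pc.list_parent_indices(id1_lists)

mask = pc.is_in(id1_arr, value_set=pa.array([10, 13, 14, 18], type=pa.int64()))
masked_id1_arr = pc.filter(id1_arr, mask)
masked_parents = pc.filter(id1_parent_indices, mask).to_numpy()

# Count surviving values per parent list, then turn the counts into int32
# offsets; lists that lose all of their values simply disappear.
remaining_parents, counts = np.unique(masked_parents, return_counts=True)
offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
new_id1_lists = pa.ListArray.from_arrays(pa.array(offsets), masked_id1_arr)

# remaining_parents plays the same role as remaining_id1_parents above,
# e.g. new_id2_arr = id2_arr.take(remaining_parents)
print(new_id1_lists)  # -> [[10], [13, 14], [18]]
```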
> On Fri, Mar 10, 2023 at 6:12 PM Aldrin <[email protected]> wrote:
>
>> The first solution that comes to mind is:
>> * don't flatten the structure, just walk the structure
>> * create a new version of each id1 list that contains the data you want
>>   to **keep**
>> * construct a new table containing all of the old data, but replace the
>>   id1 structs with their reduced versions
>> * after constructing the new table, you can delete the old id1 lists to
>>   free up memory
>>
>> I think using `take` on the id1 list will maintain the structure for you.
>> The tricky part is definitely keeping the structure of the groups list.
>> You could potentially use `value_parent_indices` ([1]) on the groups list
>> to gather each resulting item (struct<modified id1 list, id2 struct>) into
>> the correct ListArray structure. Essentially, you:
>> * use `value_parent_indices` to create a list of lists
>> * filter each item's id1 list as needed
>> * then gather the new items into the appropriate groups list (list of
>>   lists) based on `value_parent_indices`
>> * construct the table from the old `group_ids` and the new `groups`
>>
>> I'm not sure this is a good solution, but it seems vaguely like a
>> reasonable baseline. I think you rely on functions on the id1 list to
>> move data efficiently (or avoid moving it). I think reconstructing as I
>> mentioned above should be mostly zero-copy.
>>
>> This probably isn't easy to follow... so I decided to write some code in
>> a gist [2]. I can get around to doing the filtering part and creating
>> reduced id1 lists next week if it'd be helpful, but maybe the description
>> above and the excerpt in the gist are collectively helpful enough?
>>
>> Good luck!
>>
>> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.value_parent_indices
>> [2]: https://gist.github.com/drin/b8fe9c5bb239b27bece81d817c83f0f4
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Fri, Mar 10, 2023 at 2:44 PM Data Bytes <[email protected]> wrote:
>>
>>> I have data in the following format:
>>>
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.compute as pc
>>> import numpy as np
>>>
>>> t = pq.read_table('/tmp/example.parquet')
>>> t.schema
>>> group_id: double
>>> groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
>>>   child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
>>>     child 0, id1: list<item: int64>
>>>       child 0, item: int64
>>>     child 1, id2: struct<type: string, value: string>
>>>       child 0, type: string
>>>       child 1, value: string
>>>
>>> I want to efficiently (i.e., fast, with minimal copying) eliminate some
>>> of the values in the id1 lists, and eliminate entire group items if no
>>> values are left in the corresponding id1 list.
>>>
>>> So far I am able to do the following, which seems efficient:
>>>
>>> flat_list = pc.list_flatten(t['groups'])
>>> flat_list.type
>>> StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
>>>
>>> id1_lists = pc.struct_field(flat_list, 0)
>>> id1_lists.type
>>> ListType(list<item: int64>)
>>>
>>> parent_indices = pc.list_parent_indices(id1_lists)
>>> id1_arr = pc.list_flatten(id1_lists)
>>> id1_arr.type
>>> DataType(int64)
>>>
>>> From here I am able to mask id1_arr and parent_indices:
>>>
>>> # for example, I typically have 0.1% of the data survive this step
>>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
>>> masked_id1_arr = pc.filter(id1_arr, mask)
>>> masked_parent_indices = pc.filter(parent_indices, mask)
>>>
>>> len(id1_arr)
>>> 1309610
>>> len(masked_id1_arr)
>>> 1343
>>>
>>> But I am stuck on how to reconstruct the original schema efficiently.
>>> Any ideas are appreciated! An example data file can be found here:
>>> https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
>>>
>>> - db
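PS: the outer list<struct> level can be rebuilt the same way, from offsets over the surviving items, rather than through `np.split`. Another toy sketch; the id2 values here are made up, and `retained_group_indices` corresponds to the variable of the same name in my script above:

```py
import numpy as np
import pyarrow as pa

# Toy stand-ins for the surviving pieces: three struct<id1, id2> items
# whose original (outer) group indices were [0, 1, 1].
new_id1_lists = pa.array([[10], [13, 14], [18]], type=pa.list_(pa.int64()))
new_id2_arr = pa.array(
    [{'type': 't', 'value': 'a'},
     {'type': 't', 'value': 'b'},
     {'type': 't', 'value': 'c'}],
    type=pa.struct([('type', pa.string()), ('value', pa.string())]))
retained_group_indices = np.array([0, 1, 1])

# One struct<id1: list<int64>, id2: struct<...>> per surviving item.
new_items = pa.StructArray.from_arrays([new_id1_lists, new_id2_arr],
                                       names=['id1', 'id2'])

# Count items per surviving group and rebuild the outer ListArray; groups
# that lost all of their items drop out here as well.
surviving_groups, counts = np.unique(retained_group_indices, return_counts=True)
offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)
new_groups = pa.ListArray.from_arrays(pa.array(offsets), new_items)

# surviving_groups then selects the matching group ids, e.g.:
# new_group_id = group_table['group_id'].take(pa.array(surviving_groups))
print(new_groups.type)  # list<item: struct<id1: list<item: int64>, id2: struct<...>>>
```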
