[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-14772:
--------------------------------
    Labels: scanner  (was: )

> [Python] unexpected content after groupby on a dataframe restored from partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
>              Labels: scanner
>
> While experimenting with partitioned dataset persistence in parquet, I
> stumbled upon an interesting feature (or bug?): after restoring only a
> certain partition and applying groupby, I suddenly get all the filtered-out
> rows back in the dataframe.
> The following code demonstrates the issue:
> {code:python}
> import numpy as np
> import os
> import pandas as pd  # 1.3.4
> import pyarrow as pa  # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
>
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
>
>     # 2. persist data frame partitioned by date and label
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
>
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[('date', '=', str(start_date))], use_legacy_dataset=False)
>     print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
>
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
>
>     shutil.rmtree(dataset_dir)
> {code}
> It correctly reports five unique dates upon random df generation and
> correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0    2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']
> {noformat}
> although it notes that the column still has 5 categories. When I subsequently
> perform a groupby, all the dates that were filtered out at read time
> miraculously reappear:
> {code:python}
> group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
> print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0
> {noformat}
> Perhaps I am doing something wrong in the read_parquet call, but I would
> expect the filtered-out data to simply be gone after the read operation.
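
The "Categories (5, object)" line in the reported output hints at what may be happening. The `date` and `label` partition columns come back as pandas categoricals whose category lists still hold every partition value, and in pandas 1.3.4 `groupby` on categorical keys defaults to `observed=False`, which emits one row per category combination, including combinations with no rows. The sketch below reproduces that effect with pandas alone, without parquet. The frame contents (`all_dates`, `all_labels`, and the values) are made-up stand-ins rather than the reporter's data, and whether this fully accounts for the report is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the restored frame: only one date survives the
# filter, but the categorical dtype still lists every partition value.
all_dates = ["2021-11-13", "2021-11-14", "2021-11-15"]
all_labels = ["AAA", "BBB"]
restored_df = pd.DataFrame({
    "date": pd.Categorical(["2021-11-13"] * 3, categories=all_dates),
    "label": pd.Categorical(["AAA", "BBB", "AAA"], categories=all_labels),
    "value": [10, 20, 30],
})

# observed=False (the pandas 1.3.4 default) yields every date/label
# combination; groups with no rows get a sum of 0.
full = (
    restored_df.groupby(["date", "label"], observed=False)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# observed=True keeps only the combinations actually present in the data.
observed_only = (
    restored_df.groupby(["date", "label"], observed=True)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# Alternative: drop the unused categories right after reading.
trimmed = restored_df.assign(
    date=restored_df["date"].cat.remove_unused_categories()
)

print(full)
print(observed_only)
print(trimmed["date"].cat.categories)
```

If that is indeed the cause, passing `observed=True` to `groupby`, or calling `.cat.remove_unused_categories()` on the partition columns after the read, would be two ways to keep only the rows that survived the filter.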