[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vadim Mironov updated ARROW-14772:
----------------------------------
    Description:
While experimenting with partitioned dataset persistence in parquet, I stumbled upon an interesting feature (or bug?): after restoring only a certain partition and applying groupby, I suddenly get all the filtered-out rows back in the dataframe. The following code demonstrates the issue:

{code:java}
import numpy as np
import os
import pandas as pd  # 1.3.4
import pyarrow as pa  # 6.0.1
import random
import shutil
import string
import tempfile
from datetime import datetime, timedelta

if __name__ == '__main__':
    # 1. generate random data frame
    day_count = 5
    data_length = 10
    numpy_random_gen = np.random.default_rng()
    label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
                     for _ in range(5)]
    partial_dfs = []
    start_date = datetime.today().date() - timedelta(days=day_count)
    for date in (start_date + timedelta(n) for n in range(day_count)):
        date_array = pd.to_datetime(np.full(data_length, date)).date
        label_array = np.full(data_length,
                              [random.choice(label_choices) for _ in range(data_length)])
        value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
        partial_dfs.append(pd.DataFrame(data={'date': date_array,
                                              'label': label_array,
                                              'value': value_array}))
    df = pd.concat(partial_dfs, ignore_index=True)
    print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")

    # 2. persist data frame partitioned by date and label
    dataset_dir = tempfile.mkdtemp()
    df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])

    # 3. restore from parquet partitioned dataset
    restored_df = pd.read_parquet(dataset_dir, engine='pyarrow',
                                  filters=[('date', '=', str(start_date))],
                                  use_legacy_dataset=False)
    print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")

    group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
    print(group_by_df)

    shutil.rmtree(dataset_dir)
{code}

It correctly reports five unique dates upon random df generation and correctly reports only one after reading back from parquet:

{noformat}
Unique dates after restore:
0    2021-11-13
Name: date, dtype: category
Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']{noformat}

It does, however, note that there are still 5 categories. When I subsequently perform a groupby, all the dates that were filtered out at read time miraculously reappear:

{code:java}
group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
print(group_by_df)
{code}

With the following output:

{noformat}
          date     label  val_sum
0   2021-11-13  04LOXJCH      494
1   2021-11-13  4QOZ321D      819
2   2021-11-13  GG6YO5FS      394
3   2021-11-13  J7ZD3LDS      203
4   2021-11-13  TFVIXE6L      164
5   2021-11-14  04LOXJCH        0
6   2021-11-14  4QOZ321D        0
7   2021-11-14  GG6YO5FS        0
8   2021-11-14  J7ZD3LDS        0
9   2021-11-14  TFVIXE6L        0
10  2021-11-15  04LOXJCH        0
11  2021-11-15  4QOZ321D        0
12  2021-11-15  GG6YO5FS        0
13  2021-11-15  J7ZD3LDS        0
14  2021-11-15  TFVIXE6L        0
15  2021-11-16  04LOXJCH        0
16  2021-11-16  4QOZ321D        0
17  2021-11-16  GG6YO5FS        0
18  2021-11-16  J7ZD3LDS        0
19  2021-11-16  TFVIXE6L        0
20  2021-11-17  04LOXJCH        0
21  2021-11-17  4QOZ321D        0
22  2021-11-17  GG6YO5FS        0
23  2021-11-17  J7ZD3LDS        0
24  2021-11-17  TFVIXE6L        0{noformat}

Perhaps I am doing something incorrectly within the read_parquet call, but my expectation would be for the filtered data to simply be gone after the read operation.

> [Python] unexpected content after groupby on a dataframe restored from
> partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
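A possible explanation, judging from the "dtype: category" and "Categories (5, object)" lines in the output above: the partition columns come back as pandas categoricals that still list every partition value, and a groupby over categorical keys with observed=False (the default in pandas 1.3.4) emits a row for every category combination, with a sum of 0 for the empty ones. The sketch below reproduces this with pandas alone. The small frame is a hypothetical stand-in for restored_df, not the reporter's data, and the two workarounds shown (observed=True and remove_unused_categories) are suggestions rather than a confirmed fix for this issue.

```python
import pandas as pd

# Hypothetical stand-in for the restored frame: only one date survived the
# filter, but the categorical dtype still lists every partition value.
dates = pd.Categorical(
    ["2021-11-13"] * 4,
    categories=["2021-11-13", "2021-11-14", "2021-11-15"],
)
labels = pd.Categorical(["A", "B", "A", "B"], categories=["A", "B"])
restored_df = pd.DataFrame({"date": dates, "label": labels, "value": [1, 2, 3, 4]})

# observed=False: one row per category combination, including unused dates.
all_groups = (
    restored_df.groupby(["date", "label"], observed=False)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# Workaround 1: group only over category values that actually occur.
observed_groups = (
    restored_df.groupby(["date", "label"], observed=True)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# Workaround 2: drop the unused categories right after reading.
trimmed_df = restored_df.assign(
    date=restored_df["date"].cat.remove_unused_categories()
)
trimmed_groups = (
    trimmed_df.groupby(["date", "label"], observed=False)["value"]
    .sum()
    .reset_index(name="val_sum")
)

print(len(all_groups), len(observed_groups), len(trimmed_groups))
```

Passing observed explicitly keeps the example independent of the installed pandas version's default.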
-- This message was sent by Atlassian Jira (v8.20.1#820001)