[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446679#comment-17446679 ]

Vadim Mironov commented on ARROW-14772:
---------------------------------------

Thanks a lot Weston, I see. It restores the date column as a categorical type 
with all possible values in the dictionary. On your point regarding API 
expectations - it's a fair one. As you can see from my initial confusion, I 
expected the filter parameter to act as a true filter and leave no data in the 
data frame other than what I specified. I am not sure whether this would be 
the expectation of the majority of API users for partitioned datasets, but if 
so, then perhaps the behaviour where all categories are restored should be 
available only via a dedicated argument.
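
To illustrate where the zero rows enter the picture, here is a minimal sketch 
on toy data (not the repro below) of what I believe is the pandas side of 
this: groupby over a categorical key emits a row for every category in the 
dictionary by default, and observed=True is the opt-out.

{code:java}
import pandas as pd

# Toy data, assumed for illustration: one observed value 'a' out of
# three declared categories.
keys = pd.Categorical(['a'], categories=['a', 'b', 'c'])
df = pd.DataFrame({'key': keys, 'value': [1]})

# Default: one row per declared category, unobserved ones summed to 0.
print(df.groupby('key')['value'].sum())

# observed=True: only the categories actually present in the data.
print(df.groupby('key', observed=True)['value'].sum())
{code}

If that is indeed the mechanism, observed=True would at least suppress the 
phantom rows on the groupby side.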

In terms of the practicality of the current behaviour, it would be a bit 
cumbersome to pepper the code working with these filtered dataframes with all 
sorts of safeguards to catch zeroes or NaNs coming from rows that were 
supposed to be filtered out. Besides, if I were to use multiple filters with 
'greater than' or 'in' operators, post-processing of the dataframe would be 
complex and I would essentially need to redo the filter rule logic.
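
To make the multi-filter concern concrete, here is the kind of predicate 
combination I have in mind (the dataset path and the cut-off value are made 
up for illustration; a list of tuples is AND-ed together by pyarrow):

{code:java}
import pandas as pd

# Hypothetical dataset path, for illustration only.
restored_df = pd.read_parquet(
    '/path/to/dataset',
    engine='pyarrow',
    filters=[('date', 'in', ['2021-11-13', '2021-11-14']),
             ('value', '>', 100)],
    use_legacy_dataset=False,
)
{code}

Rebuilding these predicates as a pandas mask after the read would mean 
duplicating exactly this logic by hand.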

Another point, rather minor compared to the considerations outlined above, is 
the conversion of the date column to a categorical type upon recovery. Is my 
assumption correct that this would be true for any column that was used to 
partition the dataset? If so, I am not sure what practical implications it 
may have beyond the need to format dates to strings, but it may also be a 
source of a few surprises.
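
For completeness, the workaround I would reach for is simply casting the 
column back; a sketch on a toy series (assuming the partition column comes 
back as string categories, as in the output quoted below):

{code:java}
import pandas as pd

# Toy series standing in for the restored partition column.
dates = pd.Series(['2021-11-13', '2021-11-14'], dtype='category')

# Cast back to plain datetime.date values before further processing.
as_dates = pd.to_datetime(dates.astype(str)).dt.date
print(as_dates)
{code}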

> [Python] unexpected content after groupby on a dataframe restored from 
> partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major
>              Labels: scanner
>
> While experimenting with the partitioned dataset persistence in parquet, I 
> stumbled upon an interesting feature (or bug?) where after restoring only a 
> certain partition and applying groupby I suddenly get all the filtered rows 
> in the dataframe. 
> The following code demonstrates the issue:
> {code:java}
> import numpy as np
> import os
> import pandas as pd  # 1.3.4
> import pyarrow as pa  # 6.0.1
> import random
> import shutil
> import string
> import tempfile
> from datetime import datetime, timedelta
> 
> if __name__ == '__main__':
>     # 1. generate random data frame
>     day_count = 5
>     data_length = 10
>     numpy_random_gen = np.random.default_rng()
>     label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
>     partial_dfs = []
>     start_date = datetime.today().date() - timedelta(days=day_count)
>     for date in (start_date + timedelta(n) for n in range(day_count)):
>         date_array = pd.to_datetime(np.full(data_length, date)).date
>         label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
>         value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
>         partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
>     df = pd.concat(partial_dfs, ignore_index=True)
>     print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")
>     # 2. persist data frame partitioned by date
>     dataset_dir = tempfile.mkdtemp()
>     df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])
>     # 3. restore from parquet partitioned dataset
>     restored_df = pd.read_parquet(dataset_dir, engine='pyarrow',
>                                   filters=[('date', '=', str(start_date))],
>                                   use_legacy_dataset=False)
>     print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
>     shutil.rmtree(dataset_dir)
> {code}
> It correctly reports five unique dates upon random df generation and 
> correctly reports only one after reading back from parquet:
> {noformat}
> Unique dates after restore:
> 0    2021-11-13
> Name: date, dtype: category
> Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']{noformat}
> It does, however, add that there are 5 categories. When I subsequently 
> perform a groupby, all the dates that were filtered out at read time 
> miraculously reappear:
> {code:java}
>     group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
>     print(group_by_df)
> {code}
> With the following output:
> {noformat}
>           date     label  val_sum
> 0   2021-11-13  04LOXJCH      494
> 1   2021-11-13  4QOZ321D      819
> 2   2021-11-13  GG6YO5FS      394
> 3   2021-11-13  J7ZD3LDS      203
> 4   2021-11-13  TFVIXE6L      164
> 5   2021-11-14  04LOXJCH        0
> 6   2021-11-14  4QOZ321D        0
> 7   2021-11-14  GG6YO5FS        0
> 8   2021-11-14  J7ZD3LDS        0
> 9   2021-11-14  TFVIXE6L        0
> 10  2021-11-15  04LOXJCH        0
> 11  2021-11-15  4QOZ321D        0
> 12  2021-11-15  GG6YO5FS        0
> 13  2021-11-15  J7ZD3LDS        0
> 14  2021-11-15  TFVIXE6L        0
> 15  2021-11-16  04LOXJCH        0
> 16  2021-11-16  4QOZ321D        0
> 17  2021-11-16  GG6YO5FS        0
> 18  2021-11-16  J7ZD3LDS        0
> 19  2021-11-16  TFVIXE6L        0
> 20  2021-11-17  04LOXJCH        0
> 21  2021-11-17  4QOZ321D        0
> 22  2021-11-17  GG6YO5FS        0
> 23  2021-11-17  J7ZD3LDS        0
> 24  2021-11-17  TFVIXE6L        0{noformat}
> Perhaps I am doing something incorrectly within the read_parquet call, but 
> my expectation would be for the filtered data to simply be gone after the 
> read operation.


