[ https://issues.apache.org/jira/browse/ARROW-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vadim Mironov updated ARROW-14772:
----------------------------------
    Description: 
While experimenting with partitioned dataset persistence in parquet, I stumbled upon an interesting feature (or bug?): after restoring only a certain partition and applying a groupby, I suddenly get all the filtered-out rows back in the dataframe.

The following code demonstrates the issue:
{code:python}
import numpy as np
import pandas as pd  # 1.3.4
import pyarrow as pa  # 6.0.1
import random
import shutil
import string
import tempfile

from datetime import datetime, timedelta

if __name__ == '__main__':
    # 1. generate a random data frame
    day_count = 5
    data_length = 10

    numpy_random_gen = np.random.default_rng()
    label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
    partial_dfs = []

    start_date = datetime.today().date() - timedelta(days=day_count)
    for date in (start_date + timedelta(n) for n in range(day_count)):
        date_array = pd.to_datetime(np.full(data_length, date)).date

        label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])

        value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)

        partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))

    df = pd.concat(partial_dfs, ignore_index=True)
    print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")

    # 2. persist the data frame partitioned by date and label
    dataset_dir = tempfile.mkdtemp()

    df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])

    # 3. restore from the partitioned parquet dataset, keeping only one date
    restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[
        ('date', '=', str(start_date))], use_legacy_dataset=False)
    print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")

    group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
    print(group_by_df)

    shutil.rmtree(dataset_dir)
{code}
It correctly reports five unique dates upon random df generation, and correctly reports only a single date after reading back from parquet:
{noformat}
Unique dates after restore:
0    2021-11-13
Name: date, dtype: category
Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']{noformat}
Note, however, that it also reports 5 categories. When I subsequently perform a groupby, all the dates that were filtered out at read time miraculously reappear:
{code:python}
    group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
    print(group_by_df)
{code}
With the following output:
{noformat}
          date     label  val_sum
0   2021-11-13  04LOXJCH      494
1   2021-11-13  4QOZ321D      819
2   2021-11-13  GG6YO5FS      394
3   2021-11-13  J7ZD3LDS      203
4   2021-11-13  TFVIXE6L      164
5   2021-11-14  04LOXJCH        0
6   2021-11-14  4QOZ321D        0
7   2021-11-14  GG6YO5FS        0
8   2021-11-14  J7ZD3LDS        0
9   2021-11-14  TFVIXE6L        0
10  2021-11-15  04LOXJCH        0
11  2021-11-15  4QOZ321D        0
12  2021-11-15  GG6YO5FS        0
13  2021-11-15  J7ZD3LDS        0
14  2021-11-15  TFVIXE6L        0
15  2021-11-16  04LOXJCH        0
16  2021-11-16  4QOZ321D        0
17  2021-11-16  GG6YO5FS        0
18  2021-11-16  J7ZD3LDS        0
19  2021-11-16  TFVIXE6L        0
20  2021-11-17  04LOXJCH        0
21  2021-11-17  4QOZ321D        0
22  2021-11-17  GG6YO5FS        0
23  2021-11-17  J7ZD3LDS        0
24  2021-11-17  TFVIXE6L        0{noformat}
Perhaps I am doing something incorrectly in the read_parquet call, but my expectation would be for the filtered data to simply be gone after the read operation.



> [Python] unexpected content after groupby on a dataframe restored from 
> partitioned parquet with filters
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14772
>                 URL: https://issues.apache.org/jira/browse/ARROW-14772
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Vadim Mironov
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
