[jira] [Comment Edited] (ARROW-16495) [Python] Scanner.count_rows() doesn't properly handle null expressions

Alenka Frim (Jira) Mon, 09 May 2022 23:33:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534166#comment-17534166
 ]


Alenka Frim edited comment on ARROW-16495 at 5/10/22 6:32 AM:
--------------------------------------------------------------

Running locally on latest master I get the following (which should be the 
correct behaviour):
{code:python}
>>> import pandas as pd
>>> import pyarrow.dataset as ds

>>> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
>>> df
      C
0  <NA>
1  <NA>
2     1
>>> df.to_parquet("test.pq")

# Create a dataset
>>> dataset = ds.dataset("test.pq")
>>> fragments = [f for f in dataset.get_fragments()]

# One fragment
>>> fragments
[<pyarrow.dataset.ParquetFileFragment path=test.pq>]
>>> fragment = fragments[0]

>>> expr = ds.field("C").is_null()

# Selects the rows that have null values in C
>>> scanner = fragment.scanner(filter=expr)
>>> scanner.count_rows()
2

# Selects the rows that do not have null values in C
>>> scanner = fragment.scanner(filter=~expr)
>>> scanner.count_rows()
1
>>> scanner.to_table()
pyarrow.Table
C: int64
----
C: [[1]]
{code}
I am a bit confused: why would `is_null` remove null values? Apologies if I 
am misunderstanding the issue.


> [Python] Scanner.count_rows() doesn't properly handle null expressions
> ----------------------------------------------------------------------
>
>                 Key: ARROW-16495
>                 URL: https://issues.apache.org/jira/browse/ARROW-16495
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Nick Riasanovsky
>            Priority: Major
>
> Passing an expression filter with `is_null()` doesn't properly remove null 
> values when computing row counts. I have reproduced this with both strings 
> and integers. Here is a reproducer.
>  
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>  
> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
> print(df)
> df.to_parquet("test.pq")
>  
> # Create a dataset
> dataset = ds.dataset("test.pq")
> fragments = [f for f in dataset.get_fragments()]
> # There should be just 1 fragment.
> fragment = fragments[0]
> # Get the null row count
> expr = ds.field("C").is_null()
> scanner = fragment.scanner(filter=expr)
> print(scanner.count_rows())
> {code}
>  
> I expect this to print 2, as there are 2 NULL values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
