[ https://issues.apache.org/jira/browse/ARROW-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534166#comment-17534166 ]
Alenka Frim edited comment on ARROW-16495 at 5/10/22 6:32 AM:
--------------------------------------------------------------

Running locally on latest master I get the following (which should be the correct behaviour):

{code:python}
>>> import pandas as pd
>>> import pyarrow.dataset as ds
>>> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
>>> df
      C
0  <NA>
1  <NA>
2     1
>>> df.to_parquet("test.pq")

# Create a dataset
>>> dataset = ds.dataset("test.pq")
>>> fragments = [f for f in dataset.get_fragments()]

# One fragment
>>> fragments
[<pyarrow.dataset.ParquetFileFragment path=test.pq>]
>>> fragment = fragments[0]
>>> expr = ds.field("C").is_null()

# Selects the rows that have null values in C
>>> scanner = fragment.scanner(filter=expr)
>>> scanner.count_rows()
2

# Selects the rows that do not have null values in C
>>> scanner = fragment.scanner(filter=~expr)
>>> scanner.count_rows()
1
>>> scanner.to_table()
pyarrow.Table
C: int64
----
C: [[1]]
{code}

I am a bit confused as to why `is_null` would remove null values. Apologies if I am misunderstanding the issue.


was (Author: alenkaf):
Running locally on latest master I get the following (which is the correct behaviour):

{code:python}
>>> import pandas as pd
>>> import pyarrow.dataset as ds
>>> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
>>> df
      C
0  <NA>
1  <NA>
2     1
>>> df.to_parquet("test.pq")

# Create a dataset
>>> dataset = ds.dataset("test.pq")
>>> fragments = [f for f in dataset.get_fragments()]

# One fragment
>>> fragments
[<pyarrow.dataset.ParquetFileFragment path=test.pq>]
>>> fragment = fragments[0]
>>> expr = ds.field("C").is_null()

# Selects the rows that have null values in C
>>> scanner = fragment.scanner(filter=expr)
>>> scanner.count_rows()
2

# Selects the rows that do not have null values in C
>>> scanner = fragment.scanner(filter=~expr)
>>> scanner.count_rows()
1
>>> scanner.to_table()
pyarrow.Table
C: int64
----
C: [[1]]
{code}

I am a bit confused as to why `is_null` would remove null values. Apologies if I am misunderstanding the issue.


> [Python] Scanner.count_rows() doesn't properly handle null expressions
> ----------------------------------------------------------------------
>
>                 Key: ARROW-16495
>                 URL: https://issues.apache.org/jira/browse/ARROW-16495
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Nick Riasanovsky
>            Priority: Major
>
> Passing an expression filter with `is_null()` doesn't properly remove null
> values when computing row counts. I have reproduced this with both strings
> and integers. Here is a reproducer.
>
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>
> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
> print(df)
> df.to_parquet("test.pq")
>
> # Create a dataset
> dataset = ds.dataset("test.pq")
> fragments = [f for f in dataset.get_fragments()]
> # There should be just 1 fragment.
> fragment = fragments[0]
> # Get the null row count
> expr = ds.field("C").is_null()
> scanner = fragment.scanner(filter=expr)
> print(scanner.count_rows())
> {code}
>
> I expect this to print 2 as there are 2 NULL values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)