syun64 commented on PR #955:
URL: https://github.com/apache/iceberg-python/pull/955#issuecomment-2243845630
Proposed implementation is consistent with Spark Iceberg's behavior.
For a given Iceberg table:
```
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 1| a|
>> | 2| b|
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
```
The scan API ignores NULL values unless NULL is referenced explicitly in the
predicate expression:
```
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 3| c|
>> | 5| e|
>> +---+----+
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE NOT id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 1| a|
>> | 2| b|
>> +---+----+
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2 OR id IS NULL""").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
```
Similarly, the DELETE API avoids deleting NULL rows unless NULL is referenced
directly in the predicate expression:
```
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id == 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 1| a|
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2 OR id IS NULL""")
spark.read.table("demo.tacocat.test_null").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 3| c|
>> | 5| e|
>> +---+----+
```
So I agree with @jqin61's finding that we have to walk the predicate
expression and check whether NULLs/NaNs are mentioned directly in the delete
predicate in order to invert the expression correctly, as proposed. Simply
negating the `pyarrow.compute.Expression` will unfortunately yield the wrong
outcome:
```
import pyarrow as pa
import pyarrow.compute as pc

expr = pc.field("a") == pc.scalar(3)
tbl = pa.Table.from_pydict({"a": [1, 2, 3, None]})
# ~expr evaluates to null for the NULL row, so filter() drops that row
# even though it does not match the original predicate:
tbl.filter(~expr)
>> pyarrow.Table
>> a: int64
>> ----
>> a: [[1,2]]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]