syun64 commented on PR #955:
URL: https://github.com/apache/iceberg-python/pull/955#issuecomment-2243845630
Proposed implementation is consistent with Spark Iceberg's behavior.
For a given Iceberg table:
```
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 1| a|
>> | 2| b|
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
```
The scan API ignores NULL values unless NULL is referenced explicitly in the
predicate expression:
```
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 3| c|
>> | 5| e|
>> +---+----+
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE NOT id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 1| a|
>> | 2| b|
>> +---+----+
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2 OR id IS NULL""").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
```
Similarly, the DELETE API avoids deleting NULL rows unless NULL is referenced
directly in the predicate expression:
```
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id == 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 1| a|
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> | id|data|
>> +----+----+
>> | 3| c|
>> |NULL| d|
>> | 5| e|
>> +----+----+
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2 OR id IS NULL""")
spark.read.table("demo.tacocat.test_null").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> | 3| c|
>> | 5| e|
>> +---+----+
```
So I agree with @jqin61's finding that we have to walk the predicate
expression and check whether NULLs/NaNs are mentioned directly in the delete
predicate in order to invert the expression correctly, as proposed. Simply
negating the `pyarrow.compute.Expression` will unfortunately yield the wrong
outcome:
```
import pyarrow as pa
import pyarrow.compute as pc

expr = pc.field("a") == pc.scalar(3)
tbl = pa.Table.from_pydict({"a": [1, 2, 3, None]})
# ~expr evaluates to null for the NULL row, so filter() drops that row
# even though it does not match the original predicate:
tbl.filter(~expr)
>> pyarrow.Table
>> a: int64
>> ----
>> a: [[1,2]]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]