joao-parana opened a new issue, #6097:
URL: https://github.com/apache/iceberg/issues/6097

   ### Apache Iceberg version
   
   1.0.0 (latest release)
   
   ### Query engine
   
   _No response_
   
   ### Please describe the bug 🐞
   
   Hi folks, I would like to report something that might be a problem with 
partitioning based on the "identity" transform.
   
   I created a schema like this:
   
   ```java
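   // required(...) and optional(...) are static imports from
   // org.apache.iceberg.types.Types.NestedField
   this.schema = new Schema(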
     required(1, "ts", Types.TimestampType.withoutZone()),
     required(2, "hotel_id", Types.LongType.get()),
     optional(3, "hotel_name", Types.StringType.get()),
     required(4, "arrival_date", Types.DateType.get()),
     required(5, "value", Types.DoubleType.get()));
   ```
   
   I then appended data as Parquet files to five different tables, each with 
a different partition spec:
   
   1. unpartitioned
   2. identity("hotel_name")
   3. month("ts")
   4. identity("hotel_name") AND month("ts")
   5. day("ts")
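
To make the five layouts concrete, here is a rough sketch of the partition value each spec would be expected to derive from the single test record. It assumes the transform definitions in the Iceberg table spec (identity keeps the source value; month and day count from 1970-01-01). It is plain Python with hypothetical helper names (`identity`, `month`, `day`, `partition_values`), not the Iceberg API and not part of the linked test code.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def identity(value):
    """identity transform: the partition value is the source value itself."""
    return value


def month(ts):
    """month transform: whole months elapsed since 1970-01."""
    return (ts.year - EPOCH.year) * 12 + (ts.month - EPOCH.month)


def day(ts):
    """day transform: whole days elapsed since 1970-01-01."""
    return (ts.date() - EPOCH).days


# The test record; the timestamp is the one printed for table 001.
record = {
    "ts": datetime(2022, 10, 30, 21, 0, 50, 929375),
    "hotel_id": 1000,
    "hotel_name": "hotel_name-1000",
}

# One entry per table, in the order of the list above.
specs = {
    "unpartitioned": lambda r: (),
    "identity(hotel_name)": lambda r: (identity(r["hotel_name"]),),
    "month(ts)": lambda r: (month(r["ts"]),),
    "identity(hotel_name), month(ts)": lambda r: (
        identity(r["hotel_name"]),
        month(r["ts"]),
    ),
    "day(ts)": lambda r: (day(r["ts"]),),
}


def partition_values(rec):
    """Return the partition tuple each spec derives from one record."""
    return {name: fn(rec) for name, fn in specs.items()}


for name, values in partition_values(record).items():
    print(f"{name}: {values}")
```

Under these assumptions, the identity-partitioned tables carry the hotel name itself as the partition value, so a null in that column is not what the spec alone would predict.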
   
   For testing purposes, I inserted a single record into each table, using its 
respective partitioning. Querying each table through the Iceberg Java API 
returns the following:
   
   ```txt
   001:         Record(2022-10-30T21:00:50.929375, 1000, hotel_name-1000, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   002:         Record(2022-10-30T21:00:53.238515, 1000, null, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   003:         Record(2022-10-30T21:00:53.461455, 1000, hotel_name-1000, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   004:         Record(2022-10-30T21:00:53.653993, 1000, null, 2023-01-01, 4.13)
   --------------------------------------------------------------------------------------
   005:         Record(2022-10-30T21:00:53.843971, 1000, hotel_name-1000, 2023-01-01, 4.13)
   ```
   
   Note that in cases 2 and 4, the two tables partitioned with 
`identity("hotel_name")`, **the hotel_name column is NULL**.
   
   BTW, the Parquet `DataFiles` were created with the `SortOrder` shown 
below:
   
   ```java
   final SortOrder sortOrder = SortOrder.builderFor(schema)
     .asc("ts", NullOrder.NULLS_FIRST)
     .asc("hotel_name", NullOrder.NULLS_FIRST)
     .build();
   ```
   
   I also ran a query (with Python 3.10) against the `root` directory of the 
catalog using `pyarrow` and `datafusion`.  
   This query correctly shows the `hotel_name` data for all Parquet files, 
including those for which the query via the **Iceberg Java API** returned null.
   It follows that the Parquet files themselves are correct.
   
   My test in Python 3.10 is:
   
   ```python
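   import datafusion
   import pyarrow as pa
   from datafusion import SessionContext

   # Root directory that holds the data written by the Java test
   d = '/tmp/iceberg-test-2/bookings/'

   print(datafusion.__version__)
   ctx = SessionContext()
   # Register all Parquet files found under d as one table named "soma"
   ctx.register_parquet("soma", d)
   print(ctx.tables())
   # Read every row back and convert the record batches to a pyarrow Table
   rb = ctx.sql("SELECT * FROM soma").collect()
   t = pa.Table.from_batches(rb)
   print(t.to_pydict())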
   ```
   
   And the result was:
   ```python
   {
     'ts': [
       datetime.datetime(2022, 10, 30, 19, 36, 45, 434510),
       datetime.datetime(2022, 10, 30, 19, 36, 43, 137844),
       datetime.datetime(2022, 10, 30, 19, 36, 44, 902886),
       datetime.datetime(2022, 10, 30, 19, 36, 45, 155755),
       datetime.datetime(2022, 10, 30, 19, 36, 45, 688557)
     ],
     'hotel_id': [1000, 1000, 1000, 1000, 1000],
     'hotel_name': [
       'hotel_name-1000',
       'hotel_name-1000',
       'hotel_name-1000',
       'hotel_name-1000',
       'hotel_name-1000'
     ],
     'arrival_date': [
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1),
       datetime.date(2023, 1, 1)
     ],
     'value': [4.13, 4.13, 4.13, 4.13, 4.13]
   }
   ```
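
The query above reads everything under the root directory as one table, so it does not show which file belongs to which Iceberg table. A minimal sketch of a per-file check follows, assuming the data files end in `.parquet` and that `pyarrow` is installed. The helper names `find_parquet_files` and `hotel_names_per_file` are made up for illustration.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return every *.parquet file below root, sorted for stable output."""
    return sorted(Path(root).rglob("*.parquet"))


def hotel_names_per_file(root):
    """Map each Parquet file path to the hotel_name values stored in it."""
    files = find_parquet_files(root)
    if not files:
        # Nothing to read, so pyarrow is not needed at all.
        return {}
    import pyarrow.parquet as pq  # imported lazily, only when files exist

    result = {}
    for path in files:
        # Read only the column under investigation.
        table = pq.read_table(path, columns=["hotel_name"])
        result[str(path)] = table.column("hotel_name").to_pylist()
    return result


if __name__ == "__main__":
    # Same root directory as in the test above.
    root = "/tmp/iceberg-test-2/bookings/"
    for path, names in hotel_names_per_file(root).items():
        print(path, names)
```

If the directory layout encodes the table name in the path, the printed paths should show directly whether the files of the identity-partitioned tables contain the hotel name.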
   
   Timestamps are in microseconds.
   
   The complete test code is here: 
https://gist.github.com/joao-parana/2adbd97c70c701668cd5e778a92262ea
   
   I'm using version `1.0.0` of the **Iceberg Java API**.
   
   This issue was also posted in the Iceberg Slack channel.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

