joao-parana opened a new issue, #6097:
URL: https://github.com/apache/iceberg/issues/6097
### Apache Iceberg version
1.0.0 (latest release)
### Query engine
_No response_
### Please describe the bug 🐞
Hi folks, I would like to report something that might be a problem with
partitioning based on the `identity` transform.
I created a schema like this:
```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;

this.schema = new Schema(
    required(1, "ts", Types.TimestampType.withoutZone()),
    required(2, "hotel_id", Types.LongType.get()),
    optional(3, "hotel_name", Types.StringType.get()),
    required(4, "arrival_date", Types.DateType.get()),
    required(5, "value", Types.DoubleType.get()));
```
I then appended data as Parquet files using five different partitioning
strategies (one per table), as sketched in the Java snippet after this list:
1. unpartitioned
2. `identity("hotel_name")`
3. `month("ts")`
4. `identity("hotel_name")` AND `month("ts")`
5. `day("ts")`
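For reference, a minimal sketch of how the five specs can be built with the
standard `PartitionSpec` builder against the schema above (the variable names
are mine, for illustration only):
```java
import org.apache.iceberg.PartitionSpec;

// 1. unpartitioned
PartitionSpec unpartitioned = PartitionSpec.unpartitioned();

// 2. identity("hotel_name")
PartitionSpec byHotelName = PartitionSpec.builderFor(schema)
    .identity("hotel_name")
    .build();

// 3. month("ts")
PartitionSpec byMonth = PartitionSpec.builderFor(schema)
    .month("ts")
    .build();

// 4. identity("hotel_name") AND month("ts")
PartitionSpec byHotelNameAndMonth = PartitionSpec.builderFor(schema)
    .identity("hotel_name")
    .month("ts")
    .build();

// 5. day("ts")
PartitionSpec byDay = PartitionSpec.builderFor(schema)
    .day("ts")
    .build();
```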
For testing purposes I inserted a single record into each table, with its
respective partitioning. Listing the query results for the five tables gives
the following:
```txt
001: Record(2022-10-30T21:00:50.929375, 1000, hotel_name-1000, 2023-01-01, 4.13)
--------------------------------------------------------------------------------------
002: Record(2022-10-30T21:00:53.238515, 1000, null, 2023-01-01, 4.13)
--------------------------------------------------------------------------------------
003: Record(2022-10-30T21:00:53.461455, 1000, hotel_name-1000, 2023-01-01, 4.13)
--------------------------------------------------------------------------------------
004: Record(2022-10-30T21:00:53.653993, 1000, null, 2023-01-01, 4.13)
--------------------------------------------------------------------------------------
005: Record(2022-10-30T21:00:53.843971, 1000, hotel_name-1000, 2023-01-01, 4.13)
```
Note that in cases 2 and 4, where I used `identity()` partitioning, **the
`hotel_name` column is NULL**.
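For context, a minimal sketch of the kind of plain table scan that reproduces
the listing above, assuming a `HadoopCatalog` rooted at the test warehouse
path from my setup (the catalog path and table name match my test; the exact
read path in the gist may differ):
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "/tmp/iceberg-test-2");
Table table = catalog.loadTable(TableIdentifier.of("bookings"));

// Scan all records through the Iceberg Java API; identity-partitioned
// columns are resolved by the reader, which is where the nulls show up.
try (CloseableIterable<Record> records = IcebergGenerics.read(table).build()) {
  for (Record record : records) {
    System.out.println(record);
  }
}
```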
By the way, the Parquet data files were created with the `SortOrder` shown
below:
```java
import org.apache.iceberg.NullOrder;
import org.apache.iceberg.SortOrder;

final SortOrder sortOrder = SortOrder.builderFor(schema)
    .asc("ts", NullOrder.NULLS_FIRST)
    .asc("hotel_name", NullOrder.NULLS_FIRST)
    .build();
```
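For completeness, a hedged sketch of how a table can be created with both a
partition spec and this sort order via the standard `Catalog.buildTable`
builder (this reuses the `catalog` and `byHotelName` variables from the
sketches above; it is an illustration, not necessarily how the gist creates
its tables):
```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// Create the identity-partitioned table (case 2) with the sort order attached.
Table table = catalog.buildTable(TableIdentifier.of("bookings"), schema)
    .withPartitionSpec(byHotelName)
    .withSortOrder(sortOrder)
    .create();
```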
It is also important to note that I ran a query (with Python 3.10) against the
`root` directory of the catalog using `pyarrow` and `datafusion`.
This query correctly shows the `hotel_name` data for all Parquet files,
including those for which the query via the **Iceberg Java API** returned null.
It follows that the Parquet files themselves are correct.
My test in Python 3.10 is:
```python
import datafusion
import pyarrow as pa
from datafusion import SessionContext

print(datafusion.__version__)

d = '/tmp/iceberg-test-2/bookings/'

# Register every Parquet file under the table directory directly,
# bypassing the Iceberg metadata layer entirely.
ctx = SessionContext()
ctx.register_parquet("soma", d)
print(ctx.tables())

# Read all rows back and materialize them as a PyArrow table.
rb = ctx.sql("SELECT * FROM soma").collect()
t = pa.Table.from_batches(rb)
print(t.to_pydict())
```
And the result was:
```python
{'ts': [datetime.datetime(2022, 10, 30, 19, 36, 45, 434510),
        datetime.datetime(2022, 10, 30, 19, 36, 43, 137844),
        datetime.datetime(2022, 10, 30, 19, 36, 44, 902886),
        datetime.datetime(2022, 10, 30, 19, 36, 45, 155755),
        datetime.datetime(2022, 10, 30, 19, 36, 45, 688557)],
 'hotel_id': [1000, 1000, 1000, 1000, 1000],
 'hotel_name': ['hotel_name-1000', 'hotel_name-1000', 'hotel_name-1000',
                'hotel_name-1000', 'hotel_name-1000'],
 'arrival_date': [datetime.date(2023, 1, 1), datetime.date(2023, 1, 1),
                  datetime.date(2023, 1, 1), datetime.date(2023, 1, 1),
                  datetime.date(2023, 1, 1)],
 'value': [4.13, 4.13, 4.13, 4.13, 4.13]}
```
The timestamps are in microseconds.
The complete test code is here:
https://gist.github.com/joao-parana/2adbd97c70c701668cd5e778a92262ea
I'm using version `1.0.0` of the **Iceberg Java API**.
This issue was also posted in the Iceberg Slack channel.