Question on ordering on partitions when read

Chen Song Wed, 24 Mar 2021 11:01:46 -0700

I want to clarify the ordering semantics (and whether the order is
deterministic) of partitions returned when reading with the Iceberg core
data API.


Say I define a table with a *time* column and partition by *day(time)*, and
do the following writes.

partition (day)    time                   other data fields
2020-10-01         2020-10-01 01:01:01    ...
2020-10-01         2020-10-01 02:01:01    ...
2020-10-02         2020-10-02 01:01:01    ...
2020-10-02         2020-10-02 02:01:01    ...

Then I read all rows using something like the following:

    IcebergGenerics.read(table).build();

I see the rows returned in the right order in terms of partitions. But if
I append the same data again and read again, I see rows returned like this:

2020-10-01         2020-10-01 01:01:01    ...
2020-10-01         2020-10-01 02:01:01    ...
2020-10-02         2020-10-02 01:01:01    ...
2020-10-02         2020-10-02 02:01:01    ...
2020-10-01         2020-10-01 01:01:01    ...
2020-10-01         2020-10-01 02:01:01    ...
2020-10-02         2020-10-02 01:01:01    ...
2020-10-02         2020-10-02 02:01:01    ...

In other words, the rows are returned ordered first by commit time and then
by partition *day*. If I want to ensure that data from partition 2020-10-01
is always returned before 2020-10-02 in the above example, is there a way to
configure the reader to do that? I checked the reader API and could not
find a method to do that.

Please note that I am NOT talking about sorting within a partition,
which I know has to be enforced by the writer.
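
To make the desired ordering concrete, here is a minimal sketch of a
client-side workaround. It does not use any Iceberg API: Row, sampleRows
and groupByDay are hypothetical names introduced only for illustration, and
the sketch assumes the whole result fits in memory. It stable-sorts the rows
by day only, so all rows of one partition come out together while the order
in which they were read within a partition is left untouched.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionOrderSketch {

    // Hypothetical stand-in for a row returned by the reader; not an Iceberg class.
    static final class Row {
        final LocalDateTime time;
        final int commit; // which append produced the row (illustration only)

        Row(LocalDateTime time, int commit) {
            this.time = time;
            this.commit = commit;
        }

        // Mirrors the day(time) partition transform.
        LocalDate day() {
            return time.toLocalDate();
        }
    }

    // Rows in the order observed after the second append: by commit, then by day.
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        for (int commit = 1; commit <= 2; commit++) {
            for (int dayOfMonth = 1; dayOfMonth <= 2; dayOfMonth++) {
                for (int hour = 1; hour <= 2; hour++) {
                    rows.add(new Row(LocalDateTime.of(2020, 10, dayOfMonth, hour, 1, 1), commit));
                }
            }
        }
        return rows;
    }

    // Copies the rows and sorts them by partition day only.
    // List.sort is stable, so rows with the same day keep their read order.
    static List<Row> groupByDay(Iterable<Row> rows) {
        List<Row> out = new ArrayList<>();
        rows.forEach(out::add);
        out.sort(Comparator.comparing(Row::day));
        return out;
    }

    public static void main(String[] args) {
        for (Row r : groupByDay(sampleRows())) {
            System.out.println(r.day() + "  " + r.time + "  commit " + r.commit);
        }
    }
}
```

This buffers every row before returning anything, so it is not a substitute
for ordering at the reader level, which is what I am really asking about.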

-- 
Chen Song
