Thanks Ryan. Reading one partition at a time sounds like a logical thing to do in my case.
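
For illustration, here is a rough sketch of the "read a partition at a time, then sort within that partition" idea from Ryan's reply. It is plain Java with no Iceberg dependency: `PartitionOrderedRead`, `readInDayOrder`, `scanDay` and `unorderedScan` are hypothetical names made up for this sketch, and the in-memory list only stands in for a table scan. With Iceberg, `scanDay` would presumably become a filtered `IcebergGenerics` read for a single day; that part is an assumption and is not shown or tested here.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PartitionOrderedRead {

    // Emits rows one day partition at a time, in ascending day order.
    // `readDay` is a hypothetical callback returning every row of one day in
    // whatever order the underlying scan produces. A row is represented here
    // only by its `time` value.
    static List<LocalDateTime> readInDayOrder(
            Collection<LocalDate> days,
            Function<LocalDate, List<LocalDateTime>> readDay) {
        List<LocalDateTime> out = new ArrayList<>();
        for (LocalDate day : new TreeSet<>(days)) {   // de-duplicate and sort the days
            List<LocalDateTime> rows = new ArrayList<>(readDay.apply(day));
            Collections.sort(rows);                   // sort within the partition
            out.addAll(rows);
        }
        return out;
    }

    // In-memory stand-in for an unordered table scan: the same four rows
    // appended twice and returned in commit order, as in the thread's example.
    static List<LocalDateTime> unorderedScan() {
        List<LocalDateTime> oneAppend = List.of(
                LocalDateTime.parse("2020-10-01T01:01:01"),
                LocalDateTime.parse("2020-10-01T02:01:01"),
                LocalDateTime.parse("2020-10-02T01:01:01"),
                LocalDateTime.parse("2020-10-02T02:01:01"));
        List<LocalDateTime> all = new ArrayList<>(oneAppend);
        all.addAll(oneAppend);
        return all;
    }

    // Stand-in for a read that is filtered down to a single day partition.
    static List<LocalDateTime> scanDay(LocalDate day) {
        return unorderedScan().stream()
                .filter(t -> t.toLocalDate().equals(day))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The days are deliberately listed out of order.
        List<LocalDate> days = List.of(
                LocalDate.parse("2020-10-02"),
                LocalDate.parse("2020-10-01"));
        for (LocalDateTime t : readInDayOrder(days, PartitionOrderedRead::scanDay)) {
            System.out.println(t.toLocalDate() + "  " + t);
        }
    }
}
```

The caller has to know which days to read; in this sketch the list is simply passed in, and the ordering comes entirely from the caller iterating days in sorted order rather than from the scan itself.
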
I cannot use a query engine for now. In that case, is IcebergGenerics still the best way to read via the core API?

On Thu, Mar 25, 2021 at 2:16 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Chen,
>
> Iceberg doesn't guarantee any order for records returned by
> `IcebergGenerics`. If you want a specific order, I'd recommend using a
> query engine to sort or to read a partition at a time and then sort within
> that partition.
>
> Iceberg can't really guarantee order across files. The sort order files
> are written with may change over time, and Iceberg will also use the lack
> of a guarantee to work faster in some cases. For example, most job planning
> is done by reading manifest files in parallel so there isn't an order that
> data files are returned in. Iceberg will also pack files into tasks in most
> cases (though not for `IcebergGenerics`) so files can be reordered
> depending on size as well.
>
> On Thu, Mar 25, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
>
>> Popping up the question.
>>
>> On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> I want to clarify the ordering semantics (if deterministic) of
>>> partitions returned when using the Iceberg core data API to read.
>>>
>>> Say I define a table with a *time* column and partition by *day(time)*, and
>>> do the following writes.
>>>
>>> partition (day)  time                 other data fields
>>> 2020-10-01       2020-10-01 01:01:01  ...
>>> 2020-10-01       2020-10-01 02:01:01  ...
>>> 2020-10-02       2020-10-02 01:01:01  ...
>>> 2020-10-02       2020-10-02 02:01:01  ...
>>>
>>> Then I read everything using something like the following.
>>>
>>> IcebergGenerics.read(table).build();
>>>
>>> I did see rows returned in the right order in terms of partitions. Then,
>>> if I append the same data again and read again, I see rows returned like this.
>>>
>>> 2020-10-01       2020-10-01 01:01:01  ...
>>> 2020-10-01       2020-10-01 02:01:01  ...
>>> 2020-10-02       2020-10-02 01:01:01  ...
>>> 2020-10-02       2020-10-02 02:01:01  ...
>>> 2020-10-01       2020-10-01 01:01:01  ...
>>> 2020-10-01       2020-10-01 02:01:01  ...
>>> 2020-10-02       2020-10-02 01:01:01  ...
>>> 2020-10-02       2020-10-02 02:01:01  ...
>>>
>>> In other words, the rows are returned in order first by commit time, then
>>> by partition *day*. If I want to ensure the data from partition
>>> 2020-10-01 is always returned before 2020-10-02 in the above example, is
>>> there a way to configure the reader to do that? I checked the reader API
>>> and cannot seem to find a method to do that.
>>>
>>> Please note that I am NOT talking about sorting within a partition,
>>> which I know has to be enforced by the writer.
>>>
>>> --
>>> Chen Song
>>>
>>
>> --
>> Chen Song
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

--
Chen Song