Thanks Ryan. Reading one partition at a time sounds like the logical thing
to do in my case.

I cannot use a query engine for now. In that case, is IcebergGenerics still
the best way to read via the core API?
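
A minimal sketch of the partition-at-a-time approach, written as plain Java
with no Iceberg dependency. The class name, method names, and the `readDay`
callback are hypothetical: `readDay` stands in for whatever filtered read
returns the rows of one day (with the core API that might be
`IcebergGenerics.read(table).where(...).build()` with a filter on the *time*
column, which is an assumption here and not shown). The sample rows mirror
the example in the original question.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Function;

// Illustrative only: class, method, and parameter names are hypothetical.
final class PartitionOrderedRead {

    // Calls readDay once per distinct day, in ascending day order, and sorts
    // the rows of each day by time before appending them to the result.
    // readDay is a stand-in for a filtered table read, not an Iceberg API.
    static List<LocalDateTime> readInDayOrder(
            Collection<LocalDate> days,
            Function<LocalDate, List<LocalDateTime>> readDay) {
        List<LocalDateTime> ordered = new ArrayList<>();
        for (LocalDate day : new TreeSet<>(days)) {  // TreeSet sorts and de-duplicates
            List<LocalDateTime> rows = new ArrayList<>(readDay.apply(day));
            Collections.sort(rows);                  // order within the partition
            ordered.addAll(rows);
        }
        return ordered;
    }

    // Simulates the scan from the question: the same four rows appended
    // twice, returned grouped by commit rather than by day.
    static List<LocalDateTime> simulatedScan() {
        List<LocalDateTime> commit = List.of(
                LocalDateTime.parse("2020-10-01T01:01:01"),
                LocalDateTime.parse("2020-10-01T02:01:01"),
                LocalDateTime.parse("2020-10-02T01:01:01"),
                LocalDateTime.parse("2020-10-02T02:01:01"));
        List<LocalDateTime> scan = new ArrayList<>(commit);
        scan.addAll(commit);
        return scan;
    }

    // Reads the simulated table one day at a time; the lambda plays the
    // role of the per-partition filtered read.
    static List<LocalDateTime> demo() {
        List<LocalDateTime> scan = simulatedScan();
        List<LocalDate> days = List.of(
                LocalDate.parse("2020-10-02"),
                LocalDate.parse("2020-10-01"));
        return readInDayOrder(days, day -> {
            List<LocalDateTime> rows = new ArrayList<>();
            for (LocalDateTime t : scan) {
                if (t.toLocalDate().equals(day)) {
                    rows.add(t);
                }
            }
            return rows;
        });
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Only one partition's rows are held and sorted at a time, so the memory cost
is bounded by the largest partition rather than by the whole table.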

On Thu, Mar 25, 2021 at 2:16 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Chen,
>
> Iceberg doesn't guarantee any order for records returned by
> `IcebergGenerics`. If you want a specific order, I'd recommend using a
> query engine to sort or to read a partition at a time and then sort within
> that partition.
>
> Iceberg can't really guarantee order across files. The sort order files
> are written with may change over time, and Iceberg will also use the lack
> of a guarantee to work faster in some cases. For example, most job planning
> is done by reading manifest files in parallel so there isn't an order that
> data files are returned in. Iceberg will also pack files into tasks in most
> cases (though not for `IcebergGenerics`) so files can be reordered
> depending on size as well.
>
> On Thu, Mar 25, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
>
>> Bumping this question.
>>
>> On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> I want to clarify the ordering semantics (and whether they are
>>> deterministic) of partitions returned when reading with the Iceberg core
>>> data API.
>>>
>>> Say I define a table with a *time* column and partition by *day(time)*, and
>>> do the following writes.
>>>
>>> partition (day)    time                   other data fields
>>> 2020-10-01         2020-10-01 01:01:01    ...
>>> 2020-10-01         2020-10-01 02:01:01    ...
>>> 2020-10-02         2020-10-02 01:01:01    ...
>>> 2020-10-02         2020-10-02 02:01:01    ...
>>>
>>> Then I read everything using something like the following.
>>>
>>>     IcebergGenerics.read(table).build();
>>>
>>> The rows came back in the right order in terms of partitions. But if I
>>> append the same data again and read again, I see rows returned like this:
>>>
>>> 2020-10-01         2020-10-01 01:01:01    ...
>>> 2020-10-01         2020-10-01 02:01:01    ...
>>> 2020-10-02         2020-10-02 01:01:01    ...
>>> 2020-10-02         2020-10-02 02:01:01    ...
>>> 2020-10-01         2020-10-01 01:01:01    ...
>>> 2020-10-01         2020-10-01 02:01:01    ...
>>> 2020-10-02         2020-10-02 01:01:01    ...
>>> 2020-10-02         2020-10-02 02:01:01    ...
>>>
>>> In other words, the rows are returned ordered first by commit time, then
>>> by partition *day*. If I want to ensure the data from partition
>>> 2020-10-01 is always returned before 2020-10-02 in the above example, is
>>> there a way to configure the reader to do that? I checked the reader API
>>> and cannot find a method to do that.
>>>
>>> Please note that I am NOT talking about sorting within a partition,
>>> which I know has to be enforced by the writer.
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>> --
>> Chen Song
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Chen Song
