Hi Yi,

I think Iceberg could work for you without too much trouble.

You might want to look more into the hidden partitioning that Iceberg
provides. I agree that most users want the storage layer to handle
partitioning for them. That's exactly what Iceberg does: it makes data
partitioning part of the table configuration and hides the details from
users. Users just filter on timestamp, and Iceberg automatically converts
those filters to match partitions and data files. You would probably want
to use the date/time partition transforms, like days and hours. See the
partitioning doc for more details: https://iceberg.apache.org/partitioning
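
For example, here's a rough sketch of setting up a daily-partitioned table
spec with the Java API. The schema and field names are just placeholders
for illustration:

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    // Placeholder time-series schema; "ts" is the event timestamp.
    Schema schema = new Schema(
        Types.NestedField.required(1, "ts", Types.TimestampType.withZone()),
        Types.NestedField.required(2, "value", Types.DoubleType.get()));

    // Hidden partitioning: data files are grouped by day(ts). Readers
    // still filter on ts directly, and Iceberg prunes partitions and
    // data files for them.
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .day("ts")
        .build();

You could use .hour("ts") instead if your queries usually cover ranges
narrower than a day.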

You would need to make sure you're writing data in sorted order, which
depends on how you write the data. In Spark, you could add an ORDER BY
clause to the write. You would also need to build a reader that merges
sorted files at read time, although you could use Iceberg's existing Arrow
support and possibly reuse your existing merge code if it works with Arrow.
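
Here's a minimal sketch of a sorted write from Spark, assuming a Spark 3
session with an Iceberg catalog already configured; the table and column
names below are made up:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("sorted-write")
        .getOrCreate();

    // The global ORDER BY range-partitions and sorts rows before the
    // write, so each data file comes out in timestamp order.
    spark.sql(
        "INSERT INTO db.events SELECT * FROM staging_events ORDER BY ts");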

I don't think this is too much customization. Iceberg should provide many
of the building blocks you need. Please reach out to the dev list if you
have any more questions!

rb

On Wed, Sep 16, 2020 at 7:41 AM Yi Chen <yi.chen1...@gmail.com> wrote:

> Hi Iceberg Dev,
>
>
> We are looking into Iceberg for a data lake solution to replace a legacy
> system that has been there for many years. Our data (~10+ PB in total) is
> time-series tabular data. We built a proof-of-concept earlier, which ended
> up with a design very similar to Iceberg's, especially on the table spec.
>
>
> However, our use case has a few special requirements (supported by our
> legacy system) that are missing in Iceberg today:
>
>    - Our applications always expect sorted rows (by timestamp) when
>    reading the time-series data from the data lake.
>    - Our users do not want to deal with table partitioning. They expect
>    the storage layer (or the data-lake middle layer) to optimize
>    partitioning for them.
>
> Our legacy system supports both by enforcing row order at write time and
> by running a background service that consolidates small data files into
> larger ones to optimize storage usage and improve query performance. (The
> system does merge-on-read to resolve intersecting time ranges that have
> not been consolidated yet.) After we switch to Iceberg, to continue
> supporting the above features, it looks like we would have to:
>
>    1. use a special partition spec that always creates a single partition
>    for any table,
>    2. build a background consolidation service on top of Iceberg's
>    compaction API, and
>    3. build a new writer (we use Arrow) that enforces write order.
>
> Would that be too much customization on top of what Iceberg has today? Or
> would you consider this a legitimate use case for Iceberg going forward?
>
>
> We noticed many ongoing efforts around topics like SortOrder,
> Merge-on-Read, Row-delete, etc. that seem to be very relevant. We are happy
> to contribute to the community if our use case makes sense to Iceberg.
>
>
> Thanks,
>
> Yi
>


-- 
Ryan Blue
Software Engineer
Netflix
