Hi Yi, I think Iceberg could work for you without too much trouble.
You might want to look more into the partitioning that Iceberg provides. I
agree that most users want the storage layer to handle partitioning for
them. That's exactly what Iceberg does: it makes data partitioning part of
the table configuration and hides the concern from users. Users just need
to tell Iceberg how to filter on timestamp, and it will automatically
convert those filters to match partitions and data files. You would
probably want to use the date/time partition transforms, like days and
hours. See the partitioning doc for more details:
https://iceberg.apache.org/partitioning

You would need to make sure you're writing data in sorted order, which
depends on how you're writing data. In Spark, you could use an ORDER BY
clause. You would also need to build a reader that can merge, although you
could use Iceberg's existing Arrow support and possibly reuse your existing
merge code if it works with Arrow.

I don't think this is too much customization. Iceberg should provide many
of the building blocks you need. Please reach out to the dev list if you
have any more questions!

rb

On Wed, Sep 16, 2020 at 7:41 AM Yi Chen <yi.chen1...@gmail.com> wrote:

> Hi Iceberg Dev,
>
> We are looking into Iceberg for a data lake solution to replace a legacy
> system that has been in place for many years. Our data (~10+ PB in total)
> is time-series tabular data. We built a proof of concept earlier, which
> ended up with a design very similar to Iceberg's, especially in the table
> spec.
>
> However, our use case has a few special requirements (supported by our
> legacy system) that are missing in Iceberg today:
>
> - Our applications always expect sorted rows (by timestamp) when reading
>   the time-series data from the data lake.
> - Our users do not want to deal with table partitioning. They expect the
>   storage layer (or the data-lake middle layer) to optimize the
>   partitioning for them.
>
> Our legacy system supports both by enforcing row order at write time and
> by having a background service that consolidates small data files into
> larger ones, optimizing storage usage for better query performance. (The
> system does merge-on-read to resolve the intersecting time ranges that
> have not been consolidated yet.) After we switch to Iceberg, to continue
> supporting the above features, it looks like we would have to:
>
> 1. use a special partition spec that always creates a single partition
>    for any table,
> 2. build a background consolidation service on top of Iceberg's
>    compaction API, and
> 3. build a new writer (we use Arrow) that enforces write order.
>
> Would that be too much customization on top of what Iceberg has today? Or
> would you even consider this a legitimate use case for Iceberg in the
> future?
>
> We noticed many ongoing efforts around topics like SortOrder,
> Merge-on-Read, Row-delete, etc. that seem very relevant. We are happy to
> contribute to the community if our use case makes sense for Iceberg.
>
> Thanks,
> Yi

-- 
Ryan Blue
Software Engineer
Netflix
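To make Ryan's suggestion concrete, the date/time partition transform and
the sorted write he describes can both be expressed in Spark SQL. This is
only a sketch: the table and column names (db.events, ts) are invented for
illustration, and the days() transform and ORDER BY-on-insert are the
documented Iceberg/Spark mechanisms being referred to.

```sql
-- Hidden partitioning: Iceberg derives the day partition from ts,
-- so readers and writers never reference a partition column directly.
CREATE TABLE db.events (
    id   bigint,
    data string,
    ts   timestamp)
USING iceberg
PARTITIONED BY (days(ts));

-- Writing in sorted order so each data file holds a sorted run:
INSERT INTO db.events
SELECT id, data, ts
FROM staging.events
ORDER BY ts;
```

Filters like `WHERE ts >= TIMESTAMP '2020-09-01'` are then converted to
partition and file pruning automatically, which is the "hides the concern
from users" behavior in the reply.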
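The merge-on-read step discussed in the thread, resolving intersecting
time ranges that each arrive pre-sorted, reduces to a k-way merge of
sorted runs. A minimal stdlib sketch follows; the runs and timestamp
values are made up for illustration, and a real reader would stream Arrow
record batches rather than Python lists.

```python
import heapq

# Two hypothetical sorted runs, standing in for rows from two data
# files whose time ranges overlap; each tuple is (timestamp, payload).
run_a = [(1, "a1"), (4, "a2"), (9, "a3")]
run_b = [(2, "b1"), (3, "b2"), (7, "b3")]

# heapq.merge lazily k-way merges already-sorted inputs by the
# timestamp key -- the core of a merge-on-read over sorted files.
merged = list(heapq.merge(run_a, run_b, key=lambda row: row[0]))

print([ts for ts, _ in merged])  # [1, 2, 3, 4, 7, 9]
```

Because heapq.merge is lazy, the same pattern scales to many input files
without materializing them all in memory.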