To begin prototyping, I would start with the following questions.
1) Does Iceberg need a sort spec?
- I would say yes
2) Should Iceberg allow users to define a sort spec only if the table is
bucketed?
- I would say no, as it seems valid to have partitioned and sorted
tables.
3) How should Iceberg encode sort specs?
- Option #1 is to rely on table properties, which will allow us to use
ALTER TABLE ... SET TBLPROPERTIES to configure sort specs. However, I am not
sure it would be easy to encode non-trivial sort specs or to track sort spec
evolution (if needed).
- Option #2 is to extend PartitionSpec to cover sorting as well. This
option will allow us to use transformations to encode non-trivial sorts and
won't require many changes to the codebase.
- Option #3 is to store SortSpec separately from PartitionSpec. This
will require more changes compared to Option #2 but can also give us extra
flexibility.
Each option has its own trade-offs, but I tend to think #2 is reasonable.
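
Whichever option we pick, the information to encode is roughly the same: an
ordered list of columns, each with a direction. Here is a rough sketch in
Python of what that could look like. The names SortField, SortSpec and
sort_rows are hypothetical and are not part of Iceberg; transforms are left
out to keep it short.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class SortField:
    """One entry of a hypothetical sort spec: a column and a direction."""
    column: str
    ascending: bool = True


@dataclass(frozen=True)
class SortSpec:
    """Hypothetical sort spec: an id (to allow evolution) plus ordered fields."""
    spec_id: int
    fields: Tuple[SortField, ...]


def sort_rows(rows: List[Dict[str, Any]], spec: SortSpec) -> List[Dict[str, Any]]:
    """Return rows ordered by the spec, using the natural order of each value.

    Python's sort is stable, so sorting by the last field first and the
    first field last yields a multi-column order with per-column direction.
    """
    result = list(rows)
    for field in reversed(spec.fields):
        result.sort(key=lambda row, c=field.column: row[c],
                    reverse=not field.ascending)
    return result


# Example: sort by day ascending, then by id descending within each day.
spec = SortSpec(spec_id=1, fields=(SortField("day"), SortField("id", ascending=False)))
rows = [{"day": 1, "id": 2}, {"day": 2, "id": 1}, {"day": 1, "id": 5}]
ordered = sort_rows(rows, spec)
```

Under Option #1 this structure would have to be flattened into property
strings, while Options #2 and #3 differ mainly in whether it lives inside
PartitionSpec or next to it.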
4) Which sort orders should Iceberg support?
- I think we have to be flexible and support adding more sort orders
later. In addition to what Owen said, we can add sorting based on
multi-dimensional space-filling curves in the future.
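
As an illustration of what a space-filling curve order could mean, here is a
small sketch of a Z-order (Morton) value for two non-negative integer
columns. Z-order is just one example of such a curve, picked by me for
illustration; the function name and the 16-bit default are assumptions, and
this is not an existing Iceberg feature.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    Inputs are assumed to be non-negative and to fit in `bits` bits.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting points by their Z-value keeps points that are close in both
# dimensions close together in the file, unlike a plain (x, y) sort.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
z_sorted = sorted(points, key=lambda p: z_value(p[0], p[1]))
```

A sort spec that wants to cover cases like this needs to describe an order
over several columns at once, not only a per-column direction.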
What do you think?
Thanks,
Anton
> On 1 Jul 2019, at 18:06, Owen O'Malley <[email protected]> wrote:
>
> My thought is that, just as Iceberg has to define partitioning and bucketing,
> it has to define a canonical sort order. In particular, we can’t afford to have
> Spark, Presto, and Hive writing files in different orders. I believe the
> right approach is to define a sort order as a series of columns, where each
> column is either ascending or descending, and to define the natural sort
> order for each type.
>
> The hard bit will be if we need to support non-natural sorts of strings, for
> example case-insensitive sorts or the different collations that databases
> support. I’d hope that we could start with the default of UTF-8 byte ordering
> and expand as needed. If you are curious what the different collations look
> like, see
> https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database
>
> .. Owen
>
>> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <[email protected]> wrote:
>>
>> Hey folks,
>>
>> Iceberg users are advised not only to partition their data but also to sort
>> within partitions by columns in predicates in order to get the best
>> performance. Right now, this process is mostly manual and performed by users
>> before writing.
>> I am wondering if we should extend Iceberg metadata so that query engines
>> can do this automatically in the future. We already have `sortColumns` in
>> DataFile but they are not used.
>> Do we need a notion of sort columns in TableMetadata?
>> Spark’s sort spec is tightly coupled with bucketing and cannot be used
>> alone. However, it seems reasonable to have partitioned and sorted tables
>> without bucketing. How do we see this in Iceberg?
>> If we decide to have a sort spec in the metadata, do we want to make it part
>> of PartitionSpec or keep it separate?
>> Thanks,
>> Anton
>>
>