Re: Hive table compatibility for Iceberg readers

Ryan Blue Fri, 11 Feb 2022 08:48:00 -0800

Sounds great. Thanks for the update! That PR is on my list to take a look
at, but I still recommend starting with the spec changes. For example, how
should default values be stored in Iceberg metadata for each type?
Currently, the spec changes just mention defaults without going into detail
about how they are tracked and what rules there are about them.


On Wed, Feb 9, 2022 at 6:32 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:
>
> * Continuing the discussion on the default value PR (already ongoing [1]).
> * Filing the union type conversion PR (ETA end of next week).
> * Moving the listing-based Hive table scan using Iceberg to a separate repo
> (likely open source). For this, I expect we will need to introduce some
> extension points in Iceberg, such as making some classes SPI. I hope that
> the community is okay with that.
>
> By the way, Owen and I synced on the Hive casing behavior, and it is a bit
> more involved: Hive lowers the schema case for all fields (including nested
> fields) in the Avro case, but only lowers top-level field case and
> preserves inner field case for other formats (we experimented with ORC and
> Text). Hope this clarifies the confusion.
>
> [1] https://github.com/apache/iceberg/pull/2496
>
> Thanks,
> Walaa.
>
>
>
> On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Walaa, thanks for this list. I think most of these are definitely useful.
>> I think the best one to focus on first is the default values, since those
>> will make Iceberg tables behave more like standard SQL tables, which is the
>> goal.
>>
>> I'm really curious to learn more about #1, but I don't think that I have
>> enough detail to know whether it is something that fits in the Iceberg
>> project. At Netflix, we had an alternative implementation of Hive and Spark
>> tables (Spark tables are slightly different) that we similarly used. But we
>> didn't write to both at the same time.
>>
>> For the others, I'm interested in hearing what other people in the
>> community find valuable. I don't think I would use #2 or #3, for example.
>> That's because we already support a flag for case insensitive column
>> resolution that is well supported throughout Iceberg. If you wanted to use
>> alternative names, then I'd probably recommend just turning that on...
>> although that may not be an option depending on how you're working with a
>> table. It would work in Spark, though. This may be a better feature for
>> your system that is built on Iceberg.
>>
>> Reading unions as structs has come up a couple of times, so it seems like
>> people will want it. I think someone attempted to add this support in the
>> past, but ran into issues because the spec is clear that these are NOT
>> Iceberg files. There is no guarantee that other implementations will read
>> them and Iceberg cannot write them in this form. I'm fairly confident that
>> not allowing unions to be written is a good choice, but I would support
>> being able to read them.
>>
>> Ryan
>>
>> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read Hive
>>>> tables from Spark, the returned schema is lowercase since Hive stores all
>>>> metadata in lowercase mode. If users move to Iceberg, such readers could
>>>> break once Iceberg returns proper case schema. This feature is to add
>>>> lowercasing for backward compatibility with existing scripts. This feature
>>>> is added as an option and is not enabled by default.
>>>>
>>>
>>> This isn't quite correct. Hive lowercases top-level columns. It does not
>>> lowercase field names inside structs.
>>>
>>>
>>>> *3. Hive table proper casing:* conversely, we leverage the Avro schema
>>>> to supplement the lower case Hive schema when reading Hive tables. This is
>>>> useful if someone wants to still get proper cased schemas while still in
>>>> the Hive mode (to be forward-compatible with Iceberg). The same flag used
>>>> in (2) is used here.
>>>>
>>>
>>> Are there users of Avro schemas in Hive outside of LinkedIn? I've never
>>> seen it used. I don't think you should tie #2 and #3 together.
>>>
>>> Supporting default values and union types are useful extensions.
>>>
>>> .. Owen
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
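
For reference, the Hive casing behavior described in the quoted message (all fields lowercased for Avro, only top-level fields lowercased for ORC and Text) can be sketched roughly as follows. This is an illustrative model only, not Hive or Iceberg code: the dict-based schema representation and the function names `lower_all`, `lower_top_level`, and `hive_visible_schema` are assumptions made up for this example, and lists and maps are not handled.

```python
from typing import Any, Dict

# Hypothetical schema model (an assumption for this sketch): a struct is a
# dict mapping field name -> type, where a type is either a string naming a
# primitive or a nested dict for a struct.
Schema = Dict[str, Any]


def lower_all(schema: Schema) -> Schema:
    """Lowercase every field name, including fields nested in structs."""
    return {
        name.lower(): lower_all(ftype) if isinstance(ftype, dict) else ftype
        for name, ftype in schema.items()
    }


def lower_top_level(schema: Schema) -> Schema:
    """Lowercase only top-level field names; nested names keep their case."""
    return {name.lower(): ftype for name, ftype in schema.items()}


def hive_visible_schema(schema: Schema, file_format: str) -> Schema:
    """Model the casing reported in the thread for a given file format.

    Avro: all fields are lowercased. Other formats (ORC and Text were the
    ones tested in the thread): only top-level fields are lowercased.
    """
    if file_format.lower() == "avro":
        return lower_all(schema)
    return lower_top_level(schema)


example = {"UserId": "long", "Address": {"ZipCode": "string"}}
avro_view = hive_visible_schema(example, "avro")
orc_view = hive_visible_schema(example, "orc")
```

With the example schema above, `avro_view` has the nested field `zipcode`, while `orc_view` keeps it as `ZipCode` under the lowercased top-level name `address`.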

-- 
Ryan Blue
Tabular
