Re: Iceberg - Hive schema synchronization

2020-11-24 Thread Vivekanand Vellanki
Some of the conversions we are seeing are:
- Decimal to Decimal; not just limited to increasing precision as with Iceberg
- varchar to string
- numeric type to numeric type (float to Decimal, double to Decimal, Decimal to double, etc.)
- numeric type to string

Re: Bucket partitioning in addition to regular partitioning

2020-11-24 Thread Ryan Blue
Data needs to be clustered so the Iceberg writer receives data for one table partition at a time. If it isn't clustered, Iceberg would need to either keep multiple files open (for all unfinished partitions) or would need to close and open new files for the same partition resulting in small files.
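To make the small-files point concrete, here is a minimal Python sketch. It is a simplified model and not Iceberg's actual writer code: it assumes a writer that keeps only one file open and starts a new file whenever the partition key of the incoming row changes. The function name `count_files_written` and the sample partition keys are hypothetical.

```python
from typing import Hashable, Iterable


def count_files_written(partition_keys: Iterable[Hashable]) -> int:
    """Count files produced by a writer that keeps only one file open.

    Simplified model (an assumption, not Iceberg's real writer): the
    writer closes the current file and opens a new one whenever the
    partition key of the incoming row differs from the previous row's.
    """
    files = 0
    current = object()  # sentinel that never equals a real key
    for key in partition_keys:
        if key != current:
            files += 1
            current = key
    return files


# Hypothetical rows seen by one task, as (category, bucket) partition keys.
unclustered = [("a", 0), ("b", 1), ("a", 0), ("b", 1), ("a", 0), ("b", 1)]
clustered = sorted(unclustered)

print(count_files_written(unclustered))  # 6 files for only 2 partitions
print(count_files_written(clustered))    # 2 files, one per partition
```

Under this model, sorting the task's rows by partition key before writing is what keeps the file count at one per partition.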

Re: Bucket partitioning in addition to regular partitioning

2020-11-24 Thread Kruger, Scott
By “task receives data clustered by partition”, do you mean that I should repartition using the same columns I order by? For example: `df.repartition(col("category"), col("ts"), expr("iceberg_bucket16(id)")).orderBy(col("category"), col("ts"), expr("iceberg_bucket16(id)"))` …or am I misunders

Re: Iceberg - Hive schema synchronization

2020-11-24 Thread Owen O'Malley
You left the complex types off of your list (struct, map, array, uniontype). All of them have natural mappings in Iceberg, except for uniontype. Interval is supported on output, but not as a column type. Unfortunately, we have some tables with uniontype, so we'll need a solution for how to deal wit

Re: Bucket partitioning in addition to regular partitioning

2020-11-24 Thread Ryan Blue
It should work if you use `ORDER BY category, ts, iceberg_bucket16(id)`. You just need to ensure that each task receives data clustered by partition.

Re: Iceberg - Hive schema synchronization

2020-11-24 Thread Vivekanand Vellanki
One of the challenges we've had is that Hive is more flexible with schema evolution compared to Iceberg. Are you guys also looking at this aspect?

Re: Bucket partitioning in addition to regular partitioning

2020-11-24 Thread Kruger, Scott
I did register the bucket UDF (you can see me using it in the examples), and the docs were helpful to an extent, but the issue is that it only shows how to use bucketing when it’s the only partitioning scheme, not the innermost of a multi-level partitioning scheme. That’s what I’m having trouble

Iceberg - Hive schema synchronization

2020-11-24 Thread Peter Vary
Hi Team,

With Shardul we had a longer discussion yesterday about the schema synchronization between Iceberg and Hive, and we thought that it would be good to ask the opinion of the greater community too.

We can have 2 sources for the schemas:
- Hive table definition / schema
- Iceberg schema

If