Hi,

I created IMPALA-12675 <https://issues.apache.org/jira/browse/IMPALA-12675> about annotating STRINGs with UTF8 by default. The code change should be trivial, but I'm afraid we will need to wait for a new major release for this (users might store binary data in STRING columns, so it would be a breaking change for them). Until then, users can set PARQUET_ANNOTATE_STRINGS_UTF8 for themselves.
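To illustrate why this is breaking (a minimal, plain-Python sketch, not Impala or Parquet code): once a column is annotated as UTF8, readers are entitled to treat its bytes as text, and genuinely binary payloads are not guaranteed to be valid UTF-8:

```python
# Sketch only: why a blanket UTF8 annotation breaks users who stored
# binary data in STRING columns. Arbitrary byte payloads need not be
# valid UTF-8 text, so a UTF8-aware reader can fail on them.
text_payload = "hello".encode("utf-8")   # real text stored in a STRING column
binary_payload = b"\xff\xfe\x00\x01"     # e.g. a serialized blob, not text

print(text_payload.decode("utf-8"))      # decodes fine: hello

try:
    binary_payload.decode("utf-8")       # a UTF8-aware reader would choke here
except UnicodeDecodeError:
    print("not valid UTF-8")             # 0xff is never a legal UTF-8 byte
```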
Approach C: Yeah, if Approach A goes through then we don't really need to bother with this.

Cheers,
Zoltan

On Wed, Jan 3, 2024 at 2:02 PM OpenInx <open...@gmail.com> wrote:

> Thanks Zoltan and Ryan for your feedback.
>
> I think we all agreed on adding an option to promote BINARY to String
> (Approach A) on the Flink/Spark/Hive reader side, so that the historic
> datasets that Impala has already written into Hive can be read correctly.
> Besides that, applying Approach B to future Apache Impala releases also
> sounds reasonable to me; we can create a PR in the apache impala repo at
> the same time as we apply Approach A to the iceberg repo.
>
> About Approach C: I guess those parquet files would need to be rewritten
> entirely even though we only want to change the file metadata, which
> could be costly. So I'm a bit hesitant to choose this approach.
>
> Jiajie and I will try to create two PRs for the two things (A and B),
> one for the apache iceberg repo and another for the apache impala repo.
>
> Best regards.
>
> On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue <b...@tabular.io> wrote:
>
> > Thanks for bringing this up and for finding the cause.
> >
> > I think we should add an option to promote binary to string (Approach
> > A). That sounds pretty reasonable overall. I think it would be great
> > if Impala also produced correct Parquet files, but that's beyond our
> > control and there is, no doubt, a ton of data already in that format.
> >
> > This could also be part of our v3 work, where I think we intend to
> > add binary-to-string type promotion to the format.
> >
> > On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy
> > <borokna...@apache.org> wrote:
> >
> >> Hey Everyone,
> >>
> >> Thank you for raising this issue and reaching out to the Impala
> >> community.
> >>
> >> Let me clarify that the problem only happens when there is a legacy
> >> Hive table written by Impala, which is then converted to Iceberg.
> >> When Impala writes into an Iceberg table, there is no problem with
> >> interoperability.
> >>
> >> The root cause is that Impala has only supported the BINARY type
> >> recently, and the STRING type could serve as a workaround to store
> >> binary data. This is why Impala does not add the UTF8 annotation to
> >> STRING columns in legacy Hive tables. (Again, for Iceberg tables
> >> Impala adds the UTF8 annotation.)
> >>
> >> Later, when the table is converted to Iceberg, the migration process
> >> does not rewrite the data files. Neither Spark nor Impala's own
> >> ALTER TABLE CONVERT TO statement does.
> >>
> >> My comments on the proposed solutions, plus another one (Approach C):
> >>
> >> Approach A (promote BINARY to UTF8 during reads): I think it makes
> >> sense. The Parquet metadata also stores information about the
> >> writer, so if we want this to be a very specific fix, we can check
> >> whether the writer was indeed Impala.
> >>
> >> Approach B (Impala should annotate STRING columns with UTF8): This
> >> can probably only be fixed in a new major version of Impala. Impala
> >> supports the BINARY type now, so I think it makes sense to limit
> >> the STRING type to actual string data. This approach does not fix
> >> already-written files, as you already pointed out.
> >>
> >> Approach C: The migration job could copy the data files but rewrite
> >> the file metadata, if needed. This makes migration slower, but it's
> >> probably still faster than a CREATE TABLE AS SELECT.
> >>
> >> On the Impala side we surely need to update our docs about
> >> migration and interoperability.
> >>
> >> Cheers,
> >> Zoltan
> >>
> >> OpenInx <open...@gmail.com> wrote (on Tue, Dec 26, 2023, 7:40):
> >>
> >>> Hi dev,
> >>>
> >>> Sensordata [1] encountered an interesting Apache Impala & Iceberg
> >>> bug in their real customer production environment.
> >>> Their customers use Apache Impala to create a large number of
> >>> Apache Hive tables in HMS, and have ingested a PB-scale dataset
> >>> into those Hive tables (which were originally written by Apache
> >>> Impala). Recently, their customers migrated those Hive tables to
> >>> Apache Iceberg tables, but then failed to query the huge dataset
> >>> in the Iceberg table format using Apache Spark.
> >>>
> >>> Jiajie Feng (from Sensordata) and I wrote a simple demo to
> >>> demonstrate this issue; for more details please see:
> >>>
> >>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
> >>>
> >>> We'd like to hear feedback and suggestions from both the Impala
> >>> and Iceberg communities. I think both Jiajie and I would like to
> >>> fix this issue once we have an aligned solution.
> >>>
> >>> Best Regards.
> >>>
> >>> 1. https://www.sensorsdata.com/en/
> >>
> > --
> > Ryan Blue
> > Tabular