Thanks for bringing this up and for finding the cause. I think we should add an option to promote binary to string (Approach A). That sounds pretty reasonable overall. I think it would be great if Impala also produced correct Parquet files, but that's beyond our control and there's, no doubt, a ton of data already in that format.
This could also be part of our v3 work, where I think we intend to add binary-to-string type promotion to the format.

On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy <borokna...@apache.org> wrote:

> Hey Everyone,
>
> Thank you for raising this issue and reaching out to the Impala community.
>
> Let me clarify that the problem only happens when there is a legacy Hive
> table written by Impala, which is then converted to Iceberg. When Impala
> writes into an Iceberg table there is no problem with interoperability.
>
> The root cause is that Impala has only supported the BINARY type recently,
> and the STRING type could serve as a workaround to store binary data. This
> is why Impala does not add the UTF8 annotation for STRING columns in legacy
> Hive tables. (Again, for Iceberg tables Impala adds the UTF8 annotation.)
>
> Later, when the table is converted to Iceberg, the migration process does
> not rewrite the data files: neither Spark nor Impala's own ALTER TABLE
> CONVERT TO statement does.
>
> My comments on the proposed solutions, plus another one (Approach C):
>
> Approach A (promote BINARY to UTF8 during reads): I think it makes sense.
> The Parquet metadata also stores information about the writer, so if we
> want this to be a very specific fix, we can check whether the writer was
> indeed Impala.
>
> Approach B (Impala should annotate STRING columns with UTF8): This can
> probably only be fixed in a new major version of Impala. Impala supports
> the BINARY type now, so I think it makes sense to limit the STRING type to
> actual string data. This approach does not fix already-written files, as
> you already pointed out.
>
> Approach C: The migration job could copy data files but rewrite file
> metadata where needed. This makes migration slower, but it's probably
> still faster than a CREATE TABLE AS SELECT.
>
> On the Impala side we surely need to update our docs about migration and
> interoperability.
> Cheers,
> Zoltan
>
> On Tue, Dec 26, 2023 at 7:40, OpenInx <open...@gmail.com> wrote:
>
>> Hi dev,
>>
>> Sensordata [1] encountered an interesting Apache Impala & Iceberg bug in
>> a real customer production environment. Their customers use Apache Impala
>> to create a large number of Apache Hive tables in HMS, and ingested a
>> PB-level dataset into those Hive tables (originally written by Apache
>> Impala). Recently, their customers migrated those Hive tables to Apache
>> Iceberg tables, but then failed to query the dataset in the Iceberg table
>> format using Apache Spark.
>>
>> Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
>> this issue; for more details please see:
>>
>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>>
>> We'd like to hear feedback and suggestions from both the Impala and
>> Iceberg communities. I think both Jiajie and I would like to fix this
>> issue once we have an aligned solution.
>>
>> Best Regards.
>>
>> 1. https://www.sensorsdata.com/en/

--
Ryan Blue
Tabular