Hi,

I created IMPALA-12675 <https://issues.apache.org/jira/browse/IMPALA-12675> about annotating STRINGs with UTF8 by default. The code change should be trivial, but I'm afraid we will need to wait for a new major release for this (users might store binary data in STRING columns, so it would be a breaking change for them). Until then, users can set PARQUET_ANNOTATE_STRINGS_UTF8 for themselves.
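To illustrate why this is breaking (a minimal, plain-Python sketch, not Impala or Parquet code): once a column is annotated as UTF8, readers are entitled to treat its bytes as text, and genuinely binary payloads are not guaranteed to be valid UTF-8:

```python
# Sketch only: why a blanket UTF8 annotation breaks users who stored
# binary data in STRING columns. Arbitrary byte payloads need not be
# valid UTF-8 text, so a UTF8-aware reader can fail on them.
text_payload = "hello".encode("utf-8")   # real text stored in a STRING column
binary_payload = b"\xff\xfe\x00\x01"     # e.g. a serialized blob, not text

print(text_payload.decode("utf-8"))      # decodes fine: hello

try:
    binary_payload.decode("utf-8")       # a UTF8-aware reader would choke here
except UnicodeDecodeError:
    print("not valid UTF-8")             # 0xff is never a legal UTF-8 byte
```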
Approach C: Yeah, if Approach A goes through then we don't really need to bother with this.

Cheers,
Zoltan

On Wed, Jan 3, 2024 at 2:02 PM OpenInx <open...@gmail.com> wrote:

> Thanks Zoltan and Ryan for your feedback.
>
> I think we all agreed on adding an option to promote BINARY to String
> (Approach A) on the Flink/Spark/Hive reader side, so that the historic
> datasets that Impala has already written into Hive can be read correctly.
> Besides that, applying Approach B to future Apache Impala releases also
> sounds reasonable to me; we can create a PR in the apache impala repo at
> the same time as we apply Approach A to the iceberg repo.
>
> About Approach C: I guess those parquet files would need to be rewritten
> entirely even though we only want to change the file metadata, which
> could be costly. So I'm a bit hesitant to choose this approach.
>
> Jiajie and I will try to create two PRs for the two things (A and B),
> one for the apache iceberg repo and another for the apache impala repo.
>
> Best regards.
>
> On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue <b...@tabular.io> wrote:
>
> > Thanks for bringing this up and for finding the cause.
> >
> > I think we should add an option to promote binary to string (Approach
> > A). That sounds pretty reasonable overall. I think it would be great
> > if Impala also produced correct Parquet files, but that's beyond our
> > control and there is, no doubt, a ton of data already in that format.
> >
> > This could also be part of our v3 work, where I think we intend to
> > add binary-to-string type promotion to the format.
> >
> > On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy
> > <borokna...@apache.org> wrote:
> >
> >> Hey Everyone,
> >>
> >> Thank you for raising this issue and reaching out to the Impala
> >> community.
> >>
> >> Let me clarify that the problem only happens when there is a legacy
> >> Hive table written by Impala, which is then converted to Iceberg.
> >> When Impala writes into an Iceberg table, there is no problem with
> >> interoperability.
> >>
> >> The root cause is that Impala has only supported the BINARY type
> >> recently, and the STRING type could serve as a workaround to store
> >> binary data. This is why Impala does not add the UTF8 annotation to
> >> STRING columns in legacy Hive tables. (Again, for Iceberg tables
> >> Impala adds the UTF8 annotation.)
> >>
> >> Later, when the table is converted to Iceberg, the migration process
> >> does not rewrite the data files. Neither Spark nor Impala's own
> >> ALTER TABLE CONVERT TO statement does.
> >>
> >> My comments on the proposed solutions, plus another one (Approach C):
> >>
> >> Approach A (promote BINARY to UTF8 during reads): I think it makes
> >> sense. The Parquet metadata also stores information about the
> >> writer, so if we want this to be a very specific fix, we can check
> >> whether the writer was indeed Impala.
> >>
> >> Approach B (Impala should annotate STRING columns with UTF8): This
> >> can probably only be fixed in a new major version of Impala. Impala
> >> supports the BINARY type now, so I think it makes sense to limit
> >> the STRING type to actual string data. This approach does not fix
> >> already-written files, as you already pointed out.
> >>
> >> Approach C: The migration job could copy the data files but rewrite
> >> the file metadata, if needed. This makes migration slower, but it's
> >> probably still faster than a CREATE TABLE AS SELECT.
> >>
> >> On the Impala side we surely need to update our docs about
> >> migration and interoperability.
> >>
> >> Cheers,
> >> Zoltan
> >>
> >> OpenInx <open...@gmail.com> wrote (on Tue, Dec 26, 2023, 7:40):
> >>
> >>> Hi dev,
> >>>
> >>> Sensordata [1] encountered an interesting Apache Impala & Iceberg
> >>> bug in their real customer production environment.
> >>> Their customers use Apache Impala to create a large number of
> >>> Apache Hive tables in HMS, and have ingested a PB-scale dataset
> >>> into those Hive tables (which were originally written by Apache
> >>> Impala). Recently, their customers migrated those Hive tables to
> >>> Apache Iceberg tables, but then failed to query the huge dataset
> >>> in the Iceberg table format using Apache Spark.
> >>>
> >>> Jiajie Feng (from Sensordata) and I wrote a simple demo to
> >>> demonstrate this issue; for more details please see:
> >>>
> >>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
> >>>
> >>> We'd like to hear feedback and suggestions from both the Impala
> >>> and Iceberg communities. I think both Jiajie and I would like to
> >>> fix this issue once we have an aligned solution.
> >>>
> >>> Best Regards.
> >>>
> >>> 1. https://www.sensorsdata.com/en/
> >>
> > --
> > Ryan Blue
> > Tabular