Re: Spark cannot read iceberg tables which were originally written by Impala

OpenInx Wed, 03 Jan 2024 05:02:21 -0800

Thanks Zoltan and Ryan for your feedback.

I think we all agree on adding an option to promote BINARY to STRING
(Approach A) on the Flink/Spark/Hive reader side, so that the historical
datasets written by Impala into Hive tables can be read correctly. Besides
that, applying Approach B to future Apache Impala releases also sounds
reasonable to me. I think we can open a PR in the Apache Impala repo at the
same time as we apply Approach A in the Iceberg repo.
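
To make the reader-side check for Approach A concrete, here is a rough
sketch in Python. It is not Iceberg code: ParquetColumn,
should_promote_to_string and decode_value are hypothetical names, and the
"impala" prefix of the Parquet created_by field is an assumption that
should be verified against real files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical, simplified view of one column in a Parquet file schema."""
    physical_type: str         # e.g. "BYTE_ARRAY", "INT32"
    annotation: Optional[str]  # e.g. "UTF8", or None when unannotated


def should_promote_to_string(column: ParquetColumn,
                             expected_type: str,
                             created_by: Optional[str],
                             impala_only: bool = True) -> bool:
    """Decide whether an unannotated BYTE_ARRAY column is read as a string.

    expected_type is the type the table schema expects for this column.
    With impala_only=True the promotion applies only when the file's
    created_by string starts with "impala" (assumed prefix, not verified).
    """
    if expected_type != "string":
        return False
    if column.physical_type != "BYTE_ARRAY" or column.annotation is not None:
        return False
    if not impala_only:
        return True
    return created_by is not None and created_by.lower().startswith("impala")


def decode_value(raw: bytes, promote: bool):
    """Return a str when promotion applies, otherwise the raw bytes."""
    return raw.decode("utf-8") if promote else raw
```

The impala_only flag corresponds to Zoltan's suggestion of checking the
writer; turning it off would promote every unannotated BYTE_ARRAY column
that the table schema expects to be a string.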


About Approach C: I guess those Parquet files would still need to be
completely rewritten even though we are only trying to change the file
metadata, which may be costly. So I'm a bit hesitant to choose this approach.
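
For context on that cost: a plaintext Parquet file keeps its schema
(including the UTF8 annotation) in a footer at the end of the file,
followed by a 4-byte little-endian footer length and the "PAR1" magic. The
sketch below only locates that footer; footer_span is a hypothetical
helper, not an existing API.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def footer_span(data: bytes) -> Tuple[int, int]:
    """Return (footer_start, footer_length) for the bytes of a Parquet file.

    Layout assumed: MAGIC | column data | footer | 4-byte LE length | MAGIC.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a plaintext-footer Parquet file")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len
```

Everything before footer_start is column data that would not change, but
on immutable storage a new footer still means writing a new object that
carries all of those bytes again.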

Jiajie and I will try to create two PRs for the two approaches (A and B):
one for the Apache Iceberg repo and another for the Apache Impala repo.

Best regards.

On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue <[email protected]> wrote:

> Thanks for bringing this up and for finding the cause.
>
> I think we should add an option to promote binary to string (Approach A).
> That sounds pretty reasonable overall. I think it would be great if Impala
> also produced correct Parquet files, but that's beyond our control and
> there's, no doubt, a ton of data already in that format.
>
> This could also be part of our v3 work, where I think we intend to add
> binary to string type promotion to the format.
>
> On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy <[email protected]>
> wrote:
>
>> Hey Everyone,
>>
>> Thank you for raising this issue and reaching out to the Impala community.
>>
>> Let me clarify that the problem only happens when there is a legacy Hive
>> table written by Impala, which is then converted to Iceberg. When Impala
>> writes into an Iceberg table there is no problem with interoperability.
>>
>> The root cause is that Impala has only recently added support for the
>> BINARY type. Before that, the STRING type served as a workaround for
>> storing binary data. This is why Impala does not add the UTF8 annotation
>> for STRING columns in legacy Hive tables. (Again, for Iceberg tables
>> Impala adds the UTF8 annotation.)
>>
>> Later, when the table is converted to Iceberg, the migration process does
>> not rewrite the data files. Neither Spark nor Impala's own ALTER TABLE
>> CONVERT TO statement does.
>>
>> My comments on the proposed solutions, plus one more option
>> (Approach C):
>>
>> Approach A (promote BINARY to UTF8 during reads): I think it makes sense.
>> The Parquet metadata also stores information about the writer, so if we
>> want this to be a very specific fix, we can check if the writer was indeed
>> Impala.
>>
>> Approach B (Impala should annotate STRING columns with UTF8): This
>> probably can only be fixed in a new major version of Impala. Impala
>> supports the BINARY type now, so I think it makes sense to limit the STRING
>> type to actual string data. This approach does not fix already written
>> files, as you already pointed out.
>>
>> Approach C: The migration job could copy data files but rewrite file
>> metadata, if needed. This makes migration slower, but it's probably still
>> faster than a CREATE TABLE AS SELECT.
>>
>> On the Impala side we surely need to update our docs on migration and
>> interoperability.
>>
>> Cheers,
>>    Zoltan
>>
>> On Tue, Dec 26, 2023 at 7:40, OpenInx <[email protected]> wrote:
>>
>>> Hi dev
>>>
>>> Sensordata [1] has encountered an interesting Apache Impala & Iceberg bug
>>> in a real customer production environment.
>>> Their customers used Apache Impala to create a large number of Apache Hive
>>> tables in HMS and ingested PB-level datasets into those Hive tables
>>> (which were originally written by Apache Impala). Recently, their
>>> customers migrated those Hive tables to Apache Iceberg tables, but failed
>>> to query the huge datasets in Iceberg table format using Apache Spark.
>>>
>>> Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
>>> this issue; for more details please see below:
>>>
>>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>>>
>>> We'd like to hear feedback and suggestions from both the Impala and
>>> Iceberg communities. Both Jiajie and I would like to fix this issue
>>> once we have an aligned solution.
>>>
>>> Best Regards.
>>>
>>> 1. https://www.sensorsdata.com/en/
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
