Whoops, forgot to CC the mailing list.

Ah, the metrics code _does_ allow you to use a name mapping if you specify one 
in the call to ParquetUtil.fileMetrics, which is what we did. If you don’t, 
though, the mapping property from the table (if present) doesn’t appear to be 
used automatically.
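
For reference, this is roughly what our metrics call looks like: read the mapping from the table property ourselves and pass it to ParquetUtil.fileMetrics explicitly (the file path and `conf` here are placeholders, not our actual code):

```java
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.hadoop.HadoopInputFile;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;
import org.apache.iceberg.parquet.ParquetUtil;

// Load the mapping stored in the table property; it is NOT applied
// automatically, so we pass it to fileMetrics ourselves.
String mappingJson = table.properties().get("schema.name-mapping.default");
NameMapping mapping = NameMappingParser.fromJson(mappingJson);

// Placeholder input file; in practice this is each parquet file we wrote.
InputFile file = HadoopInputFile.fromLocation("/path/to/file.parquet", conf);
Metrics metrics = ParquetUtil.fileMetrics(file, MetricsConfig.getDefault(), mapping);
```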

To be clear, for semi-complicated reasons that probably don’t bear going into 
(unless you really want to know), we aren’t writing Iceberg data directly (i.e. 
using DataFrameWriter.format(“iceberg”)), but rather writing plain parquet data 
and then adding it to the Iceberg table post hoc. So I don’t think there’s a 
problem with the regular Iceberg write path via Spark.
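
In case it’s useful, the post-hoc import step looks roughly like this sketch (the path and `fileSize` are placeholders; `metrics` is what we get back from ParquetUtil.fileMetrics):

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;

// Build a DataFile entry for a parquet file that was written outside of
// Iceberg, then append it to the table in a single commit.
DataFile dataFile = DataFiles.builder(table.spec())
    .withPath("/path/to/file.parquet")  // placeholder path
    .withFileSizeInBytes(fileSize)
    .withMetrics(metrics)               // from ParquetUtil.fileMetrics
    .build();

table.newAppend()
    .appendFile(dataFile)
    .commit();
```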


From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, November 3, 2020 at 3:00 PM
To: "Kruger, Scott" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Migrating plain parquet tables to iceberg

This message is from an external sender.
I thought that we had already updated the metrics code to use a name mapping. 
Sorry, I was mistaken. Could you post a PR with your fix?

Glad it's working!

On Tue, Nov 3, 2020 at 11:51 AM Kruger, Scott <[email protected]> wrote:
Awesome, this is working for us, although we had to modify our code to also use 
the NameMapping when grabbing parquet file metrics. Thanks!

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Date: Friday, October 30, 2020 at 5:55 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Migrating plain parquet tables to iceberg


For existing tables that use name-based column resolution, you can add a 
name-to-id mapping that is applied when reading files with no field IDs. There 
is a utility to generate the name mapping from an existing schema (using the 
current names), and then you just need to store it in a table property:

NameMapping mapping = MappingUtil.create(table.schema());

table.updateProperties()
    .set("schema.name-mapping.default", NameMappingParser.toJson(mapping))
    .commit();

I think there is also an issue to add a name mapping by default when importing 
data.

On Fri, Oct 30, 2020 at 3:46 PM Kruger, Scott <[email protected]> 
wrote:
I’m looking to migrate a partitioned parquet table to Iceberg. The issue 
I’ve run into is that the column order for the data varies wildly, which isn’t 
a problem for us normally (we just set mergeSchemas=true when reading), but 
presents a problem with Iceberg because the iceberg.schema field isn’t set in 
the parquet footer. Is there any way to migrate this data over without 
rewriting the entire dataset?


--
Ryan Blue
Software Engineer
Netflix


