Re: Enforcing Dataset Schema before pushing to HUDI

Vinoth Chandar Sat, 12 Sep 2020 08:32:19 -0700

Hi,

IIUC, you want to be able to pass in a schema on write? AFAIK, Spark
Datasource V1 at least does not allow passing in a schema on the write path.
Hudi will just use the schema of the df you pass in.


Just throwing it out there: can you add a step that drops all unnecessary
columns before issuing the write, i.e.
df.map(funcToDropExtraCols()).write.format("hudi")
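
To make the "drop extra columns" step a bit more concrete, here is a rough
sketch, not a Hudi or Spark feature. SchemaGuard and its methods are
hypothetical names made up for illustration. It only compares column names
(case-sensitively) and ignores types and nullability. The inputs are assumed
to come from Dataset.columns() and StructType.fieldNames(); the Spark calls
are left as comments so the helper itself compiles with plain Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Hudi or Spark): compares the column names
// of a dataset against the field names of the schema you expect to write.
final class SchemaGuard {
    private SchemaGuard() {}

    /** Columns present in the dataset but absent from the expected schema. */
    static List<String> extraColumns(String[] actual, String[] expected) {
        Set<String> allowed = new HashSet<>(Arrays.asList(expected));
        List<String> extras = new ArrayList<>();
        for (String column : actual) {
            if (!allowed.contains(column)) {
                extras.add(column);
            }
        }
        return extras;
    }

    /**
     * Columns to select before the write, in expected-schema order.
     * Throws if the dataset lacks a column that the schema requires.
     */
    static String[] columnsToKeep(String[] actual, String[] expected) {
        Set<String> present = new HashSet<>(Arrays.asList(actual));
        for (String column : expected) {
            if (!present.contains(column)) {
                throw new IllegalStateException(
                        "Expected column missing from dataset: " + column);
            }
        }
        return expected.clone();
    }
}

// Assumed usage with Spark (needs spark-sql on the classpath; not compiled here):
//   String[] keep = SchemaGuard.columnsToKeep(
//       dataset.columns(), expectedSchema.fieldNames());
//   Dataset<Row> trimmed = dataset.select(
//       Arrays.stream(keep).map(functions::col).toArray(Column[]::new));
//   trimmed.write().format("hudi") ...   // Hudi options and save path omitted
```

Since select() is just a projection on the existing plan, this should avoid
the round trip through rdd() and createDataFrame(). If you would rather fail
the job than silently drop columns, checking that extraColumns(...) is empty
before the write is one option.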

thanks
Vinoth


On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:

> Hi,
> We are working on a Dataset<Row> which goes through a lot of transformations
> and is then pushed to HUDI. Since HUDI supports schema evolution, if I
> add a new column, the write will allow it.
>
> When we write into HUDI using Spark, I don't see any option to validate the DS
> against the schema (StructType) the way we can on read. This can cause
> us to write some unwanted columns, especially in lower envs.
>
> For example,
> READ has an option like spark.read.format("hudi").schema(<STRUCT_TYPE>)
> which validates the schema; however,
> WRITE doesn't have a schema option (spark.write.format("hudi")), so all columns
> go out without validation against the schema.
>
> The workaround that I came up with is to recreate the dataset with the schema,
> but I don't like it as it has an overhead. Is there a better
> option, or am I missing something?
>
>  Dataset<Row> hudiDs = spark.createDataFrame(
>               dataset.select(columnNamesToSelect.stream()
>                       .map(s -> new Column(s)).toArray(Column[]::new)).rdd(),
>               <STRUCT_TYPE>);
>
>
>
