Re: Enforcing Dataset Schema before pushing to HUDI

tanu dua Sat, 12 Sep 2020 12:39:10 -0700

Thanks, Vinoth. Yes, validating it myself is always an option. I just
wanted to confirm whether Spark can do it for me across all my datasets,
and I wonder why a schema option is provided for read but not for write.
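
For reference, a minimal sketch of the validate-it-yourself option, assuming
the check only needs column names. The class and method names (SchemaGuard,
extraColumns, requireNoExtraColumns) are hypothetical and are not part of
Spark or Hudi. In a Spark job the two arrays would presumably come from
dataset.columns() and the expected StructType's fieldNames(). The comparison
is case-sensitive and ignores column types, both of which are simplifying
assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: plain Java, no Spark or Hudi dependency.
class SchemaGuard {

    // Returns the columns that are present in the dataset but absent from
    // the expected schema, in the order they appear in the dataset.
    public static List<String> extraColumns(String[] actualColumns, String[] expectedFields) {
        Set<String> expected = new HashSet<>(Arrays.asList(expectedFields));
        List<String> extras = new ArrayList<>();
        for (String column : actualColumns) {
            if (!expected.contains(column)) { // case-sensitive name match (assumption)
                extras.add(column);
            }
        }
        return extras;
    }

    // Fails fast before the write if any unexpected column is present.
    public static void requireNoExtraColumns(String[] actualColumns, String[] expectedFields) {
        List<String> extras = extraColumns(actualColumns, expectedFields);
        if (!extras.isEmpty()) {
            throw new IllegalArgumentException("Columns not in expected schema: " + extras);
        }
    }
}
```

Calling requireNoExtraColumns just before the write would reject a stray
column instead of silently evolving the table schema. Whether to fail or to
drop the extra columns is a design choice. Failing seems safer in lower envs,
where an unexpected column usually points to a bug upstream.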


On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> IIUC, you want to be able to pass in a schema on write? AFAIK, Spark
> Datasource V1 at least does not allow passing in a schema.
> Hudi writing will just use the schema of the df you pass in.
>
> Just throwing it out there: can you write a step to drop all unnecessary
> columns before issuing the write, i.e.
>
> df.map(funcToDropExtraCols()).write.format("hudi")
>
> Thanks,
> Vinoth
>
> On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
>
> > Hi,
> >
> > We are working on a Dataset<Row> that goes through a lot of
> > transformations and is then pushed to HUDI. Since HUDI supports schema
> > evolution, if I add a new column it will allow it.
> >
> > When we write into HUDI using Spark, I don't see any option to validate
> > the DS against a schema (StructType), as we have on read. This can cause
> > us to write some unwanted columns, especially in lower envs.
> >
> > For example, READ has an option like
> > spark.read.format("hudi").schema(<STRUCT_TYPE>), which validates the
> > schema. However, WRITE doesn't have a schema option
> > (df.write.format("hudi")), so all columns go out without validation
> > against the schema.
> >
> > The workaround that I came up with is to recreate the dataset with the
> > schema, but I don't like it as it has an overhead. Is there a better
> > option, or am I missing something?
> >
> >  Dataset<Row> hudiDs = spark.createDataFrame(
> >      dataset.select(columnNamesToSelect.stream()
> >          .map(s -> new Column(s)).toArray(Column[]::new)).rdd(),
> >      <STRUCT_TYPE>);
> >
