Hi,

Typically, when writing, people use a single schema; that's probably why.
During read, however, you are dealing with files written by different
writers with different schemas, so the ability to pass in a schema is
handy. Hope that makes sense.
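
For example, a minimal sketch of schema-on-read in Java (the field names
and path here are made-up placeholders, not from any real table):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // One agreed-upon shape that every reader projects onto, regardless
    // of which writer (and which writer schema) produced a given file.
    StructType expectedSchema = new StructType()
        .add("id", DataTypes.StringType)
        .add("ts", DataTypes.LongType)
        .add("name", DataTypes.StringType);

    String basePath = "/path/to/hudi_table"; // assumed table location

    Dataset<Row> df = spark.read()
        .format("hudi")
        .schema(expectedSchema)
        .load(basePath);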

On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:

> Thanks Vinoth. Yes, that's always an option for me, to validate it myself. I
> just wanted to confirm whether Spark does it for me for all my datasets, and
> I wonder why they haven't provided it for write but have provided it for
> read.
>
> On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> > Datasource V1 at least does not allow passing in a schema. Hudi writing
> > will just use the schema of the df you pass in.
> >
> > Just throwing it out there: can you add a step to drop all unnecessary
> > columns before issuing the write, i.e.
> >
> > df.map(funcToDropExtraCols()).write.format("hudi")
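> >
> > A minimal sketch of that step in Java (expectedSchema/basePath are
> > assumptions, not a tested recipe; a real Hudi write also needs table,
> > record-key, and precombine options, omitted here):
> >
> >     import java.util.Arrays;
> >     import org.apache.spark.sql.Column;
> >     import org.apache.spark.sql.types.StructType;
> >
> >     StructType expectedSchema = <STRUCT_TYPE>; // same StructType used on read
> >
> >     // Project down to exactly the expected columns; anything extra that
> >     // crept in during transformations is dropped before the write. The
> >     // select also fails fast if an expected column is missing.
> >     Column[] expectedCols = Arrays.stream(expectedSchema.fieldNames())
> >         .map(Column::new)
> >         .toArray(Column[]::new);
> >
> >     dataset.select(expectedCols)
> >         .write()
> >         .format("hudi")
> >         .mode("append")
> >         .save(basePath); // basePath: Hudi table path (assumed)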
> >
> > thanks
> > Vinoth
> >
> > On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > We are working on a Dataset<Row> which goes through a lot of
> > > transformations and is then pushed to Hudi. Since Hudi follows an
> > > evolving schema, if I am going to add a new column, it will allow me to
> > > do so.
> > >
> > > When we write into Hudi using Spark, I don't see any option where the DS
> > > is validated against the schema (StructType) which we have on read,
> > > which can cause us to write some unwanted columns, especially in lower
> > > envs.
> > >
> > > E.g.:
> > > READ has an option like spark.read.format("hudi").schema(<STRUCT_TYPE>),
> > > which validates the schema; however,
> > > WRITE doesn't have a schema option (spark.write.format("hudi")), so all
> > > columns go out without validation against the schema.
> > >
> > > The workaround that I came up with is to recreate the dataset again with
> > > the schema, but I don't like it as it has an overhead. Do we have any
> > > better option, and am I missing something?
> > >
> > >     Dataset<Row> hudiDs = spark.createDataFrame(
> > >         dataset.select(columnNamesToSelect.stream()
> > >                 .map(s -> new Column(s))
> > >                 .toArray(Column[]::new))
> > >             .rdd(),
> > >         <STRUCT_TYPE>);
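> > >
> > > (As a cheaper sketch of the same idea, avoiding the RDD round-trip: one
> > > could just compare the column sets and fail fast instead of rebuilding
> > > the DataFrame; expectedSchema here is assumed to be the read-side
> > > StructType.)
> > >
> > >     import java.util.Arrays;
> > >     import java.util.HashSet;
> > >     import java.util.Set;
> > >
> > >     // Fail fast if the Dataset carries columns the schema doesn't know
> > >     // about, or is missing ones it requires, without rebuilding it.
> > >     Set<String> expected =
> > >         new HashSet<>(Arrays.asList(expectedSchema.fieldNames()));
> > >     Set<String> actual =
> > >         new HashSet<>(Arrays.asList(dataset.columns()));
> > >     if (!expected.equals(actual)) {
> > >         throw new IllegalStateException(
> > >             "Schema mismatch: expected " + expected + " but got " + actual);
> > >     }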
