We could add support for validating the data frame against a schema string
passed to the data source writer. I guess you want the dataframe to also be
converted into the provided schema?
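As a rough illustration of what that validation step might look like, here is a minimal sketch in plain Java. The class and method names (`SchemaCheck`, `extraColumns`) are hypothetical, and plain collections stand in for Spark's `StructType` so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed validation: compare the column names
// of the outgoing dataframe against the field names of the schema the
// caller passed to the writer.
public class SchemaCheck {

    // Returns the columns present in the dataframe but absent from the
    // expected schema; an empty result means the write can proceed.
    public static List<String> extraColumns(List<String> dfColumns,
                                            List<String> schemaFields) {
        Set<String> expected = new LinkedHashSet<>(schemaFields);
        List<String> extras = new ArrayList<>();
        for (String col : dfColumns) {
            if (!expected.contains(col)) {
                extras.add(col);
            }
        }
        return extras;
    }
}
```

With Spark in the picture, `dfColumns` would presumably come from `df.columns()` and `schemaFields` from the passed-in `StructType`'s `fieldNames()`; the writer could then either fail fast on a non-empty result or select only the expected columns, along the lines suggested further down the thread.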

On Tue, Sep 15, 2020 at 9:02 PM tanu dua <[email protected]> wrote:

> Hmm, but our use case has multiple schemas, one for each dataset, as each
> dataset is unique in our case; hence the need to validate the schema for
> each dataset while writing.
>
> On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > Typically people write with a single schema; that's probably why. During
> > read, however, you are dealing with files written by different writers
> > with different schemas. So the ability to pass in a schema is handy.
> > Hope that makes sense.
> >
> > On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:
> >
> > > Thanks Vinoth. Yes, validating it myself is always an option. I just
> > > wanted to confirm whether Spark does it for me for all my datasets, and
> > > I wonder why they have provided it for read but not for write.
> > >
> > > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> > > > Datasource V1 at least does not allow passing in a schema. Hudi
> > > > writing will just use the schema of the df you pass in.
> > > >
> > > > Just throwing it out there: can you add a step to drop all
> > > > unnecessary columns before issuing the write, i.e.
> > > >
> > > > df.map(funcToDropExtraCols()).write.format("hudi")
> > > >
> > > > Thanks,
> > > > Vinoth
> > > >
> > > > On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are working on a Dataset<Row> which goes through a lot of
> > > > > transformations and is then pushed to HUDI. Since HUDI follows an
> > > > > evolving schema, if I am going to add a new column it will allow me
> > > > > to do so.
> > > > >
> > > > > When we write into HUDI using Spark, I don't see any option where
> > > > > the DS is validated against the schema (StructType) that we have on
> > > > > read, which can cause us to write some unwanted columns, especially
> > > > > in lower envs.
> > > > >
> > > > > For e.g.,
> > > > > READ has an option, spark.read.format("hudi").schema(<STRUCT_TYPE>),
> > > > > which validates the schema; however,
> > > > > WRITE doesn't have a schema option (spark.write.format("hudi")), so
> > > > > all columns go out without validation against the schema.
> > > > >
> > > > > The workaround that I came up with is to recreate the dataset again
> > > > > with the schema, but I don't like it as it has an overhead. Do we
> > > > > have any other better option, and am I missing something?
> > > > >
> > > > > Dataset<Row> hudiDs = spark.createDataFrame(
> > > > >     dataset.select(columnNamesToSelect.stream().map(s ->
> > > > >         new Column(s)).toArray(Column[]::new)).rdd(),
> > > > >     <STRUCT_TYPE>);
