Hmm, but our use case has multiple schemas, one for each dataset, as each
dataset is unique in our case; hence the need to validate the schema for
each dataset while writing.
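
Roughly what I have in mind is a small guard per dataset before the write
(just a sketch: conformTo is a hypothetical helper, and the expected
StructType would be looked up per dataset from wherever we keep our
schemas):

    import java.util.Arrays;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.StructType;

    // Fail fast if the dataset carries a column its expected schema does
    // not declare, then select exactly the expected columns, in order.
    // (select() will also fail if an expected column is missing.)
    static Dataset<Row> conformTo(Dataset<Row> ds, StructType expected) {
        for (String field : ds.schema().fieldNames()) {
            if (!Arrays.asList(expected.fieldNames()).contains(field)) {
                throw new IllegalArgumentException(
                        "Unexpected column: " + field);
            }
        }
        return ds.select(Arrays.stream(expected.fieldNames())
                .map(Column::new)
                .toArray(Column[]::new));
    }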

On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> Typically, when writing, people use a single schema; that's probably why.
> During read, however, you are dealing with files written by different
> writes with different schemas. So the ability to pass in a schema is
> handy. Hope that makes sense.
>
> On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:
>
> > Thanks Vinoth. Yes, that's always an option for me, to validate myself.
> > I just wanted to confirm whether Spark does it for me for all my
> > datasets, and I wonder why they haven't provided it for write but have
> > provided it for read.
> >
> > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]>
> > wrote:
> > > Hi,
> > >
> > > IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> > > Datasource V1 at least does not allow for passing in the schema.
> > > Hudi writing will just use the schema of the df you pass in.
> > >
> > > Just throwing it out there: can you write a step to drop all
> > > unnecessary columns before issuing the write, i.e.
> > >
> > > df.map(funcToDropExtraCols()).write.format("hudi")
> > >
> > > thanks
> > > Vinoth
> > >
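
(A concrete version of this suggestion for our Java pipeline could look
like the sketch below. funcToDropExtraCols is the pseudocode from above;
columnNamesToSelect stands in for our allow-list, and hudiWriteOptions and
targetBasePath are placeholders for the usual Hudi write config:)

    // Sketch: prune to the allowed columns, then hand the result to Hudi.
    Dataset<Row> trimmed = dataset.select(columnNamesToSelect.stream()
            .map(Column::new)
            .toArray(Column[]::new));
    trimmed.write().format("hudi")
            .options(hudiWriteOptions)  // assumed Map<String, String>
            .mode("append")
            .save(targetBasePath);      // assumed Hudi table base path
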
> > > On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
>
> > > > Hi,
> > > >
> > > > We are working with a Dataset<Row> which goes through a lot of
> > > > transformations and is then pushed to Hudi. Since Hudi follows an
> > > > evolving schema, if I add a new column it will allow me to do so.
> > > >
> > > > When we write into Hudi using Spark, I don't see any option where
> > > > the Dataset is validated against a schema (StructType), as we have
> > > > on read, which can cause us to write some unwanted columns,
> > > > especially in lower environments.
> > > >
> > > > For example, READ has an option like
> > > > spark.read.format("hudi").schema(<STRUCT_TYPE>), which validates
> > > > the schema; however, WRITE has no schema option
> > > > (spark.write.format("hudi")), so all columns go out without
> > > > validation against the schema.
> > > >
> > > > The workaround that I came up with is to recreate the dataset again
> > > > with the schema, but I don't like it as it has an overhead. Do we
> > > > have any other, better option, and am I missing something?
> > > >
> > > > Dataset<Row> hudiDs = spark.createDataFrame(
> > > >     dataset.select(columnNamesToSelect.stream()
> > > >         .map(s -> new Column(s))
> > > >         .toArray(Column[]::new)).rdd(),
> > > >     <STRUCT_TYPE>);
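
(Follow-up on my own workaround above: if the goal is just to stop unwanted
columns rather than to re-type every field, a plain select on the expected
names avoids the rdd() round trip, and the result can still be checked
against the expected StructType. A sketch, using the same placeholders:)

    // Sketch: prune via select (no RDD round trip), then verify the schema.
    Dataset<Row> pruned = dataset.select(columnNamesToSelect.stream()
            .map(Column::new)
            .toArray(Column[]::new));
    // Note: StructType.equals also compares nullability, which may be
    // stricter than intended for this check.
    if (!pruned.schema().equals(<STRUCT_TYPE>)) {
        throw new IllegalStateException(
                "Schema mismatch before Hudi write: " + pruned.schema());
    }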
