No, we don't want the DataFrame converted to the schema; it just needs to be
validated. The logic I mentioned earlier, repeated below, is the only way I
could find in Spark to validate, but I don't find it very effective, since it
unnecessarily creates a new DataFrame:

 Dataset<Row> hudiDs = spark.createDataFrame(
     dataset.select(columnNamesToSelect.stream()
         .map(s -> new Column(s))
         .toArray(Column[]::new)).rdd(),
     <STRUCT_TYPE>);

On Sat, 19 Sep 2020 at 11:35 PM, Vinoth Chandar <[email protected]> wrote:

> We could add support to validate the data frame against a schema string
> passed to the data source writer. I guess you want the dataframe to also
> be converted into the provided schema?
>
> On Tue, Sep 15, 2020 at 9:02 PM tanu dua <[email protected]> wrote:
>
> > Hmm, but our use case has multiple schemas, one per dataset, as each
> > dataset is unique in our case; hence the need to validate the schema for
> > each dataset while writing.
> >
> > On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Typically, people use a single schema when writing; that's probably why.
> > > During read, however, you are dealing with files written by different
> > > writers with different schemas.
> > > So the ability to pass in a schema is handy. Hope that makes sense.
> > >
> > > On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:
> > >
>
> >
>
> > >
>
> >
>
> > >
>
> >
>
> > >
>
> >
>
> > > > Thanks Vinoth. Yes that’s always an option with me to validate
> myself.
>
> > I
>
> >
>
> > >
>
> >
>
> > > > just wanted to confirm if Spark does it for me for all my datasets
> and
>
> > I
>
> >
>
> > >
>
> >
>
> > > > wonder why they haven’t provided it for write but provided it for
> read.
>
> >
>
> > >
>
> >
>
> > > >
>
> >
>
> > >
>
> >
>
> > > > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]>
>
> >
>
> > > wrote:
>
> >
>
> > >
>
> >
>
> > > >
>
> >
>
> > >
>
> >
>
> > > > > Hi,
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > IIUC, you want to be able to pass in a schema to write? AFAIK,
> Spark
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > Datasource V1 atleast does not allow for passing in the schema.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > Hudi writing will just use the schema for the df you pass in.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > Just throwing it out there. can you write a step to drop all
>
> >
>
> > > unnecessary
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > columns before issuing the write i.e
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > df.map(funcToDropExtraCols()).write.format("hudi")
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > thanks
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > Vinoth
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]>
> wrote:
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > Hi,
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > We are working on Dataset<Row> which goes through lot of
>
> >
>
> > >
>
> >
>
> > > > transformations
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > and then pushed to HUDI. Since HUDI follows evolving schema so
> if I
>
> >
>
> > > am
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > going to add a new column it will allow to do so.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > When we write into HUDI using spark I don't see any option where
> DS
>
> >
>
> > > is
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > validated against schema (StructType) which we have in read which
>
> > can
>
> >
>
> > >
>
> >
>
> > > > > cause
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > us to write some unwanted columns specially in lower envs.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > For eg.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > READ has an option like
>
> >
>
> > > spark.read.format("hudi").schema(<STRUCT_TYPE>)
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > which validates the schema however
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > WRITE doesnt have schema option spark.write.format("hudi") so all
>
> >
>
> > >
>
> >
>
> > > > columns
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > go out without validation against schema.
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > The workaround that I cam up is to recreate the dataset again
> with
>
> >
>
> > >
>
> >
>
> > > > schema
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > but I don't like it as it has an overhead. Do we have any other
>
> >
>
> > > better
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > option and am I missing something ?
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >  Dataset<Row> hudiDs = spark.createDataFrame(
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >               dataset.select(columnNamesToSelect.stream().map(s
> ->
>
> >
>
> > > new
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > > Column(s)).toArray(Column[]::new)).rdd(),
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >               <STRUCT_TYPE>);
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > > >
>
> >
>
> > >
>
> >
>
> > > >
>
> >
>
> > >
>
> >
>
> > >
>
> >
>
> >
>
>
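
Following up on Vinoth's suggestion above to drop all unnecessary columns
before issuing the write: that can likewise be done on the existing dataset,
without rebuilding it via createDataFrame. One possible shape for
funcToDropExtraCols, assuming the same hypothetical expectedSchema as above
(ColumnTrimmer is an illustrative name, not an existing API):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

public final class ColumnTrimmer {

  // Restricts the dataset to the columns declared by the expected schema,
  // dropping anything extra instead of failing the write.
  public static Dataset<Row> trim(Dataset<Row> dataset, StructType expectedSchema) {
    Set<String> expected = new HashSet<>(Arrays.asList(expectedSchema.fieldNames()));
    String[] extras = Arrays.stream(dataset.columns())
        .filter(c -> !expected.contains(c))
        .toArray(String[]::new);
    return dataset.drop(extras);
  }
}

Usage would be ColumnTrimmer.trim(dataset, expectedSchema).write().format("hudi").
Whether to drop extras silently (this sketch) or fail loudly (the validator
earlier in this mail) is a policy choice; for catching unwanted columns in
lower envs, failing loudly seems safer.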
