Just a thought: can't you do the validation on just df.schema?
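A minimal sketch of that idea, kept free of any Spark dependency: in Spark you would compare df.schema().fieldNames() against the expected StructType's field names before issuing the write. The helper below (names are hypothetical) models that check with plain collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper modeling the df.schema idea without Spark on the
// classpath: in Spark, dfFields would come from df.schema().fieldNames()
// and expectedFields from the expected StructType's fieldNames().
class SchemaCheck {
    // Returns the columns present in the dataframe but absent from the
    // expected schema, i.e. the "unwanted" columns that should fail validation.
    static List<String> extraColumns(List<String> dfFields,
                                     List<String> expectedFields) {
        List<String> extra = new ArrayList<>(dfFields);
        extra.removeAll(expectedFields);
        return extra;
    }

    public static void main(String[] args) {
        List<String> dfFields = Arrays.asList("id", "ts", "debug_flag");
        List<String> expected = Arrays.asList("id", "ts");
        // debug_flag is not in the expected schema, so the write should be rejected
        System.out.println(extraColumns(dfFields, expected)); // [debug_flag]
    }
}
```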

On Sat, Sep 19, 2020 at 7:06 PM tanu dua <[email protected]> wrote:

> No, we don’t want the dataframe to be converted to the schema; we just need
> validation. The logic I mentioned earlier is the only way I could find in
> Spark to validate, but I don’t find it very effective, as we are
> unnecessarily creating a dataframe:
>
>     Dataset<Row> hudiDs = spark.createDataFrame(
>             dataset.select(columnNamesToSelect.stream()
>                     .map(s -> new Column(s))
>                     .toArray(Column[]::new)).rdd(),
>             <STRUCT_TYPE>);
>
> On Sat, 19 Sep 2020 at 11:35 PM, Vinoth Chandar <[email protected]> wrote:
>
> > We could add support to validate the data frame against a schema string
> > passed to the data source writer. I guess you want the dataframe to be also
> > converted into the provided schema?
> >
> > On Tue, Sep 15, 2020 at 9:02 PM tanu dua <[email protected]> wrote:
> >
> > > Hmm, but our use case has multiple schemas, one for each dataset, as each
> > > dataset is unique in our case, and hence the need to validate the schema
> > > for each dataset while writing.
> > >
> >
> > > On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Typically, writers use a single schema; that's probably why. During
> > > > read, however, you are dealing with files written by different writers
> > > > with different schemas, so the ability to pass in a schema is handy.
> > > > Hope that makes sense.
> > > >
> > > > On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:
> > > >
> > > > > Thanks Vinoth. Yes, that's always an option for me, to validate it
> > > > > myself. I just wanted to confirm whether Spark does it for me for all
> > > > > my datasets, and I wonder why they haven't provided it for write but
> > > > > have provided it for read.
> > > > >
> > > > > On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > IIUC, you want to be able to pass in a schema on write? AFAIK, Spark
> > > > > > Datasource V1, at least, does not allow passing in the schema. Hudi
> > > > > > writing will just use the schema of the df you pass in.
> > > > > >
> > > > > > Just throwing it out there: can you add a step to drop all
> > > > > > unnecessary columns before issuing the write, i.e.
> > > > > >
> > > > > >     df.map(funcToDropExtraCols()).write.format("hudi")
> > > > > >
> > > > > > thanks,
> > > > > > Vinoth
> > > > > >
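The quoted drop-extra-columns step above can be modeled without Spark; a minimal sketch under the assumption that the expected columns are known up front (in Spark this would correspond to a dataset.select of the expected columns before .write(), and all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "drop extra columns before the write": keep only
// the expected columns, in the expected schema's order. In Spark this maps
// to selecting the expected columns on the dataset before writing, rather
// than a per-row map over the data.
class ColumnPruner {
    // Returns the expected columns that the dataframe actually has, in the
    // expected schema's order; anything else is simply dropped.
    static List<String> prune(List<String> dfFields, List<String> expectedFields) {
        List<String> kept = new ArrayList<>();
        for (String field : expectedFields) {
            if (dfFields.contains(field)) {
                kept.add(field);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> dfFields = Arrays.asList("debug_flag", "ts", "id");
        List<String> expected = Arrays.asList("id", "ts");
        // debug_flag is dropped; id and ts survive in schema order
        System.out.println(prune(dfFields, expected)); // [id, ts]
    }
}
```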
> > > > > > On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We are working on a Dataset<Row> which goes through a lot of
> > > > > > > transformations and is then pushed to HUDI. Since HUDI follows an
> > > > > > > evolving schema, if I am going to add a new column, it will allow
> > > > > > > me to do so.
> > > > > > >
> > > > > > > When we write into HUDI using Spark, I don't see any option where
> > > > > > > the DS is validated against the schema (StructType) which we have
> > > > > > > on read, which can cause us to write some unwanted columns,
> > > > > > > especially in lower envs.
> > > > > > >
> > > > > > > For eg., READ has an option like
> > > > > > > spark.read.format("hudi").schema(<STRUCT_TYPE>), which validates
> > > > > > > the schema; however, WRITE doesn't have a schema option
> > > > > > > (spark.write.format("hudi")), so all columns go out without
> > > > > > > validation against the schema.
> > > > > > >
> > > > > > > The workaround that I came up with is to recreate the dataset
> > > > > > > again with the schema, but I don't like it as it has an overhead.
> > > > > > > Do we have any other better option, and am I missing something?
> > > > > > >
> > > > > > >     Dataset<Row> hudiDs = spark.createDataFrame(
> > > > > > >             dataset.select(columnNamesToSelect.stream()
> > > > > > >                     .map(s -> new Column(s))
> > > > > > >                     .toArray(Column[]::new)).rdd(),
> > > > > > >             <STRUCT_TYPE>);
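A StructType-based validation like the one asked for in the thread would also need to compare field types, not just names; a plain-Java sketch of that part, with type names as hypothetical strings standing in for Spark's DataType values from StructField:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a schema comparison must also catch fields whose name
// matches but whose type differs. Types are plain strings here, standing in
// for the DataType a StructField would carry.
class TypeCheck {
    // Returns a human-readable mismatch report; empty means the shared
    // fields' types line up with the expected schema.
    static List<String> typeMismatches(Map<String, String> actual,
                                       Map<String, String> expected) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actualType = actual.get(e.getKey());
            if (actualType != null && !actualType.equals(e.getValue())) {
                mismatches.add(e.getKey() + ": expected " + e.getValue()
                        + ", got " + actualType);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("id", "string");
        actual.put("ts", "long");
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("id", "string");
        expected.put("ts", "timestamp");
        // ts is long in the dataframe but timestamp in the expected schema
        System.out.println(typeMismatches(actual, expected));
    }
}
```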
