We could add support to validate the data frame against a schema string passed to the data source writer. I guess you also want the dataframe converted into the provided schema?
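For a sense of what such writer-side validation plus conversion could look like, here is a minimal plain-Python sketch. It models a row as a dict and a schema as a field-name-to-type mapping; `validate_and_project` and the sample schema are hypothetical illustrations, not Spark or Hudi API.

```python
# Sketch of writer-side schema validation, modeled with plain dicts
# (field name -> type string) instead of Spark StructType.
# All names here are hypothetical, not Hudi/Spark API.

def validate_and_project(row: dict, schema: dict) -> dict:
    """Fail on missing columns; silently drop columns not in the schema."""
    missing = set(schema) - set(row)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Project the row onto the schema, dropping any extra columns.
    return {name: row[name] for name in schema}

target = {"id": "int", "name": "string"}
row = {"id": 1, "name": "a", "debug_col": True}  # extra column from a lower env
print(validate_and_project(row, target))  # {'id': 1, 'name': 'a'}
```

A real implementation would also cast each value to the schema's type; the sketch only checks presence and drops extras.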
On Tue, Sep 15, 2020 at 9:02 PM tanu dua <[email protected]> wrote:

> Hmm, but our use case has multiple schemas, one for each dataset, as each
> dataset is unique in our case; hence the need to validate the schema for
> each dataset while writing.
>
> On Tue, 15 Sep 2020 at 2:53 AM, Vinoth Chandar <[email protected]> wrote:
>
>> Hi,
>>
>> Typically, people write with a single schema; that's probably why. During
>> read, however, you are dealing with files written by different writes with
>> different schemas, so the ability to pass in a schema is handy. Hope that
>> makes sense.
>>
>> On Sat, Sep 12, 2020 at 12:38 PM tanu dua <[email protected]> wrote:
>>
>>> Thanks Vinoth. Yes, that's always an option with me to validate myself. I
>>> just wanted to confirm if Spark does it for me for all my datasets, and I
>>> wonder why they haven't provided it for write but provided it for read.
>>>
>>> On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
>>>> Datasource V1 at least does not allow passing in the schema. Hudi
>>>> writing will just use the schema of the df you pass in.
>>>>
>>>> Just throwing it out there:
>>>> can you add a step to drop all unnecessary columns before issuing the
>>>> write, i.e. df.map(funcToDropExtraCols()).write.format("hudi")?
>>>>
>>>> Thanks,
>>>> Vinoth
>>>>
>>>> On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are working on a Dataset<Row> which goes through a lot of
>>>>> transformations and is then pushed to Hudi. Since Hudi follows an
>>>>> evolving schema, if I am going to add a new column, it will allow me
>>>>> to do so.
>>>>>
>>>>> When we write into Hudi using Spark, I don't see any option where the
>>>>> DS is validated against a schema (StructType), which we have in read;
>>>>> this can cause us to write some unwanted columns, especially in lower
>>>>> envs.
>>>>>
>>>>> For e.g., READ has an option like
>>>>> spark.read.format("hudi").schema(<STRUCT_TYPE>), which validates the
>>>>> schema; however, WRITE doesn't have a schema option
>>>>> (spark.write.format("hudi")), so all columns go out without validation
>>>>> against a schema.
>>>>>
>>>>> The workaround that I came up with is to recreate the dataset again
>>>>> with the schema, but I don't like it as it has an overhead.
>>>>> Do we have any other better option, and am I missing something?
>>>>>
>>>>> Dataset<Row> hudiDs = spark.createDataFrame(
>>>>>     dataset.select(columnNamesToSelect.stream().map(s ->
>>>>>         new Column(s)).toArray(Column[]::new)).rdd(),
>>>>>     <STRUCT_TYPE>);
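The workaround above (select only the schema's columns, then rebuild the dataset) amounts to a projection step before the write. The same idea can be sketched in plain Python, modeling a DataFrame as a list of dicts; `drop_extra_cols` and the sample data are illustrative, not Spark or Hudi API.

```python
# Sketch of the column-selection workaround in plain Python: keep only
# the columns named in the target schema before "writing". The schema
# and data below are made up for illustration; this is not Spark API.

def drop_extra_cols(rows, columns_to_select):
    """Mimics dataset.select(...) on a list-of-dicts 'DataFrame'."""
    return [{c: row[c] for c in columns_to_select} for row in rows]

rows = [
    {"id": 1, "name": "a", "tmp": "x"},  # "tmp" is an unwanted column
    {"id": 2, "name": "b", "tmp": "y"},
]
clean = drop_extra_cols(rows, ["id", "name"])
print(clean)  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```

In Spark terms this is the `dataset.select(...)` in the Java snippet above; the extra `createDataFrame` call only serves to reattach the exact StructType, which is where the perceived overhead comes from.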
