Thanks Vinoth. Yes, validating the schema myself is always an option. I just wanted to confirm whether Spark can do it for me for all my datasets, and I wonder why a schema option is provided for read but not for write.
On Sat, 12 Sep 2020 at 9:02 PM, Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> IIUC, you want to be able to pass in a schema to write? AFAIK, Spark
> Datasource V1 at least does not allow for passing in the schema.
> Hudi writing will just use the schema of the df you pass in.
>
> Just throwing it out there: can you write a step to drop all unnecessary
> columns before issuing the write, i.e.
> df.map(funcToDropExtraCols()).write.format("hudi")
>
> thanks
> Vinoth
>
> On Wed, Sep 9, 2020 at 6:59 AM Tanuj <[email protected]> wrote:
>
> > Hi,
> > We are working on a Dataset<Row> which goes through a lot of
> > transformations and is then pushed to HUDI. Since HUDI follows an
> > evolving schema, if I add a new column it will allow me to do so.
> >
> > When we write into HUDI using Spark, I don't see any option where the DS
> > is validated against a schema (StructType), as we have on read. This can
> > cause us to write some unwanted columns, especially in lower envs.
> >
> > For example:
> > READ has an option like spark.read.format("hudi").schema(<STRUCT_TYPE>)
> > which validates the schema. However,
> > WRITE doesn't have a schema option, spark.write.format("hudi"), so all
> > columns go out without validation against the schema.
> >
> > The workaround that I came up with is to recreate the dataset with the
> > schema, but I don't like it as it has an overhead. Do we have any better
> > option, or am I missing something?
> >
> > Dataset<Row> hudiDs = spark.createDataFrame(
> >     dataset.select(columnNamesToSelect.stream().map(s -> new Column(s)).toArray(Column[]::new)).rdd(),
> >     <STRUCT_TYPE>);
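
For anyone following the thread, the "validate it myself" route can be fairly cheap if only column names need checking. Below is a minimal sketch in plain Java, not something Spark or Hudi provides: `SchemaGuard`, `extraColumns` and `requireNoExtraColumns` are hypothetical names made up for illustration. It assumes that comparing top-level column names, case-sensitively, is enough; it does not check types, nullability or nested fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Spark or Hudi.
final class SchemaGuard {
    private SchemaGuard() {}

    // Returns the columns present in the dataset but absent from the
    // expected schema, in the order they appear in the dataset.
    static List<String> extraColumns(String[] actual, String[] expected) {
        Set<String> allowed = new LinkedHashSet<>(Arrays.asList(expected));
        List<String> extra = new ArrayList<>();
        for (String name : actual) {
            if (!allowed.contains(name)) {
                extra.add(name);
            }
        }
        return extra;
    }

    // Fails fast before the write if any unexpected column is present.
    static void requireNoExtraColumns(String[] actual, String[] expected) {
        List<String> extra = extraColumns(actual, expected);
        if (!extra.isEmpty()) {
            throw new IllegalArgumentException(
                "Columns not in expected schema: " + extra);
        }
    }
}
```

With Spark, both arrays are already available: `dataset.columns()` gives the actual names and `structType.fieldNames()` gives the expected ones, so a call such as `SchemaGuard.requireNoExtraColumns(dataset.columns(), structType.fieldNames())` could sit just before `dataset.write().format("hudi")`. Because it only compares names on the driver, it avoids the RDD round trip of the `createDataFrame` workaround, at the cost of not validating types.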
