Enforcing Dataset Schema before pushing to HUDI

Tanuj Wed, 09 Sep 2020 06:59:12 -0700

Hi,
We are working with a Dataset<Row> that goes through a lot of transformations 
and is then pushed to HUDI. Since HUDI supports schema evolution, if I add a 
new column, the write will accept it.


When we write to HUDI using Spark, I don't see any option to validate the 
Dataset against the schema (StructType) that we use on read. This can cause us 
to write unwanted columns, especially in lower environments.

For example:
READ has an option like spark.read.format("hudi").schema(<STRUCT_TYPE>), which 
validates the schema. However,
WRITE doesn't have a schema option on dataset.write().format("hudi"), so all 
columns go out without validation against the schema.

The workaround I came up with is to recreate the dataset with the schema, but I 
don't like it because it adds overhead. Is there a better option, or am I 
missing something?

 Dataset<Row> hudiDs = spark.createDataFrame(
              dataset.select(columnNamesToSelect.stream()
                      .map(s -> new Column(s))
                      .toArray(Column[]::new)).rdd(),
              <STRUCT_TYPE>);
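
A cheaper check I am considering, instead of rebuilding the dataset, is to 
compare only the column names before the write. The sketch below is plain Java 
and makes some assumptions: SchemaCheck and its methods are hypothetical 
helpers (not part of Spark or HUDI), and it compares names only, 
case-sensitively, ignoring types and nullability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark or HUDI.
final class SchemaCheck {

    // Returns the names in 'actual' that are absent from 'expected',
    // in the order they appear in 'actual'. Comparison is case-sensitive.
    static List<String> unexpectedColumns(String[] actual, String[] expected) {
        Set<String> allowed = new LinkedHashSet<>(Arrays.asList(expected));
        List<String> extra = new ArrayList<>();
        for (String name : actual) {
            if (!allowed.contains(name)) {
                extra.add(name);
            }
        }
        return extra;
    }

    // Throws if 'actual' has any column the expected schema does not declare.
    static void requireNoUnexpectedColumns(String[] actual, String[] expected) {
        List<String> extra = unexpectedColumns(actual, expected);
        if (!extra.isEmpty()) {
            throw new IllegalStateException(
                    "Columns not in expected schema: " + extra);
        }
    }
}
```

With Spark, the two arrays could come from dataset.schema().fieldNames() and 
from fieldNames() on the expected StructType, with the check called just before 
the write. This only inspects schema metadata, so it should not trigger a job, 
but it does not catch type mismatches.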
