SamarthRaval opened a new issue, #11277: URL: https://github.com/apache/hudi/issues/11277
**Describe the problem you faced**

I did a bulk-insert operation for my data, which ran fine. For the incoming files I then did an insert operation (the incoming data had a few columns missing and a few new columns added), and as per my understanding Hudi should have been able to handle that.

**To Reproduce**

Steps to reproduce the behavior:

1. Loaded the dataset with a bulk-insert operation.
2. Ran an insert operation and hit `Failed insert schema compatibility check`.
3. When step 2 was done with bulk-insert instead, it ran fine and expanded the schema.
4. The configurations I used are below (a sketch of the two write passes follows the config block).

```
ImmutableMap.Builder<String, String> hudiOptions = ImmutableMap.<String, String>builder()
    .put("hoodie.table.name", tableName)
    .put("hoodie.datasource.write.recordkey.field", "uniqueId")
    .put("hoodie.datasource.write.precombine.field", "version")
    .put("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name())
    .put("hoodie.datasource.write.operation", operation)
    .put("hoodie.combine.before.insert", "true")
    .put("hoodie.datasource.write.keygenerator.class", SimpleKeyGenerator.class.getName())
    .put("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
    .put("hoodie.copyonwrite.record.size.estimate", "50")
    .put("hoodie.parquet.small.file.limit", "104857600")
    .put("hoodie.parquet.max.file.size", "125829120")
    .put("hoodie.write.set.null.for.missing.columns", "true")
    .put("hoodie.datasource.write.reconcile.schema", "true")
    .put("hoodie.datasource.write.partitionpath.field", PARTITION_COLUMN_NAME)
    .put("hoodie.datasource.hive_sync.partition_fields", PARTITION_COLUMN_NAME)
    .put("hoodie.datasource.hive_sync.enable", "true")
    .put("hoodie.datasource.write.hive_style_partitioning", "true")
    .put("hoodie.datasource.hive_sync.table", tableName)
    .put("hoodie.datasource.hive_sync.database", hudiDatabase)
    .put("hoodie.datasource.hive_sync.auto_create_database", "true")
    .put("hoodie.datasource.hive_sync.support_timestamp", "true")
    .put("hoodie.datasource.hive_sync.use_jdbc", "false")
    .put("hoodie.datasource.hive_sync.mode", "hms")
    .put("hoodie.datasource.hive_sync.partition_extractor_class", MultiPartKeysValueExtractor.class.getName())
    .put("hoodie.metadata.enable", "true")
    .put("hoodie.meta.sync.metadata_file_listing", "true")
    .put("hoodie.clean.automatic", "true")
    .put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
    .put("hoodie.cleaner.commits.retained", "30")
    .put("hoodie.cleaner.parallelism", "1000")
    .put("hoodie.archive.merge.enable", "true")
    .put("hoodie.commits.archival.batch", "30")
    .put("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL")
    .put("hoodie.cleaner.policy.failed.writes", "LAZY")
    .put("hoodie.write.concurrency.early.conflict.detection.enable", "true")
    .put("hoodie.write.lock.provider", "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
    .put("hoodie.write.lock.dynamodb.table", hudiLockTable)
    .put("hoodie.write.lock.dynamodb.partition_key", warehouseTableName)
    .put("hoodie.write.lock.dynamodb.region", AWSUtils.getCurrentRegion().getName())
    .put("hoodie.write.lock.dynamodb.endpoint_url", String.format("dynamodb.%s.amazonaws.com", AWSUtils.getCurrentRegion().getName()))
    .put("hoodie.write.lock.dynamodb.billing_mode", "PAY_PER_REQUEST");

if (operation.equals("insert")) {
    hudiOptions.put("hoodie.datasource.write.insert.drop.duplicates", "true");
}
```
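To make the sequence concrete, the two write passes are issued roughly like this (a minimal sketch only; `firstBatch`, `secondBatch`, `basePath`, and `buildHudiOptions` are placeholders around the options map shown above):

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// firstBatch is the initial dataset; secondBatch is the later batch with a few
// columns missing and a few new ones. basePath is the table's S3 base path.
// buildHudiOptions(operation) is a placeholder returning the options map shown
// above, with hoodie.datasource.write.operation set to the given operation.
void reproduce(Dataset<Row> firstBatch, Dataset<Row> secondBatch, String basePath) {
    // Pass 1: bulk_insert of the initial dataset -- this succeeds.
    firstBatch.write()
        .format("hudi")
        .options(buildHudiOptions("bulk_insert"))
        .mode(SaveMode.Overwrite)
        .save(basePath);

    // Pass 2: insert of the incoming batch -- this fails with
    // "Failed insert schema compatibility check".
    secondBatch.write()
        .format("hudi")
        .options(buildHudiOptions("insert"))
        .mode(SaveMode.Append)
        .save(basePath);
}
```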
**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : No

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```
24/05/22 19:44:10 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.hudi.exception.HoodieInsertException: Failed insert schema compatibility check
    at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:868)
    at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:165)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:218)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:504)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:502)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:113)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:129)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:165)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:165)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:276)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:164)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:503)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:503)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:479)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:101)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:151)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1075)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1167)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1176)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Failed to read schema/check compatibility for base path <S3 path>
    at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:844)
    at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:866)
    ... 60 more
Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Column dropping is not allowed all schema comparisions
    at org.apache.hudi.avro.AvroSchemaUtils.checkSchemaCompatible(AvroSchemaUtils.java:373)
    at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:842)
    ... 61 more
```
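For clarity on the expectation: with `hoodie.write.set.null.for.missing.columns=true` and `hoodie.datasource.write.reconcile.schema=true` set, this is roughly the alignment I expected Hudi to apply to the incoming batch before the insert (a sketch only; `tableSchema` here is assumed to be the table's current Spark schema obtained separately):

```
import static org.apache.spark.sql.functions.lit;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Pads any table columns that are missing from the incoming batch with typed
// nulls, so the batch no longer "drops" columns relative to the table schema;
// columns that are new in the batch are left untouched.
static Dataset<Row> alignToTableSchema(Dataset<Row> batch, StructType tableSchema) {
    Set<String> batchCols = new HashSet<>(Arrays.asList(batch.columns()));
    Dataset<Row> aligned = batch;
    for (StructField field : tableSchema.fields()) {
        if (!batchCols.contains(field.name())) {
            aligned = aligned.withColumn(field.name(), lit(null).cast(field.dataType()));
        }
    }
    return aligned;
}
```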