SamarthRaval opened a new issue, #11277: URL: https://github.com/apache/hudi/issues/11277
**Describe the problem you faced**

I did a bulk-insert operation for my data, which ran fine. For the incoming files I then did an insert operation (the incoming data had a few columns missing and a few new columns added), and as per my understanding Hudi should have been able to handle that.

**To Reproduce**

Steps to reproduce the behavior:

1. Loaded the dataset with a bulk-insert operation.
2. Ran an insert operation and hit `Failed insert schema compatibility check`.
3. When step 2 was done with bulk-insert instead, it ran fine and expanded the schema.
4. The configurations I used are below (a sketch of the two write passes follows the config block).

```
ImmutableMap.Builder<String, String> hudiOptions = ImmutableMap.<String, String>builder()
    .put("hoodie.table.name", tableName)
    .put("hoodie.datasource.write.recordkey.field", "uniqueId")
    .put("hoodie.datasource.write.precombine.field", "version")
    .put("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name())
    .put("hoodie.datasource.write.operation", operation)
    .put("hoodie.combine.before.insert", "true")
    .put("hoodie.datasource.write.keygenerator.class", SimpleKeyGenerator.class.getName())
    .put("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
    .put("hoodie.copyonwrite.record.size.estimate", "50")
    .put("hoodie.parquet.small.file.limit", "104857600")
    .put("hoodie.parquet.max.file.size", "125829120")
    .put("hoodie.write.set.null.for.missing.columns", "true")
    .put("hoodie.datasource.write.reconcile.schema", "true")
    .put("hoodie.datasource.write.partitionpath.field", PARTITION_COLUMN_NAME)
    .put("hoodie.datasource.hive_sync.partition_fields", PARTITION_COLUMN_NAME)
    .put("hoodie.datasource.hive_sync.enable", "true")
    .put("hoodie.datasource.write.hive_style_partitioning", "true")
    .put("hoodie.datasource.hive_sync.table", tableName)
    .put("hoodie.datasource.hive_sync.database", hudiDatabase)
    .put("hoodie.datasource.hive_sync.auto_create_database", "true")
    .put("hoodie.datasource.hive_sync.support_timestamp", "true")
    .put("hoodie.datasource.hive_sync.use_jdbc", "false")
    .put("hoodie.datasource.hive_sync.mode", "hms")
    .put("hoodie.datasource.hive_sync.partition_extractor_class", MultiPartKeysValueExtractor.class.getName())
    .put("hoodie.metadata.enable", "true")
    .put("hoodie.meta.sync.metadata_file_listing", "true")
    .put("hoodie.clean.automatic", "true")
    .put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
    .put("hoodie.cleaner.commits.retained", "30")
    .put("hoodie.cleaner.parallelism", "1000")
    .put("hoodie.archive.merge.enable", "true")
    .put("hoodie.commits.archival.batch", "30")
    .put("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL")
    .put("hoodie.cleaner.policy.failed.writes", "LAZY")
    .put("hoodie.write.concurrency.early.conflict.detection.enable", "true")
    .put("hoodie.write.lock.provider", "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
    .put("hoodie.write.lock.dynamodb.table", hudiLockTable)
    .put("hoodie.write.lock.dynamodb.partition_key", warehouseTableName)
    .put("hoodie.write.lock.dynamodb.region", AWSUtils.getCurrentRegion().getName())
    .put("hoodie.write.lock.dynamodb.endpoint_url", String.format("dynamodb.%s.amazonaws.com", AWSUtils.getCurrentRegion().getName()))
    .put("hoodie.write.lock.dynamodb.billing_mode", "PAY_PER_REQUEST");

if (operation.equals("insert")) {
    hudiOptions.put("hoodie.datasource.write.insert.drop.duplicates", "true");
}
```
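To make the sequence concrete, the two write passes are issued roughly like this (a minimal sketch only; `firstBatch`, `secondBatch`, `basePath`, and `buildHudiOptions` are placeholders around the options map shown above):

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// firstBatch is the initial dataset; secondBatch is the later batch with a few
// columns missing and a few new ones. basePath is the table's S3 base path.
// buildHudiOptions(operation) is a placeholder returning the options map shown
// above, with hoodie.datasource.write.operation set to the given operation.
void reproduce(Dataset<Row> firstBatch, Dataset<Row> secondBatch, String basePath) {
    // Pass 1: bulk_insert of the initial dataset -- this succeeds.
    firstBatch.write()
        .format("hudi")
        .options(buildHudiOptions("bulk_insert"))
        .mode(SaveMode.Overwrite)
        .save(basePath);

    // Pass 2: insert of the incoming batch -- this fails with
    // "Failed insert schema compatibility check".
    secondBatch.write()
        .format("hudi")
        .options(buildHudiOptions("insert"))
        .mode(SaveMode.Append)
        .save(basePath);
}
```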
**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : No

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```
24/05/22 19:44:10 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.hudi.exception.HoodieInsertException: Failed insert schema compatibility check
    at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:868)
    at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:165)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:218)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:504)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:502)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:113)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:129)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:165)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:165)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:276)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:164)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:503)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:503)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:479)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:101)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:151)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1075)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1167)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1176)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieException: Failed to read schema/check compatibility for base path <S3 path>
    at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:844)
    at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:866)
    ... 60 more
Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Column dropping is not allowed all schema comparisions
    at org.apache.hudi.avro.AvroSchemaUtils.checkSchemaCompatible(AvroSchemaUtils.java:373)
    at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:842)
    ... 61 more
```
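For clarity on the expectation: with `hoodie.write.set.null.for.missing.columns=true` and `hoodie.datasource.write.reconcile.schema=true` set, this is roughly the alignment I expected Hudi to apply to the incoming batch before the insert (a sketch only; `tableSchema` here is assumed to be the table's current Spark schema obtained separately):

```
import static org.apache.spark.sql.functions.lit;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Pads any table columns that are missing from the incoming batch with typed
// nulls, so the batch no longer "drops" columns relative to the table schema;
// columns that are new in the batch are left untouched.
static Dataset<Row> alignToTableSchema(Dataset<Row> batch, StructType tableSchema) {
    Set<String> batchCols = new HashSet<>(Arrays.asList(batch.columns()));
    Dataset<Row> aligned = batch;
    for (StructField field : tableSchema.fields()) {
        if (!batchCols.contains(field.name())) {
            aligned = aligned.withColumn(field.name(), lit(null).cast(field.dataType()));
        }
    }
    return aligned;
}
```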