danielfordfc opened a new issue, #7049: URL: https://github.com/apache/hudi/issues/7049
We are using the Hudi 0.11.0 DeltaStreamer on emr-6.7.0 to read data from our Confluent Kafka cluster (with Schema Registry) and write it to a Glue Catalog table to be queried through Athena.

Our spark-submit command is as follows:

```
"spark-submit",
"--jars", "/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar",
"--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
"--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
"--conf", "spark.sql.catalogImplementation=hive",
"--conf", "spark.sql.hive.convertMetastoreParquet=false",
"--conf", "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
"--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"--conf", "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
"/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar",
"--props", f"s3://{bucket}/configs/{source}.properties",
"--source-class", "org.apache.hudi.utilities.sources.AvroKafkaSource",
"--target-base-path", f"s3://{bucket}/{source}/raw",
"--target-table", source,
"--schemaprovider-class", "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
"--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
"--source-ordering-field", "published_at",
#"--enable-sync",
```

Our Hudi properties file, for which we've been trying to find the correct configuration, is as follows:

```
#Example NonPartitionedGenerator config
hoodie.datasource.hive_sync.database=pandas_raw
hoodie.database.name=pandas_raw
hoodie.table.name=funding_channel_failed_allocation
hoodie.datasource.hive_sync.table=funding_channel_failed_allocation
hoodie.deltastreamer.transformer.sql=select id, published_at FROM <SRC>
hoodie.datasource.write.precombine.field=published_at
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
# hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
# hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.partitionpath.field=''
hoodie.datasource.write.table.type=COPY_ON_WRITE
hoodie.datasource.write.operation=UPSERT
# hoodie.datasource.hive_sync.enable=true
# hoodie.datasource.hive_sync.mode=hms
# hoodie.datasource.hive_sync.sync_as_datasource=false
# hoodie.datasource.hive_sync.use_jdbc=false
# hoodie.datasource.hive_sync.use_pre_apache_input_format=true
hoodie.avro.schema.validate=false
hoodie.schema.on.read.enable=true
```

We've been commenting and uncommenting the above fields to try to find a combination that works.

Our issue is that the `hoodie.deltastreamer.transformer.sql` statement doesn't appear to have any effect on the output parquet. The parquet files, when opened, do not contain the added column, and the Glue Catalog obviously doesn't show this added column either. I'm unsure whether some of these settings conflict with each other.

With the above configuration, the EMR job step complains about:

`Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Unable to validate the rewritten record {"id": "{REDACTED_UUID}", "published_at": 1464267354683} against schema {OUR SOURCE SCHEMA FETCHED FROM OUR REGISTRY}`

so it is TRYING to do something, and this is likely due to a non-backwards-compatible schema change.

If we try to make a backwards-compatible change like `hoodie.deltastreamer.transformer.sql=select *, '1' AS test_field FROM <SRC>`, there are no errors, but nothing happens: the table and parquet files don't contain the new field. This is confusing considering that `hoodie.avro.schema.validate=false` is set.

Hoping there's an expert out there on this issue?
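
One way to picture the two cases above is the minimal sketch below. It assumes the transformed record is compared field by field against the source schema from the registry. That is only a reading of the error message, not confirmed Hudi behaviour. `SOURCE_SCHEMA` is a hypothetical stand-in (the real schema is redacted; `amount` and `note` are invented fields), and `compare_to_schema` is an illustrative helper, not a Hudi API.

```python
# Hypothetical stand-in for the source schema fetched from the registry;
# the real schema is redacted in this report.
SOURCE_SCHEMA = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "published_at", "type": "long"},
        {"name": "amount", "type": "long"},  # invented; required (no default)
        {"name": "note", "type": ["null", "string"], "default": None},  # invented; optional
    ],
}


def compare_to_schema(record, schema):
    """Return (missing_required, extra) field names for a record vs. a schema."""
    declared = {f["name"]: f for f in schema["fields"]}
    # A declared field with no default cannot be filled in if the record lacks it.
    missing_required = sorted(
        name for name, f in declared.items()
        if name not in record and "default" not in f
    )
    # Fields the record carries that the schema does not declare.
    extra = sorted(name for name in record if name not in declared)
    return missing_required, extra


# Shape produced by `select id, published_at FROM <SRC>`
projected = {"id": "some-uuid", "published_at": 1464267354683}

# Shape produced by `select *, '1' AS test_field FROM <SRC>`
extended = {
    "id": "some-uuid",
    "published_at": 1464267354683,
    "amount": 10,
    "note": None,
    "test_field": "1",
}

print(compare_to_schema(projected, SOURCE_SCHEMA))  # (['amount'], [])
print(compare_to_schema(extended, SOURCE_SCHEMA))   # ([], ['test_field'])
```

Under that assumption, the projection is missing a required field, which would match the exception, while the `select *` variant only carries a field that the source schema does not declare.
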

Ideally we'd like to be able to derive a field using the SQL transformer that we then use for the partitioning strategy, but I'd like to just see the transformer working for starters!

**Environment Description**

Using Hudi deltastreamer on EMR-6.7.0 on AWS.

* Hudi version : 0.11.0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : Amazon 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : N
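
The partitioning goal mentioned above could look roughly like the untested sketch below, which renders the relevant properties lines from Python in the same way the spark-submit arguments are built. It assumes `published_at` is epoch milliseconds (the sample value 1464267354683 suggests so). The column name `published_date` and the helper `partition_properties` are hypothetical, and switching to `SimpleKeyGenerator` is an assumption about what a partitioned setup would need. Whether the derived column ever reaches the written table is exactly the open question in this report.

```python
def partition_properties(ts_field="published_at", partition_field="published_date"):
    """Render properties lines that derive a date column and partition on it.

    Untested sketch: assumes ts_field holds epoch milliseconds.
    """
    # Spark SQL: convert millis to seconds, then format as a date string.
    sql = (
        f"select *, from_unixtime(CAST({ts_field} / 1000 AS BIGINT), 'yyyy-MM-dd') "
        f"AS {partition_field} FROM <SRC>"
    )
    props = {
        "hoodie.deltastreamer.transformer.sql": sql,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Assumption: a single-field partition path pairs with SimpleKeyGenerator.
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(partition_properties())
```

These lines would replace the `NonpartitionedKeyGenerator` and empty `partitionpath.field` entries in the properties file shown earlier.
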