danielfordfc opened a new issue, #7049:
URL: https://github.com/apache/hudi/issues/7049

   We are using Hudi 0.11.0 DeltaStreamer on emr-6.7.0 to read data from our Confluent Kafka cluster (with Schema Registry) and write it to a Glue Catalog table to be queried through Athena.
   
   Our spark-submit command is as follows:
   
   ```
   "spark-submit",
   "--jars",
   "/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar",
   "--class",
   "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
   "--conf",
   "spark.serializer=org.apache.spark.serializer.KryoSerializer",
   "--conf",
   "spark.sql.catalogImplementation=hive",
   "--conf",
   "spark.sql.hive.convertMetastoreParquet=false",
   "--conf",
   "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
   "--conf",
   "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
   "--conf",
   "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
   "/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.0-amzn-0.jar",
   "--props",
   f"s3://{bucket}/configs/{source}.properties",
   "--source-class",
   "org.apache.hudi.utilities.sources.AvroKafkaSource",
   "--target-base-path",
   f"s3://{bucket}/{source}/raw",
   "--target-table",
   source,
   "--schemaprovider-class",
   "org.apache.hudi.utilities.schema.SchemaRegistryProvider",
   "--transformer-class",
   "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
   "--source-ordering-field",
   "published_at",
   # "--enable-sync",
   ```
   
   and the Hudi properties file we've been trying to find the correct configuration for is as follows:
   
   ```
   # Example NonPartitionedGenerator config
   hoodie.datasource.hive_sync.database=pandas_raw
   hoodie.database.name=pandas_raw
   hoodie.table.name=funding_channel_failed_allocation
   hoodie.datasource.hive_sync.table=funding_channel_failed_allocation
   hoodie.deltastreamer.transformer.sql=select id, published_at FROM <SRC>
   hoodie.datasource.write.precombine.field=published_at
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   # hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   # hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.write.partitionpath.field=''
   hoodie.datasource.write.table.type=COPY_ON_WRITE
   hoodie.datasource.write.operation=UPSERT
   # hoodie.datasource.hive_sync.enable=true
   # hoodie.datasource.hive_sync.mode=hms
   # hoodie.datasource.hive_sync.sync_as_datasource=false
   # hoodie.datasource.hive_sync.use_jdbc=false
   # hoodie.datasource.hive_sync.use_pre_apache_input_format=true
   hoodie.avro.schema.validate=false
   hoodie.schema.on.read.enable=true
   ```
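
   (Not shown above: the Kafka and Schema Registry connection settings that AvroKafkaSource and SchemaRegistryProvider read. A rough sketch of what that part looks like, with placeholder hosts and topic instead of our real values, in case it matters:)

   ```
   # Kafka / Schema Registry connection (placeholder values)
   hoodie.deltastreamer.source.kafka.topic=<topic>
   bootstrap.servers=<broker-host>:9092
   auto.offset.reset=earliest
   schema.registry.url=https://<schema-registry-host>
   hoodie.deltastreamer.schemaprovider.registry.url=https://<schema-registry-host>/subjects/<topic>-value/versions/latest
   ```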
   
   We've been commenting and uncommenting the above fields to try and find a 
combination that works.
   
   Our issue is that the `hoodie.deltastreamer.transformer.sql` statement doesn't appear to have any effect on the output Parquet. The Parquet files, when opened, do not contain the added column, and the Glue Catalog obviously doesn't show this added column either.
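
   (For reference, a minimal PySpark sketch, with placeholder paths, of how the written schema can be checked directly from S3, independent of the Glue Catalog:)

   ```
   # pyspark --jars /usr/lib/hudi/hudi-spark-bundle.jar
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("hudi-schema-check").getOrCreate()

   # Read the Hudi table straight from the --target-base-path used above
   df = spark.read.format("hudi").load("s3://<bucket>/<source>/raw")

   df.printSchema()  # a transformer-added column should show up here if it was written
   df.select("id", "published_at").show(5, truncate=False)
   ```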
   
   I'm unsure whether I have conflicting configuration somewhere.
   
   In the above configuration, the EMR job step complains about:
   `Caused by: org.apache.hudi.exception.SchemaCompatibilityException: Unable 
to validate the rewritten record {"id": "{REDACTED_UUID}", "published_at": 
1464267354683} against schema {OUR SOURCE SCHEMA FETCHED FROM OUR REGISTRY}`
   
   so it IS trying to do something, and the failure is likely due to a non-backwards-compatible schema change. If we try to make a backwards-compatible change like
   
   `hoodie.deltastreamer.transformer.sql=select *, '1' AS test_field FROM <SRC>`
   
   there are no errors, but nothing happens: the table and Parquet files don't contain the added data.
   
   This is confusing, considering that we have set:
   
   `hoodie.avro.schema.validate=false`
   
   Hoping there's an expert out there on this issue. Ideally we'd like to derive a field using the SQL transformer and then use it in our partitioning strategy, but for starters I'd just like to see the transformer working!
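
   One thing we're wondering about (an assumption on our part, not something we've confirmed): whether the writer is validating the transformed rows against the *source* schema because SchemaRegistryProvider only has a source URL configured, and whether a separate *target* schema subject containing the transformer's extra column(s) needs to be supplied, e.g. roughly:

   ```
   # Hypothetical: keep registry.url pointing at the Kafka value (source) schema,
   # and point targetUrl at a second subject that already includes the added column(s).
   hoodie.deltastreamer.schemaprovider.registry.url=https://<schema-registry-host>/subjects/<topic>-value/versions/latest
   hoodie.deltastreamer.schemaprovider.registry.targetUrl=https://<schema-registry-host>/subjects/<topic>-transformed-value/versions/latest
   ```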
   
   
   **Environment Description**
   Using Hudi deltastreamer on EMR-6.7.0 on AWS.
   
   * Hudi version : 0.11.0
   * Spark version : 3.2.1
   * Hive version : 3.1.3
   * Hadoop version : Amazon 3.2.1
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : N
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
