Sarfaraz-214 opened a new issue, #10233:
URL: https://github.com/apache/hudi/issues/10233

   I am using HoodieStreamer with **Hudi 0.14** and trying to leverage [autogenerated keys](https://hudi.apache.org/releases/release-0.14.0/#support-for-hudi-tables-with-autogenerated-keys). Hence I am not passing **hoodie.datasource.write.recordkey.field** or **hoodie.datasource.write.precombine.field**.
   
   Additionally, I am passing **hoodie.spark.sql.insert.into.operation=insert** (instead of `--op insert`), since the release notes state that no pre-combine key is needed with the insert and bulk_insert operations.
   
   With the above, the **.hoodie** directory gets created, but the data write to GCS fails with this error:
   ```
   org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[c1, c2, c3, c4, c5]
           at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:601)
   ```
   
   I also see in the **hoodie.properties** file that the pre-combine key is getting set to **ts** (`hoodie.table.precombine.field=ts`). This seems to be coming from the default value of **--source-ordering-field**. How can we skip the pre-combine field in this case?
   
   This is happening for both CoW & MoR tables.
   
   The same setup runs fine via Spark SQL, but I am facing this issue when using HoodieStreamer.
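
   For reference, this is roughly the Spark SQL path that works for me (a sketch; the table name is hypothetical and the column names are taken from the error message above):
   ```
   -- no primaryKey / preCombineField table properties, so keys are autogenerated
   CREATE TABLE keyless_demo (c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING, job_id STRING)
   USING hudi
   PARTITIONED BY (job_id)
   LOCATION 'gs://<fullPath>/<tableName>';

   SET hoodie.spark.sql.insert.into.operation=insert;
   INSERT INTO keyless_demo VALUES ('a', 'b', 'c', 'd', 'e', 'job-1');
   ```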
   
   Sharing the configurations used:
   
   **hudi-table.properties**
   ```
   hoodie.datasource.write.partitionpath.field=job_id
   hoodie.spark.sql.insert.into.operation=insert
   bootstrap.servers=***
   security.protocol=SASL_SSL
   sasl.mechanism=PLAIN
   sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='***' password='***';
   auto.offset.reset=earliest
   hoodie.deltastreamer.source.kafka.topic=<topicName>
   hoodie.deltastreamer.schemaprovider.source.schema.file=gs://<fullPath>/<schemaName>.avsc
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
   ```
   
   **spark-submit command**
   ```
   spark-submit \
       --class org.apache.hudi.utilities.streamer.HoodieStreamer \
       --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0 \
       --properties-file /home/sarfaraz_h/spark-config.properties \
       --master yarn \
       --deploy-mode cluster \
       --driver-memory 12G \
       --driver-cores 3 \
       --executor-memory 12G \
       --executor-cores 3 \
       --num-executors 3 \
       --conf spark.yarn.maxAppAttempts=1 \
       --conf spark.sql.shuffle.partitions=18 \
       gs://<fullPath>/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
       --continuous \
       --source-limit 1000000 \
       --min-sync-interval-seconds 600 \
       --table-type COPY_ON_WRITE \
       --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
       --target-base-path gs://<fullPath>/<tableName> \
       --target-table <tableName> \
       --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
       --props gs://<fullPath>/configfolder/es_user_profile_config.properties
   ```
   
   Spark version used: 3.3.2
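
   For completeness, this is the relevant entry that ends up in the table's **hoodie.properties** (other entries omitted):
   ```
   hoodie.table.precombine.field=ts
   ```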

