alessandro pontis created SPARK-47842:
-----------------------------------------

             Summary: Spark jobs relying on Hudi are blocked after one or zero commits
                 Key: SPARK-47842
                 URL: https://issues.apache.org/jira/browse/SPARK-47842
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming
    Affects Versions: 3.3.0
         Environment: Hudi version : 0.12.1-amzn-0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.3.3 amz
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no (EMR 6.9.0)
            Reporter: alessandro pontis
         Attachments: Screenshot_20 1.png

Hello, we are facing an issue where some PySpark jobs that rely on Hudi appear to
be blocked. Looking at the Spark console we see the following situation:
!image-2024-04-13-15-57-56-605.png|width=1040,height=558!
We can see 71 completed jobs, but these are CDC processes that should be reading
from a Kafka topic continuously. We have verified that there are messages queued
on the Kafka topic. If we kill the application and then restart it, in some cases
the job behaves normally; in other cases it remains stuck.

Our deployment conditions are the following:
We read INSERT, UPDATE, and DELETE operations from a Kafka topic and replicate
them into a target Hudi table stored on Hive, via a PySpark job running 24/7.
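The report only shows the write side; for completeness, here is a minimal sketch
of how {{df_source}} could be built, assuming the standard spark-sql-kafka
connector, with placeholder broker and topic names that are not taken from the
original report:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cdc").getOrCreate()

# Placeholder broker and topic names; the real values are not in the report.
df_source = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "cdc_topic")
             .option("startingOffsets", "latest")
             .load())
{code}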

 

PYSPARK WRITE
{code:python}
def foreach_batch_write_function(batchDF, batch_id):
    # batchDF_deletes and batchDF_upserts are derived from batchDF;
    # the split logic is omitted in the original report.

    # management of delete messages
    batchDF_deletes.write.format('hudi') \
        .option('hoodie.datasource.write.operation', 'delete') \
        .options(**hudiOptions_table) \
        .mode('append') \
        .save(S3_OUTPUT_PATH)

    # management of update and insert messages
    batchDF_upserts.write.format('org.apache.hudi') \
        .option('hoodie.datasource.write.operation', 'upsert') \
        .options(**hudiOptions_table) \
        .mode('append') \
        .save(S3_OUTPUT_PATH)

# each micro-batch is handed to the function above
query = df_source.writeStream.foreachBatch(foreach_batch_write_function).start()
{code}
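How {{batchDF_deletes}} and {{batchDF_upserts}} are produced is not shown above;
a plausible split, assuming the Kafka payload exposes a hypothetical CDC
operation column named {{op}}:
{code:python}
from pyspark.sql import functions as F

def split_cdc_batch(batchDF):
    # 'op' is an assumed column carrying the CDC operation type
    # ('I' insert, 'U' update, 'D' delete); not confirmed by the report.
    batchDF_deletes = batchDF.filter(F.col("op") == "D")
    batchDF_upserts = batchDF.filter(F.col("op").isin("I", "U"))
    return batchDF_deletes, batchDF_upserts
{code}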
 
SPARK SUBMIT
{code:bash}
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 1 --executor-memory 1G --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=false \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar <path_to_script>
{code}
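To distinguish a genuinely stuck query from an idle one, one option (not part of
the original report) is to poll the StreamingQuery handle returned by
{{start()}}:
{code:python}
import time

# 'query' is the StreamingQuery handle from writeStream ... .start().
# If lastProgress shows numInputRows stuck at 0 while messages are queued
# on the topic, the job is blocked rather than idle.
while query.isActive:
    print(query.status)        # trigger/processing state
    print(query.lastProgress)  # most recent micro-batch metrics
    time.sleep(60)
{code}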


