geserdugarov opened a new issue, #11616:
URL: https://github.com/apache/hudi/issues/11616

   **Describe the problem you faced**
   
   After the merge of https://github.com/apache/hudi/pull/10957, passing `HoodieSparkKryoRegistrar` to `spark.kryo.registrator` became mandatory again when PySpark is used. The same issue was previously raised for the 0.15-rc2 version: https://github.com/apache/hudi/issues/11334.
   
   How should we treat this: as a bug, since the expected user experience is broken, or as an intentional change, meaning that after migrating to 1.0 the `HoodieSparkKryoRegistrar` setting becomes mandatory?
   
   To work around this `NullPointerException`, it is enough to set `spark.kryo.registrator` to `org.apache.spark.HoodieSparkKryoRegistrar` (see the sketch below).
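   
   For reference, a minimal sketch of applying the workaround when building the session from PySpark; the application name is a hypothetical example, and `spark.serializer` is set to Kryo as Hudi's documentation recommends:
   ```
   from pyspark.sql import SparkSession

   # Workaround sketch: register Hudi's Kryo registrar explicitly.
   # The app name is an arbitrary example.
   spark = (SparkSession.builder
            .appName("hudi-kryo-workaround")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
            .getOrCreate())
   ```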
   
   It took me some time to figure out the cause of the NPE. During my search I had already raised a JIRA issue for the underlying problem:
   https://issues.apache.org/jira/browse/HUDI-7938
   
   **To Reproduce**
   
   Run from PySpark:
   ```
   import pyspark

   # `spark` below is the SparkSession provided by the PySpark shell.
   # The target path is an arbitrary example location.
   tmp_dir_path = "/tmp/hudi_issue_11616"

   input_data = [pyspark.sql.Row(id=1, name="a1", precomb=1),
                 pyspark.sql.Row(id=2, name="a2", precomb=1)]
   df = spark.createDataFrame(input_data)

   # Hudi configuration parameters
   hudi_options = {
       "hoodie.table.name": "table_name",
       "hoodie.datasource.write.table.name": "table_name",
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "precomb"
   }

   # Write the table, then read it back; the NPE is thrown during collect().
   (df.write
    .format("org.apache.hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save(tmp_dir_path))

   df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
   print("Finished loading, started to collect")
   print("Rows: ", df_load.collect())
   ```
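   
   A quick sanity check, assuming the session was built with the workaround above, is to read the registrator back from the runtime config:
   ```
   # Prints org.apache.spark.HoodieSparkKryoRegistrar when the workaround
   # is in place; prints None if the registrator was never set.
   print(spark.conf.get("spark.kryo.registrator", None))
   ```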
   
   
   **Expected behavior**
   
   No `NullPointerException`; the read should succeed and `df_load.collect()` should return the two rows written above.
   
   **Environment Description**
   
   * Hudi version : 1.0.0-beta2-rc2
   
   * Spark version : 3.4.3
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : Local FS
   
   * Running on Docker? (yes/no) : No
   
   
   
   **Stacktrace**
   
   ```
   Caused by: java.lang.NullPointerException
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
        at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
        at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
        at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
        at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:193)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
   ```
   
   

