geserdugarov opened a new issue, #11616: URL: https://github.com/apache/hudi/issues/11616
**Describe the problem you faced**

After the merge of https://github.com/apache/hudi/pull/10957, passing `HoodieSparkKryoRegistrar` to `spark.kryo.registrator` became mandatory again when PySpark is used. Previously, the same issue was raised for the 0.15-rc2 version: https://github.com/apache/hudi/issues/11334.

How should we treat this: as a bug, since the expected user experience is not supported, or as an intentional change, meaning that after migration to 1.0 setting `HoodieSparkKryoRegistrar` becomes mandatory?

To fix this `NullPointerException`, it is enough to set `spark.kryo.registrator` to `org.apache.spark.HoodieSparkKryoRegistrar`. It took me some time to figure out the cause of the NPE. While investigating, I already raised a Jira issue for the problem: https://issues.apache.org/jira/browse/HUDI-7938

**To Reproduce**

Run from PySpark:

```python
input_data = [pyspark.sql.Row(id=1, name="a1", precomb=1),
              pyspark.sql.Row(id=2, name="a2", precomb=1)]
df = spark.createDataFrame(input_data)

# Hudi configuration parameters
hudi_options = {
    "hoodie.table.name": "table_name",
    "hoodie.datasource.write.table.name": "table_name",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "precomb"
}

(df.write
   .format("org.apache.hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save(tmp_dir_path))

df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
print("Finished loading, started to collect")
print("Rows: ", df_load.collect())
```

**Expected behavior**

No `NullPointerException`.

**Environment Description**

* Hudi version : 1.0.0-beta2-rc2
* Spark version : 3.4.3
* Hive version : 2.3.9
* Hadoop version : 3.3.4
* Storage (HDFS/S3/GCS..) : Local FS
* Running on Docker? (yes/no) : No
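**Workaround**

The fix described above (setting `spark.kryo.registrator`) can be applied when building the SparkSession; a minimal sketch, assuming `pyspark` and the Hudi Spark bundle are on the classpath (the app name and serializer setting are illustrative, not from the issue):

```python
from pyspark.sql import SparkSession

# Register HoodieSparkKryoRegistrar explicitly to avoid the NPE on read.
spark = (SparkSession.builder
         .appName("hudi-kryo-workaround")  # hypothetical app name
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
         .getOrCreate())
```

With this configuration in place, the read/collect in the reproduction steps above should no longer hit the `NullPointerException`.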
**Stacktrace**

```
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
	at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
	at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
	at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
	at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:193)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
```