[ https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HUDI-1839:
--------------------------------
    Fix Version/s: (was: 0.9.0)
                   0.10.0

> FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-1839
>                 URL: https://issues.apache.org/jira/browse/HUDI-1839
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: satish
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> FSUtils.getAllPartitionPaths is expected to work whether the metadata table is enabled or not. It can also be called inside a Spark context. But it looks like an attempt to improve parallelism ends up shipping org.apache.hadoop.fs.Path objects into Spark tasks, which causes NotSerializableExceptions because Path is not serializable. There are multiple callers using it within a Spark context (clustering/cleaner). See the stack trace below.
>
> 21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
> 	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321)
> 	at org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67)
> 	at org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71)
> 	at org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56)
> 	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160)
> 	at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873)
> 	at org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861)
> 	at com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111)
> 	at com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 53, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
> Serialization stack:
> 	- object not serializable (class: org.apache.hadoop.fs.Path, value: hdfs://...)
> 	- element of array (index: 0)
> 	- array (class [Ljava.lang.Object;, size 1)
> 	- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
> 	- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(hdfs://...))
> 	- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
> 	- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@735)
> 	- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
> 	- object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0))
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
> 	at scala.Option.foreach(Option.scala:257)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2063)
> 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> 	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2091)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2110)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
> 	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:968)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> 	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> 	at org.apache.spark.rdd.RDD.collect(RDD.scala:967)
> 	at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
> 	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> 	at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:79)
> 	at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:79)
> 	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:319)
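>
> The serialization stack points at FileSystemBackedTableMetadata.getAllPartitionPaths parallelizing Path objects through HoodieSparkEngineContext.map. Below is a minimal standalone sketch of that failure mode and one possible direction for a fix (ship plain Strings and rebuild the Path inside the closure); the class name and paths are hypothetical, not the actual Hudi code:
> {code:java}
> import java.util.Arrays;
> import java.util.List;
> import java.util.stream.Collectors;
>
> import org.apache.hadoop.fs.Path;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class PathSerializationRepro {
>   public static void main(String[] args) {
>     try (JavaSparkContext jsc = new JavaSparkContext("local[2]", "repro")) {
>       List<Path> dirs = Arrays.asList(new Path("hdfs://ns/table/2021/04/20"));
>
>       // BROKEN: the Path objects become part of the task partitions, and task
>       // serialization fails with
>       // java.io.NotSerializableException: org.apache.hadoop.fs.Path
>       // jsc.parallelize(dirs, dirs.size()).map(Path::getName).collect();
>
>       // FIX: ship plain Strings to the executors and rebuild the Path inside
>       // the closure, so nothing non-serializable crosses the task boundary.
>       List<String> dirStrings =
>           dirs.stream().map(Path::toString).collect(Collectors.toList());
>       List<String> names = jsc.parallelize(dirStrings, dirStrings.size())
>           .map(s -> new Path(s).getName())
>           .collect();
>       System.out.println(names);
>     }
>   }
> }
> {code}
> The same String-based approach would presumably apply inside FileSystemBackedTableMetadata, since the listing results can be converted back to Path on the executors.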
--
This message was sent by Atlassian Jira
(v8.3.4#803005)