Tanveer created SPARK-32116:
-------------------------------

             Summary: Python RDD containing a 'pyarrow record_batch object' to java RDD conversion issue
                 Key: SPARK-32116
                 URL: https://issues.apache.org/jira/browse/SPARK-32116
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 2.3.4
            Reporter: Tanveer
I want to convert a Python RDD ('prdd') containing pyarrow RecordBatch objects into a Java RDD ('jrdd'), and then use that jrdd to build a Spark DataFrame. But I am facing an issue in dealing with it; please see the log below. I am new to Spark and have been struggling with this issue for many days. All the PySpark-with-Arrow example code I can find goes from Pandas to Arrow, but I want to make an 'ardd' from Arrow RecordBatches and then convert it into a Spark DataFrame. Is my approach right? No one is answering on the mailing list. Please, someone, guide me on this issue. For comparison, I have added a pandas-based sketch after the log below. Thanks.

{code:java}
import pyarrow as pa
from pyspark.rdd import RDD

# Build a two-column Arrow RecordBatch ('spark' is the active SparkSession).
data = [pa.array(range(5), type='int16'),
        pa.array([-10, -5, 0, None, 10], type='int32')]
batch = pa.record_batch(data, ['c0', 'c1'])

# Parallelize the batch, round-trip it through the JVM, and count.
data_rdd = spark.sparkContext.parallelize(batch)
data_java_rdd = data_rdd._to_java_object_rdd()
data_python_rdd = spark.sparkContext._jvm.SerDeUtil.javaToPython(data_java_rdd)
converted_rdd = RDD(data_python_rdd, spark.sparkContext)
print(converted_rdd.count())
{code}

Log:

{code:java}
2020-06-28 07:09:54 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-28 07:09:55 INFO SparkContext:54 - Running Spark version 2.3.4
2020-06-28 07:09:55 INFO SparkContext:54 - Submitted application: Python Arrow-in-Spark example
2020-06-28 07:09:55 INFO SecurityManager:54 - Changing view acls to: tahmad
2020-06-28 07:09:55 INFO SecurityManager:54 - Changing modify acls to: tahmad
2020-06-28 07:09:55 INFO SecurityManager:54 - Changing view acls groups to:
2020-06-28 07:09:55 INFO SecurityManager:54 - Changing modify acls groups to:
2020-06-28 07:09:55 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tahmad); groups with view permissions: Set(); users with modify permissions: Set(tahmad); groups with modify permissions: Set()
2020-06-28 07:09:55 INFO Utils:54 - Successfully started service 'sparkDriver' on port 33475.
2020-06-28 07:09:55 INFO SparkEnv:54 - Registering MapOutputTracker
2020-06-28 07:09:55 INFO SparkEnv:54 - Registering BlockManagerMaster
2020-06-28 07:09:55 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2020-06-28 07:09:55 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2020-06-28 07:09:55 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-e7d2bdc7-ae0f-4186-b1c1-bcde7bbdccfa
2020-06-28 07:09:55 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2020-06-28 07:09:55 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2020-06-28 07:09:55 INFO log:192 - Logging initialized @2270ms
2020-06-28 07:09:55 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2020-06-28 07:09:55 INFO Server:419 - Started @2337ms
2020-06-28 07:09:55 INFO AbstractConnector:278 - Started ServerConnector@27de2e9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2020-06-28 07:09:55 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a902f7b{/jobs,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58cbdcbe{/jobs/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b0b49e8{/jobs/job,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61bc464a{/jobs/job/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@671448ef{/stages,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@764119ec{/stages/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7a535077{/stages/stage,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@42385b89{/stages/stage/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@22086b51{/stages/pool,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2d62d5be{/stages/pool/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@540f4794{/storage,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@546bba8f{/storage/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35d811b0{/storage/rdd,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@eacfd90{/storage/rdd/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@20b13836{/environment,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2987516d{/environment/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5c27907c{/executors,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@358eb615{/executors/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1cd7481a{/executors/threadDump,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@fc87ed4{/executors/threadDump/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f8cab21{/static,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7d4c5b58{/,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54cdd155{/api,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c5a66d2{/jobs/job/kill,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@9218486{/stages/stage/kill,null,AVAILABLE,@Spark}
2020-06-28 07:09:55 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://tcn862:4040
2020-06-28 07:09:55 INFO SparkContext:54 - Added file file:/nfs/home3/tahmad/tahmad/script.py at file:/nfs/home3/tahmad/tahmad/script.py with timestamp 1593320995973
2020-06-28 07:09:55 INFO Utils:54 - Copying /nfs/home3/tahmad/tahmad/script.py to /tmp/spark-405f8ca1-a57f-4cae-8fa4-d459dd74b5d7/userFiles-d8452ced-0f8d-47b0-9d31-95fd736628a4/script.py
2020-06-28 07:09:56 INFO Executor:54 - Starting executor ID driver on host localhost
2020-06-28 07:09:56 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32978.
2020-06-28 07:09:56 INFO NettyBlockTransferService:54 - Server created on tcn862:32978
2020-06-28 07:09:56 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2020-06-28 07:09:56 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, tcn862, 32978, None)
2020-06-28 07:09:56 INFO BlockManagerMasterEndpoint:54 - Registering block manager tcn862:32978 with 366.3 MB RAM, BlockManagerId(driver, tcn862, 32978, None)
2020-06-28 07:09:56 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, tcn862, 32978, None)
2020-06-28 07:09:56 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, tcn862, 32978, None)
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@56ce8be9{/metrics/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/nfs/home3/tahmad/tahmad/spark-warehouse/').
2020-06-28 07:09:56 INFO SharedState:54 - Warehouse path is 'file:/nfs/home3/tahmad/tahmad/spark-warehouse/'.
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36e1dd67{/SQL,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77e876c7{/SQL/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2fb8cdb0{/SQL/execution,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@651413b3{/SQL/execution/json,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7db9e9cb{/static/sql,null,AVAILABLE,@Spark}
2020-06-28 07:09:56 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2020-06-28 07:09:56 INFO SparkContext:54 - Starting job: count at /nfs/home3/tahmad/tahmad/script.py:301
2020-06-28 07:09:56 INFO DAGScheduler:54 - Got job 0 (count at /nfs/home3/tahmad/tahmad/script.py:301) with 24 output partitions
2020-06-28 07:09:56 INFO DAGScheduler:54 - Final stage: ResultStage 0 (count at /nfs/home3/tahmad/tahmad/script.py:301)
2020-06-28 07:09:56 INFO DAGScheduler:54 - Parents of final stage: List()
2020-06-28 07:09:56 INFO DAGScheduler:54 - Missing parents: List()
2020-06-28 07:09:56 INFO DAGScheduler:54 - Submitting ResultStage 0 (PythonRDD[4] at count at /nfs/home3/tahmad/tahmad/script.py:301), which has no missing parents
2020-06-28 07:09:56 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 7.1 KB, free 366.3 MB)
2020-06-28 07:09:57 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.2 KB, free 366.3 MB)
2020-06-28 07:09:57 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on tcn862:32978 (size: 4.2 KB, free: 366.3 MB)
2020-06-28 07:09:57 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1039
2020-06-28 07:09:57 INFO DAGScheduler:54 - Submitting 24 missing tasks from ResultStage 0 (PythonRDD[4] at count at /nfs/home3/tahmad/tahmad/script.py:301) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2020-06-28 07:09:57 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 24 tasks
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 11.0 in stage 0.0 (TID 11, localhost, executor driver, partition 11, PROCESS_LOCAL, 7999 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 12.0 in stage 0.0 (TID 12, localhost, executor driver, partition 12, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 13.0 in stage 0.0 (TID 13, localhost, executor driver, partition 13, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 14.0 in stage 0.0 (TID 14, localhost, executor driver, partition 14, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 15.0 in stage 0.0 (TID 15, localhost, executor driver, partition 15, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 16.0 in stage 0.0 (TID 16, localhost, executor driver, partition 16, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 17.0 in stage 0.0 (TID 17, localhost, executor driver, partition 17, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 18.0 in stage 0.0 (TID 18, localhost, executor driver, partition 18, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 19.0 in stage 0.0 (TID 19, localhost, executor driver, partition 19, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 20.0 in stage 0.0 (TID 20, localhost, executor driver, partition 20, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 21.0 in stage 0.0 (TID 21, localhost, executor driver, partition 21, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 22.0 in stage 0.0 (TID 22, localhost, executor driver, partition 22, PROCESS_LOCAL, 7839 bytes)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Starting task 23.0 in stage 0.0 (TID 23, localhost, executor driver, partition 23, PROCESS_LOCAL, 8009 bytes)
2020-06-28 07:09:57 INFO Executor:54 - Running task 7.0 in stage 0.0 (TID 7)
2020-06-28 07:09:57 INFO Executor:54 - Running task 5.0 in stage 0.0 (TID 5)
2020-06-28 07:09:57 INFO Executor:54 - Running task 12.0 in stage 0.0 (TID 12)
2020-06-28 07:09:57 INFO Executor:54 - Running task 14.0 in stage 0.0 (TID 14)
2020-06-28 07:09:57 INFO Executor:54 - Running task 6.0 in stage 0.0 (TID 6)
2020-06-28 07:09:57 INFO Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2020-06-28 07:09:57 INFO Executor:54 - Running task 9.0 in stage 0.0 (TID 9)
2020-06-28 07:09:57 INFO Executor:54 - Running task 8.0 in stage 0.0 (TID 8)
2020-06-28 07:09:57 INFO Executor:54 - Running task 10.0 in stage 0.0 (TID 10)
2020-06-28 07:09:57 INFO Executor:54 - Running task 4.0 in stage 0.0 (TID 4)
2020-06-28 07:09:57 INFO Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2020-06-28 07:09:57 INFO Executor:54 - Running task 3.0 in stage 0.0 (TID 3)
2020-06-28 07:09:57 INFO Executor:54 - Running task 11.0 in stage 0.0 (TID 11)
2020-06-28 07:09:57 INFO Executor:54 - Running task 2.0 in stage 0.0 (TID 2)
2020-06-28 07:09:57 INFO Executor:54 - Running task 13.0 in stage 0.0 (TID 13)
2020-06-28 07:09:57 INFO Executor:54 - Running task 23.0 in stage 0.0 (TID 23)
2020-06-28 07:09:57 INFO Executor:54 - Running task 22.0 in stage 0.0 (TID 22)
2020-06-28 07:09:57 INFO Executor:54 - Running task 21.0 in stage 0.0 (TID 21)
2020-06-28 07:09:57 INFO Executor:54 - Running task 20.0 in stage 0.0 (TID 20)
2020-06-28 07:09:57 INFO Executor:54 - Running task 19.0 in stage 0.0 (TID 19)
2020-06-28 07:09:57 INFO Executor:54 - Running task 18.0 in stage 0.0 (TID 18)
2020-06-28 07:09:57 INFO Executor:54 - Running task 17.0 in stage 0.0 (TID 17)
2020-06-28 07:09:57 INFO Executor:54 - Running task 16.0 in stage 0.0 (TID 16)
2020-06-28 07:09:57 INFO Executor:54 - Running task 15.0 in stage 0.0 (TID 15)
2020-06-28 07:09:57 INFO Executor:54 - Fetching file:/nfs/home3/tahmad/tahmad/script.py with timestamp 1593320995973
2020-06-28 07:09:57 INFO Utils:54 - /nfs/home3/tahmad/tahmad/script.py has been previously copied to /tmp/spark-405f8ca1-a57f-4cae-8fa4-d459dd74b5d7/userFiles-d8452ced-0f8d-47b0-9d31-95fd736628a4/script.py
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 334, boot = 293, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 332, boot = 292, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 330, boot = 290, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 340, boot = 299, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 341, boot = 301, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 329, boot = 289, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 328, boot = 287, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 344, boot = 303, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 345, boot = 305, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 326, boot = 285, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 347, boot = 307, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 324, boot = 284, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 349, boot = 308, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 322, boot = 282, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 350, boot = 310, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 319, boot = 279, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 352, boot = 312, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 354, boot = 314, init = 40, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 320, boot = 274, init = 46, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 319, boot = 276, init = 43, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 320, boot = 278, init = 42, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 48, boot = 6, init = 42, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 46, boot = 3, init = 42, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 51, boot = 10, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 54, boot = 13, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 56, boot = 15, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 356, boot = 315, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 61, boot = 20, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 65, boot = 24, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 69, boot = 28, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 72, boot = 30, init = 41, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 83, boot = 42, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 87, boot = 46, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 90, boot = 49, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 94, boot = 53, init = 40, finish = 1
2020-06-28 07:09:57 INFO Executor:54 - Finished task 5.0 in stage 0.0 (TID 5). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 17.0 in stage 0.0 (TID 17). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 13.0 in stage 0.0 (TID 13). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 18.0 in stage 0.0 (TID 18). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 14.0 in stage 0.0 (TID 14). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 12.0 in stage 0.0 (TID 12). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 1.0 in stage 0.0 (TID 1). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 7.0 in stage 0.0 (TID 7). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 6.0 in stage 0.0 (TID 6). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 8.0 in stage 0.0 (TID 8). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 15.0 in stage 0.0 (TID 15). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 19.0 in stage 0.0 (TID 19). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 105, boot = 63, init = 42, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 108, boot = 67, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 98, boot = 57, init = 40, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 101, boot = 59, init = 42, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 110, boot = 69, init = 41, finish = 0
2020-06-28 07:09:57 INFO Executor:54 - Finished task 22.0 in stage 0.0 (TID 22). 1418 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 0.0 in stage 0.0 (TID 0). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 9.0 in stage 0.0 (TID 9). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 114, boot = 73, init = 41, finish = 0
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 113, boot = 71, init = 41, finish = 1
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 116, boot = 75, init = 40, finish = 1
2020-06-28 07:09:57 INFO Executor:54 - Finished task 3.0 in stage 0.0 (TID 3). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 4.0 in stage 0.0 (TID 4). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 5.0 in stage 0.0 (TID 5) in 493 ms on localhost (executor driver) (1/24)
2020-06-28 07:09:57 INFO Executor:54 - Finished task 20.0 in stage 0.0 (TID 20). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 16.0 in stage 0.0 (TID 16). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 21.0 in stage 0.0 (TID 21). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO Executor:54 - Finished task 2.0 in stage 0.0 (TID 2). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 13.0 in stage 0.0 (TID 13) in 494 ms on localhost (executor driver) (2/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 17.0 in stage 0.0 (TID 17) in 493 ms on localhost (executor driver) (3/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 18.0 in stage 0.0 (TID 18) in 494 ms on localhost (executor driver) (4/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 14.0 in stage 0.0 (TID 14) in 496 ms on localhost (executor driver) (5/24)
2020-06-28 07:09:57 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 238, in main
    eval_type = read_int(infile)
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
    raise EOFError
EOFError
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:336)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:475)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:458)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO PythonRunner:54 - Times: total = 89, boot = 15, init = 74, finish = 0
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 12.0 in stage 0.0 (TID 12) in 496 ms on localhost (executor driver) (6/24)
2020-06-28 07:09:57 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 238, in main
    eval_type = read_int(infile)
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
    raise EOFError
EOFError
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:336)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:475)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:458)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 504 ms on localhost (executor driver) (7/24)
2020-06-28 07:09:57 ERROR PythonRunner:91 - This may have been caused by a prior exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO PythonAccumulatorV2:54 - Connected to AccumulatorServer at host: 127.0.0.1 port: 42450
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 7.0 in stage 0.0 (TID 7) in 502 ms on localhost (executor driver) (8/24)
2020-06-28 07:09:57 ERROR PythonRunner:91 - This may have been caused by a prior exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO Executor:54 - Finished task 10.0 in stage 0.0 (TID 10). 1461 bytes result sent to driver
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 6.0 in stage 0.0 (TID 6) in 504 ms on localhost (executor driver) (9/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 8.0 in stage 0.0 (TID 8) in 503 ms on localhost (executor driver) (10/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 15.0 in stage 0.0 (TID 15) in 500 ms on localhost (executor driver) (11/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 19.0 in stage 0.0 (TID 19) in 499 ms on localhost (executor driver) (12/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 22.0 in stage 0.0 (TID 22) in 498 ms on localhost (executor driver) (13/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 523 ms on localhost (executor driver) (14/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 9.0 in stage 0.0 (TID 9) in 505 ms on localhost (executor driver) (15/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 4.0 in stage 0.0 (TID 4) in 507 ms on localhost (executor driver) (16/24)
2020-06-28 07:09:57 ERROR Executor:91 - Exception in task 23.0 in stage 0.0 (TID 23)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 3.0 in stage 0.0 (TID 3) in 508 ms on localhost (executor driver) (17/24)
2020-06-28 07:09:57 ERROR Executor:91 - Exception in task 11.0 in stage 0.0 (TID 11)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 20.0 in stage 0.0 (TID 20) in 501 ms on localhost (executor driver) (18/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 21.0 in stage 0.0 (TID 21) in 501 ms on localhost (executor driver) (19/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 16.0 in stage 0.0 (TID 16) in 503 ms on localhost (executor driver) (20/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 2.0 in stage 0.0 (TID 2) in 509 ms on localhost (executor driver) (21/24)
2020-06-28 07:09:57 INFO TaskSetManager:54 - Finished task 10.0 in stage 0.0 (TID 10) in 506 ms on localhost (executor driver) (22/24)
2020-06-28 07:09:57 WARN TaskSetManager:66 - Lost task 11.0 in stage 0.0 (TID 11, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 ERROR TaskSetManager:70 - Task 11 in stage 0.0 failed 1 times; aborting job
2020-06-28 07:09:57 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2020-06-28 07:09:57 INFO TaskSetManager:54 - Lost task 23.0 in stage 0.0 (TID 23) on localhost, executor driver: net.razorvine.pickle.PickleException (expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)) [duplicate 1]
2020-06-28 07:09:57 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2020-06-28 07:09:57 INFO TaskSchedulerImpl:54 - Cancelling stage 0
2020-06-28 07:09:57 INFO DAGScheduler:54 - ResultStage 0 (count at /nfs/home3/tahmad/tahmad/script.py:301) failed in 0.681 s due to Job aborted due to stage failure: Task 11 in stage 0.0 failed 1 times, most recent failure: Lost task 11.0 in stage 0.0 (TID 11, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
Driver stacktrace:
2020-06-28 07:09:57 INFO DAGScheduler:54 - Job 0 failed: count at /nfs/home3/tahmad/tahmad/script.py:301, took 0.731943 s
Traceback (most recent call last):
  File "/nfs/home3/tahmad/tahmad/script.py", line 301, in <module>
    print(converted_ardd.count())
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1053, in count
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1044, in sum
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 915, in fold
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 814, in collect
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/home/tahmad/tahmad/spark-2.3.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 0.0 failed 1 times, most recent failure: Lost task 11.0 in stage 0.0 (TID 11, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1661)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1649)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1648)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1648)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1882)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1820)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:165)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyarrow.lib.type_for_alias)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:444)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:250)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:173)
2020-06-28 07:09:57 INFO SparkContext:54 - Invoking stop() from shutdown hook
2020-06-28 07:09:57 INFO AbstractConnector:318 - Stopped Spark@27de2e9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2020-06-28 07:09:57 INFO SparkUI:54 - Stopped Spark web UI at http://tcn862:4040
2020-06-28 07:09:57 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2020-06-28 07:09:57 INFO MemoryStore:54 - MemoryStore cleared
2020-06-28 07:09:57 INFO BlockManager:54 - BlockManager stopped
2020-06-28 07:09:57 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2020-06-28 07:09:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2020-06-28 07:09:57 INFO SparkContext:54 - Successfully stopped SparkContext
2020-06-28 07:09:57 INFO ShutdownHookManager:54 - Shutdown hook called
2020-06-28 07:09:57 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-405f8ca1-a57f-4cae-8fa4-d459dd74b5d7/pyspark-6272138d-ab9b-414e-9593-7406c89da076
2020-06-28 07:09:57 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-a864ecd8-9b40-4332-a06b-46f3f908421c
2020-06-28 07:09:57 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-405f8ca1-a57f-4cae-8fa4-d459dd74b5d7
{code}
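For comparison, here is the pandas-based route I am trying to avoid. This is only a minimal sketch, assuming `spark` is the active SparkSession; my reading of the log is that the JVM-side unpickler cannot reconstruct pyarrow objects (the PickleException on pyarrow.lib.type_for_alias above), whereas going RecordBatch -> pandas -> createDataFrame stays on documented APIs. I am not sure this is the intended way, though.

{code:java}
import pyarrow as pa

# Enable Arrow-assisted pandas conversion (Spark 2.3+ config).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Same RecordBatch as in the reproduction above.
data = [pa.array(range(5), type='int16'),
        pa.array([-10, -5, 0, None, 10], type='int32')]
batch = pa.record_batch(data, ['c0', 'c1'])

pdf = batch.to_pandas()           # Arrow RecordBatch -> pandas DataFrame
df = spark.createDataFrame(pdf)   # pandas DataFrame -> Spark DataFrame
print(df.count())
{code}

But this round-trips every batch through pandas on the driver, which is exactly the overhead I hoped to skip by parallelizing the RecordBatches directly.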