[ https://issues.apache.org/jira/browse/SPARK-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen reassigned SPARK-2601:
---------------------------------

    Assignee: Josh Rosen

> py4j.Py4JException on sc.pickleFile
> -----------------------------------
>
>                 Key: SPARK-2601
>                 URL: https://issues.apache.org/jira/browse/SPARK-2601
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Kevin Matzen
>            Assignee: Josh Rosen
>            Priority: Minor
>
> {code:title=test.py}
> from pyspark import SparkContext
> text_filename = 'README.md'
> pickled_filename = 'pickled_file'
> sc = SparkContext('local', 'Test Pipeline')
> text_file = sc.textFile(text_filename)
> text_file.saveAsPickleFile(pickled_filename)
> pickled_file = sc.pickleFile(pickled_filename)
> print pickled_file.first()
> {code}
> {code:title=bin/spark-submit test.py}
> 14/07/20 13:29:21 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 14/07/20 13:29:21 INFO SecurityManager: Changing view acls to: kmatzen
> 14/07/20 13:29:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kmatzen)
> 14/07/20 13:29:22 INFO Slf4jLogger: Slf4jLogger started
> 14/07/20 13:29:22 INFO Remoting: Starting remoting
> 14/07/20 13:29:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@en-cs-lr455-gr23.cs.cornell.edu:56405]
> 14/07/20 13:29:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@en-cs-lr455-gr23.cs.cornell.edu:56405]
> 14/07/20 13:29:22 INFO SparkEnv: Registering MapOutputTracker
> 14/07/20 13:29:22 INFO SparkEnv: Registering BlockManagerMaster
> 14/07/20 13:29:22 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140720132922-85ff
> 14/07/20 13:29:22 INFO ConnectionManager: Bound socket to port 35372 with id = ConnectionManagerId(en-cs-lr455-gr23.cs.cornell.edu,35372)
> 14/07/20 13:29:22 INFO MemoryStore: MemoryStore started with capacity 294.9 MB
> 14/07/20 13:29:22 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/20 13:29:22 INFO BlockManagerMasterActor: Registering block manager en-cs-lr455-gr23.cs.cornell.edu:35372 with 294.9 MB RAM
> 14/07/20 13:29:22 INFO BlockManagerMaster: Registered BlockManager
> 14/07/20 13:29:22 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4261985b-cefa-410b-913f-8e5eed6d4bb9
> 14/07/20 13:29:22 INFO HttpServer: Starting HTTP Server
> 14/07/20 13:29:27 INFO SparkUI: Started SparkUI at http://en-cs-lr455-gr23.cs.cornell.edu:4040
> 14/07/20 13:29:28 INFO Utils: Copying /home/kmatzen/spark/test.py to /tmp/spark-1f6dc228-f540-4824-b128-2d9b98a7ddcc/test.py
> 14/07/20 13:29:28 INFO SparkContext: Added file file:/home/kmatzen/spark/test.py at http://128.84.103.23:45332/files/test.py with timestamp 1405877368118
> 14/07/20 13:29:28 INFO MemoryStore: ensureFreeSpace(34134) called with curMem=0, maxMem=309225062
> 14/07/20 13:29:28 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.3 KB, free 294.9 MB)
> 14/07/20 13:29:28 INFO SequenceFileRDDFunctions: Saving as sequence file of type (NullWritable,BytesWritable)
> 14/07/20 13:29:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/07/20 13:29:28 WARN LoadSnappy: Snappy native library not loaded
> 14/07/20 13:29:28 INFO FileInputFormat: Total input paths to process : 1
> 14/07/20 13:29:28 INFO SparkContext: Starting job: saveAsObjectFile at NativeMethodAccessorImpl.java:-2
> 14/07/20 13:29:28 INFO DAGScheduler: Got job 0 (saveAsObjectFile at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=false)
> 14/07/20 13:29:28 INFO DAGScheduler: Final stage: Stage 0(saveAsObjectFile at NativeMethodAccessorImpl.java:-2)
> 14/07/20 13:29:28 INFO DAGScheduler: Parents of final stage: List()
> 14/07/20 13:29:28 INFO DAGScheduler: Missing parents: List()
> 14/07/20 13:29:28 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[4] at saveAsObjectFile at NativeMethodAccessorImpl.java:-2), which has no missing parents
> 14/07/20 13:29:28 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[4] at saveAsObjectFile at NativeMethodAccessorImpl.java:-2)
> 14/07/20 13:29:28 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/07/20 13:29:28 INFO TaskSetManager: Re-computing pending task lists.
> 14/07/20 13:29:28 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 5627 bytes)
> 14/07/20 13:29:28 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/07/20 13:29:28 INFO Executor: Fetching http://128.84.103.23:45332/files/test.py with timestamp 1405877368118
> 14/07/20 13:29:28 INFO Utils: Fetching http://128.84.103.23:45332/files/test.py to /tmp/fetchFileTemp7128156481344843652.tmp
> 14/07/20 13:29:28 INFO BlockManager: Found block broadcast_0 locally
> 14/07/20 13:29:28 INFO HadoopRDD: Input split: file:/home/kmatzen/spark/README.md:0+4506
> 14/07/20 13:29:28 INFO PythonRDD: Times: total = 218, boot = 196, init = 20, finish = 2
> 14/07/20 13:29:28 INFO FileOutputCommitter: Saved output of task 'attempt_201407201329_0000_m_000000_0' to file:/home/kmatzen/spark/pickled_file
> 14/07/20 13:29:28 INFO SparkHadoopWriter: attempt_201407201329_0000_m_000000_0: Committed
> 14/07/20 13:29:28 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1755 bytes result sent to driver
> 14/07/20 13:29:28 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 431 ms on localhost (1/1)
> 14/07/20 13:29:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 14/07/20 13:29:28 INFO DAGScheduler: Stage 0 (saveAsObjectFile at NativeMethodAccessorImpl.java:-2) finished in 0.454 s
> 14/07/20 13:29:28 INFO SparkContext: Job finished: saveAsObjectFile at NativeMethodAccessorImpl.java:-2, took 0.518879869 s
> 14/07/20 13:29:28 INFO MemoryStore: ensureFreeSpace(34158) called with curMem=34134, maxMem=309225062
> 14/07/20 13:29:28 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 33.4 KB, free 294.8 MB)
> 14/07/20 13:29:28 INFO FileInputFormat: Total input paths to process : 1
> Traceback (most recent call last):
>   File "/home/kmatzen/spark/test.py", line 12, in <module>
>     print pickled_file.first()
>   File "/home/kmatzen/spark/python/pyspark/rdd.py", line 984, in first
>     return self.take(1)[0]
>   File "/home/kmatzen/spark/python/pyspark/rdd.py", line 970, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/home/kmatzen/spark/python/pyspark/context.py", line 714, in runJob
>     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
>   File "/home/kmatzen/spark/python/pyspark/rdd.py", line 1600, in _jrdd
>     class_tag)
>   File "/home/kmatzen/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 669, in __call__
>   File "/home/kmatzen/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 304, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
> py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.FlatMappedRDD, class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.Boolean, class java.lang.String, class java.util.ArrayList, class org.apache.spark.Accumulator, class scala.reflect.ManifestFactory$$anon$2]) does not exist
>         at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:184)
>         at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:202)
>         at py4j.Gateway.invoke(Gateway.java:213)
>         at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>         at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> Looks like this is related: https://issues.apache.org/jira/browse/SPARK-1034



--
This message was sent by Atlassian JIRA
(v6.2#6252)
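For anyone triaging this, a slightly tighter repro sketch follows. It removes the dependence on README.md by pickling a parallelized range instead, so it isolates the saveAsPickleFile -> pickleFile round trip that fails above. This is a sketch under stated assumptions, not a verified test: the script name, the /tmp output path, and the try/except wrapper are my additions; the PySpark calls (parallelize, saveAsPickleFile, pickleFile, first) and the py4j.protocol.Py4JError class are taken from the report itself.

{code:title=pickle_roundtrip_check.py (hypothetical repro sketch)}
# Minimal check of the saveAsPickleFile -> pickleFile round trip.
# Assumes a local PySpark checkout on PYTHONPATH, as in the report above.
import shutil

from py4j.protocol import Py4JError  # the error class seen in the traceback
from pyspark import SparkContext

# Illustrative path, not from the report; cleared first because
# Hadoop-style output refuses to overwrite an existing directory.
output_path = '/tmp/spark-2601-pickled'
shutil.rmtree(output_path, ignore_errors=True)

sc = SparkContext('local', 'SPARK-2601 repro')
sc.parallelize(range(10)).saveAsPickleFile(output_path)
try:
    # first() goes through take(1) -> runJob -> PythonRDD construction,
    # which is where the Py4JException surfaces in the traceback above.
    print sc.pickleFile(output_path).first()
except Py4JError as e:
    print 'affected by SPARK-2601:', e
{code}

On an affected build this should hit the same "Constructor org.apache.spark.api.python.PythonRDD(...) does not exist" failure; on a fixed build it should print 0.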