Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)
I've got a simple pyspark program that generates two CSV files and then carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The program works fine for smaller volumes of records, but when it goes beyond 3 million records for the fact dataset, I get the error below.

I'm running PySpark via PyCharm, and my environment is:

OS: Windows 7
Python version: 2.7.9
Spark version: 1.1.1
Java version: 1.8

I've also included the .py file I am using. I'd appreciate any help you can give me.

MJ

ERROR MESSAGE

C:\Python27\python.exe C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/12/16 12:48:26 INFO SecurityManager: Changing view acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: Changing modify acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Mark Jones, ); users with modify permissions: Set(Mark Jones, )
14/12/16 12:48:26 INFO Slf4jLogger: Slf4jLogger started
14/12/16 12:48:27 INFO Remoting: Starting remoting
14/12/16 12:48:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Utils: Successfully started service 'sparkDriver' on port 51387.
14/12/16 12:48:27 INFO SparkEnv: Registering MapOutputTracker
14/12/16 12:48:27 INFO SparkEnv: Registering BlockManagerMaster
14/12/16 12:48:27 INFO DiskBlockManager: Created local directory at C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141216124827-11ef
14/12/16 12:48:27 INFO Utils: Successfully started service 'Connection manager for block manager' on port 51390.
14/12/16 12:48:27 INFO ConnectionManager: Bound socket to port 51390 with id = ConnectionManagerId(192.168.19.83,51390)
14/12/16 12:48:27 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
14/12/16 12:48:27 INFO BlockManagerMaster: Trying to register BlockManager
14/12/16 12:48:27 INFO BlockManagerMasterActor: Registering block manager 192.168.19.83:51390 with 265.1 MB RAM
14/12/16 12:48:27 INFO BlockManagerMaster: Registered BlockManager
14/12/16 12:48:27 INFO HttpFileServer: HTTP File server directory is C:\Users\MARKJO~1\AppData\Local\Temp\spark-3b772ca1-dbf7-4eaa-b62c-be5e73036f5d
14/12/16 12:48:27 INFO HttpServer: Starting HTTP Server
14/12/16 12:48:27 INFO Utils: Successfully started service 'HTTP file server' on port 51391.
14/12/16 12:48:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/12/16 12:48:27 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
14/12/16 12:48:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
    at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
    at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
    at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
    at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at
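For readers unfamiliar with the operation involved: leftOuterJoin on pair RDDs keeps every key from the left (fact) side and attaches the matching dimension value, or None where there is no match. A pure-Python sketch of those semantics (the sample data and function name are invented for illustration; this is not the attached .py file, which was not reproduced in the archive):

```python
# Pure-Python sketch of PySpark's leftOuterJoin semantics on (key, value)
# pairs: every fact row is kept; keys missing from the dimension side get
# None as the right-hand value. Sample data below is invented.

def left_outer_join(fact, dim):
    # Build a lookup of dimension values per key (a key may repeat).
    dim_lookup = {}
    for key, value in dim:
        dim_lookup.setdefault(key, []).append(value)

    result = []
    for key, value in fact:
        if key in dim_lookup:
            for dim_value in dim_lookup[key]:
                result.append((key, (value, dim_value)))
        else:
            result.append((key, (value, None)))  # no dimension match
    return result

fact = [(1, "sale_a"), (2, "sale_b"), (3, "sale_c")]
dim = [(1, "region_x"), (2, "region_y")]
print(left_outer_join(fact, dim))
# [(1, ('sale_a', 'region_x')), (2, ('sale_b', 'region_y')), (3, ('sale_c', None))]
```

In PySpark the equivalent would be `fact_rdd.leftOuterJoin(dim_rdd)`, which shuffles both RDDs by key; with 3 million fact rows that shuffle is where memory pressure typically shows up.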
Re: Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)
Your Spark is trying to load a Hadoop binary, winutils.exe, which is not present on your Windows machine:

14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    ...

It's a known bug: https://issues.apache.org/jira/browse/SPARK-2356

That issue references this email thread:
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7

And that email thread references this blog post:
https://social.msdn.microsoft.com/forums/azure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight

I had the same problem before; you may solve it temporarily by using the distribution from the Spark Summit training:
https://databricks-training.s3.amazonaws.com/getting-started.html

Or you may want to try that other solution. In my case, though, I ended up running Spark from a Linux machine in a VM after I hit other errors. I have the impression that development for Windows is not currently a big priority, since the bug has been open since version 1.0...

I hope that helps.

Best,

Sebastián Ramírez
Algorithm Designer
http://www.senseta.com
Tel: (+571) 795 7950 ext: 1012
Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
LinkedIn: co.linkedin.com/in/tiangolo/
Twitter: @tiangolo (https://twitter.com/tiangolo)
Email: sebastian.rami...@senseta.com
www.senseta.com

On Tue, Dec 16, 2014 at 8:04 AM, mj jone...@gmail.com wrote:

> I've got a simple pyspark program that generates two CSV files and then
> carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
> program works fine for smaller volumes of records, but when it goes beyond
> 3 million records for the fact dataset, I get the error below.
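One note on the error text itself: the path null\bin\winutils.exe means Hadoop's hadoop.home.dir (normally derived from the HADOOP_HOME environment variable) was never set, so it concatenates null with \bin\winutils.exe. A commonly reported workaround (my assumption, not suggested elsewhere in this thread) is to obtain winutils.exe, place it in a bin\ folder under some directory, and point HADOOP_HOME at that directory before the SparkContext is created:

```python
import os

# Workaround sketch, under the assumption that winutils.exe already exists
# at C:\hadoop\bin\winutils.exe (C:\hadoop is an invented example path).
# This must run before the SparkContext is constructed.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

# Hadoop builds the binary path as %HADOOP_HOME%\bin\winutils.exe, so it
# will now resolve this instead of null\bin\winutils.exe:
winutils = os.environ["HADOOP_HOME"] + r"\bin\winutils.exe"
print(winutils)  # C:\hadoop\bin\winutils.exe

# Only then create the context, e.g.:
# from pyspark import SparkContext
# sc = SparkContext("local", "spark_error_sample")
```

Note this only silences the winutils lookup; whether it also fixes the original dump_stream failure at 3 million records is a separate question.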