Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread mj
I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running
PySpark via PyCharm and the information for my environment is:

OS: Windows 7
Python version: 2.7.9
Spark version: 1.1.1
Java version: 1.8

I've also included the py file I am using. I'd appreciate any help you can
give me, 

MJ.
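
The .py file mentioned above is not reproduced in this archive. For context, here is a minimal sketch, assuming PySpark 1.1.1 on Python 2.7, of the kind of program described; the file names, column layouts, and record counts are illustrative, not taken from the original attachment.

# Hypothetical reconstruction: generate a dimension CSV and a large
# fact CSV, then leftOuterJoin the fact RDD to the dimension RDD.
import csv
from pyspark import SparkContext

def write_csv(path, rows):
    with open(path, "wb") as f:  # binary mode for the csv module on Python 2
        csv.writer(f).writerows(rows)

write_csv("dimension.csv", [(i, "dim_%d" % i) for i in xrange(100)])
write_csv("fact.csv", ((i, i % 100, i * 1.0) for i in xrange(3000000)))

sc = SparkContext("local", "join_test")

# Key both RDDs on the dimension id (second fact column, first dim column).
facts = (sc.textFile("fact.csv")
         .map(lambda line: line.split(","))
         .map(lambda f: (f[1], (f[0], f[2]))))
dims = (sc.textFile("dimension.csv")
        .map(lambda line: line.split(","))
        .map(lambda d: (d[0], d[1])))

print facts.leftOuterJoin(dims).count()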


ERROR MESSAGE
C:\Python27\python.exe C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/12/16 12:48:26 INFO SecurityManager: Changing view acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: Changing modify acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Mark Jones, ); users with modify permissions: Set(Mark Jones, )
14/12/16 12:48:26 INFO Slf4jLogger: Slf4jLogger started
14/12/16 12:48:27 INFO Remoting: Starting remoting
14/12/16 12:48:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Utils: Successfully started service 'sparkDriver' on port 51387.
14/12/16 12:48:27 INFO SparkEnv: Registering MapOutputTracker
14/12/16 12:48:27 INFO SparkEnv: Registering BlockManagerMaster
14/12/16 12:48:27 INFO DiskBlockManager: Created local directory at C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141216124827-11ef
14/12/16 12:48:27 INFO Utils: Successfully started service 'Connection manager for block manager' on port 51390.
14/12/16 12:48:27 INFO ConnectionManager: Bound socket to port 51390 with id = ConnectionManagerId(192.168.19.83,51390)
14/12/16 12:48:27 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
14/12/16 12:48:27 INFO BlockManagerMaster: Trying to register BlockManager
14/12/16 12:48:27 INFO BlockManagerMasterActor: Registering block manager 192.168.19.83:51390 with 265.1 MB RAM
14/12/16 12:48:27 INFO BlockManagerMaster: Registered BlockManager
14/12/16 12:48:27 INFO HttpFileServer: HTTP File server directory is C:\Users\MARKJO~1\AppData\Local\Temp\spark-3b772ca1-dbf7-4eaa-b62c-be5e73036f5d
14/12/16 12:48:27 INFO HttpServer: Starting HTTP Server
14/12/16 12:48:27 INFO Utils: Successfully started service 'HTTP file server' on port 51391.
14/12/16 12:48:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/12/16 12:48:27 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
14/12/16 12:48:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at ...

Re: Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread Sebastián Ramírez
Your Spark is trying to locate a Hadoop helper binary, winutils.exe, which you
don't have on your Windows machine (the "null" in the path below comes from
HADOOP_HOME / hadoop.home.dir not being set):

14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
...


It's a known bug: https://issues.apache.org/jira/browse/SPARK-2356

That issue references this email thread:
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7

And that email thread references this blog post:
https://social.msdn.microsoft.com/forums/azure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight


I had the same problem before. You may be able to work around it temporarily
by using the pre-built distribution from the Spark Summit training materials:
https://databricks-training.s3.amazonaws.com/getting-started.html
Or you may want to try the workaround described in the blog post above.
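
In case it helps, the workaround I've seen most often for this (an assumption on my part, not something spelled out in the links above) is to download a winutils.exe matching your Hadoop build, put it in some folder's bin\ subdirectory, and point HADOOP_HOME at that folder before the SparkContext is created:

import os

# Hypothetical path: C:\hadoop must contain bin\winutils.exe for
# Hadoop's Shell.java lookup to succeed. Set it before the JVM starts.
os.environ["HADOOP_HOME"] = "C:\\hadoop"

from pyspark import SparkContext
sc = SparkContext("local", "test")

Setting the variable inside the driver script works because the JVM that PySpark launches inherits the Python process's environment; setting HADOOP_HOME system-wide in Windows works as well.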

...but in my case, I ended up running Spark on a Linux machine in a VM after
I hit other errors.
My impression is that Windows support is not currently a high priority, since
this bug has been around since version 1.0...

I hope that helps.

Best,


*Sebastián Ramírez*
Algorithm Designer

 http://www.senseta.com

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo https://twitter.com/tiangolo
 Email: sebastian.rami...@senseta.com
 www.senseta.com

On Tue, Dec 16, 2014 at 8:04 AM, mj jone...@gmail.com wrote:

 ...