Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread mj
I've got a simple PySpark program that generates two CSV files and then performs a leftOuterJoin (a fact RDD joined to a dimension RDD). The program works fine for smaller volumes of records, but once the fact dataset grows beyond 3 million records, I get the error below. I'm running
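For context, a minimal sketch of the shape of such a program (the file paths and key positions here are illustrative assumptions, not taken from the original post):

    from pyspark import SparkContext

    sc = SparkContext(appName="FactDimJoin")

    # Hypothetical input paths; parse each CSV line into a (key, values)
    # pair so the two RDDs can be joined on the dimension key.
    fact = sc.textFile("fact.csv") \
             .map(lambda line: line.split(",")) \
             .map(lambda fields: (fields[0], fields[1:]))
    dim = sc.textFile("dimension.csv") \
            .map(lambda line: line.split(",")) \
            .map(lambda fields: (fields[0], fields[1:]))

    # leftOuterJoin keeps every fact row; unmatched keys get None on the right.
    joined = fact.leftOuterJoin(dim)
    joined.saveAsTextFile("joined_output")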

Re: Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread Sebastián Ramírez
Your Spark is trying to load a Hadoop helper binary, winutils.exe, which is not present on your Windows machine:

    14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at
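A common workaround on Windows (a sketch based on general practice, not something stated in this thread) is to download a winutils.exe matching your Hadoop version, place it under a bin directory, and point HADOOP_HOME at the parent directory before the SparkContext is created. The path below is a hypothetical example:

    import os
    from pyspark import SparkContext

    # Assumes winutils.exe lives at C:\hadoop\bin\winutils.exe;
    # Hadoop's Shell class resolves the binary via HADOOP_HOME.
    os.environ["HADOOP_HOME"] = "C:\\hadoop"

    sc = SparkContext(appName="Example")

Setting the environment variable in Python works because the JVM that Spark launches inherits the driver process's environment.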