I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running

Your Spark installation is trying to load a Hadoop helper binary, winutils.exe, which you don't have on your Windows machine:
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at
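The usual remedy is to supply winutils.exe yourself and tell Spark where to find it via the HADOOP_HOME environment variable, set before the SparkContext (and hence the JVM) starts. A minimal sketch, assuming you have placed winutils.exe under C:\hadoop\bin — that path is an assumption, adjust it to your own setup:

```python
import os

# Assumed layout: C:\hadoop\bin\winutils.exe (hypothetical path, not from the
# original post). HADOOP_HOME must point at the directory *containing* bin\,
# not at bin\ itself.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

# The variable is read when the JVM launches, so set it before creating the
# context. With it in place, the fact/dimension join program can start as usual:
# from pyspark import SparkContext
# sc = SparkContext("local[*]", "fact-dim-join")
```

Equivalently, you can set HADOOP_HOME system-wide in Windows environment variables so every Spark run picks it up without code changes.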