It seems you are missing HADOOP_HOME in your environment. As the error says:

    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

That *null* is supposed to be your HADOOP_HOME.
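Hadoop's Shell utility builds the winutils path as %HADOOP_HOME%\bin\winutils.exe, so when the variable is unset the prefix is rendered as null. A minimal sketch of a fix from the Python side, assuming you have downloaded winutils.exe into C:\hadoop\bin (that path is only an assumption -- use whichever directory actually holds it):

    import os

    # Hypothetical location; HADOOP_HOME must point at the directory that
    # contains bin\winutils.exe, not at winutils.exe itself.
    os.environ["HADOOP_HOME"] = "C:\\hadoop"

    # Set the variable *before* creating the SparkContext: py4j starts the
    # JVM as a child process, so it inherits the driver's environment.
    from pyspark import SparkContext
    sc = SparkContext("local", "winutils_check")

    items = ["Hello", "world"]
    sc.parallelize(items).coalesce(1).saveAsTextFile("c:/tmp/python_out.csv")

Alternatively, set it machine-wide from a command prompt (setx HADOOP_HOME C:\hadoop) and restart PyCharm so the new value is picked up.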
Thanks
Best Regards

On Thu, Dec 18, 2014 at 7:10 PM, mj <jone...@gmail.com> wrote:

> Hi,
>
> I'm trying to use pyspark to save a simple rdd to a text file (code
> below), but it keeps throwing an error.
>
> ----- Python Code -----
> items = ["Hello", "world"]
> items2 = sc.parallelize(items)
> items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')
>
> ----- Error -----
> C:\Python27\python.exe "C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py"
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 14/12/18 13:00:53 INFO SecurityManager: Changing view acls to: Mark Jones,
> 14/12/18 13:00:53 INFO SecurityManager: Changing modify acls to: Mark Jones,
> 14/12/18 13:00:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Mark Jones, ); users with modify permissions: Set(Mark Jones, )
> 14/12/18 13:00:53 INFO Slf4jLogger: Slf4jLogger started
> 14/12/18 13:00:53 INFO Remoting: Starting remoting
> 14/12/18 13:00:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.19.83:54548]
> 14/12/18 13:00:53 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.19.83:54548]
> 14/12/18 13:00:53 INFO Utils: Successfully started service 'sparkDriver' on port 54548.
> 14/12/18 13:00:53 INFO SparkEnv: Registering MapOutputTracker
> 14/12/18 13:00:53 INFO SparkEnv: Registering BlockManagerMaster
> 14/12/18 13:00:53 INFO DiskBlockManager: Created local directory at C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141218130053-1ab9
> 14/12/18 13:00:53 INFO Utils: Successfully started service 'Connection manager for block manager' on port 54551.
> 14/12/18 13:00:53 INFO ConnectionManager: Bound socket to port 54551 with id = ConnectionManagerId(192.168.19.83,54551)
> 14/12/18 13:00:53 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
> 14/12/18 13:00:53 INFO BlockManagerMaster: Trying to register BlockManager
> 14/12/18 13:00:53 INFO BlockManagerMasterActor: Registering block manager 192.168.19.83:54551 with 265.1 MB RAM
> 14/12/18 13:00:53 INFO BlockManagerMaster: Registered BlockManager
> 14/12/18 13:00:53 INFO HttpFileServer: HTTP File server directory is C:\Users\MARKJO~1\AppData\Local\Temp\spark-a43340e8-2621-46b8-a44e-8874dd178393
> 14/12/18 13:00:53 INFO HttpServer: Starting HTTP Server
> 14/12/18 13:00:54 INFO Utils: Successfully started service 'HTTP file server' on port 54552.
> 14/12/18 13:00:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> 14/12/18 13:00:54 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
> 14/12/18 13:00:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/12/18 13:00:54 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
>         at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>         at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
>         at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
>         at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>         at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
>         at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>         at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>         at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>         at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
>         at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
>         at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
>         at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
>         at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:214)
>         at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>         at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:745)
> 14/12/18 13:00:54 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.19.83:54548/user/HeartbeatReceiver
> 14/12/18 13:00:55 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
> 14/12/18 13:00:55 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
> 14/12/18 13:00:55 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
> 14/12/18 13:00:55 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
> 14/12/18 13:00:55 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
> 14/12/18 13:00:55 INFO SparkContext: Starting job: saveAsTextFile at NativeMethodAccessorImpl.java:-2
> 14/12/18 13:00:55 INFO DAGScheduler: Got job 0 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=false)
> 14/12/18 13:00:55 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
> 14/12/18 13:00:55 INFO DAGScheduler: Parents of final stage: List()
> 14/12/18 13:00:55 INFO DAGScheduler: Missing parents: List()
> 14/12/18 13:00:55 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[4] at saveAsTextFile at NativeMethodAccessorImpl.java:-2), which has no missing parents
> 14/12/18 13:00:55 INFO MemoryStore: ensureFreeSpace(59464) called with curMem=0, maxMem=278019440
> 14/12/18 13:00:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 58.1 KB, free 265.1 MB)
> 14/12/18 13:00:55 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[4] at saveAsTextFile at NativeMethodAccessorImpl.java:-2)
> 14/12/18 13:00:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/12/18 13:00:55 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1350 bytes)
> 14/12/18 13:00:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/12/18 13:00:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
>         at org.apache.hadoop.util.Shell.run(Shell.java:418)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
>         at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
>         at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
>         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
>         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
>         at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>         at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 14/12/18 13:00:56 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException:
>         java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
>         org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
>         org.apache.hadoop.util.Shell.run(Shell.java:418)
>         org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
>         org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
>         org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
>         org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
>         org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
>         org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
>         org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>         org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>         org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
>         org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>         org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
>         org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)
>         org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         java.lang.Thread.run(Thread.java:745)
> 14/12/18 13:00:56 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> 14/12/18 13:00:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 14/12/18 13:00:56 INFO TaskSchedulerImpl: Cancelling stage 0
> 14/12/18 13:00:56 INFO DAGScheduler: Failed to run saveAsTextFile at NativeMethodAccessorImpl.java:-2
> Traceback (most recent call last):
>   File "C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py", line 86, in <module>
>     items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')
>   File "C:\apps\spark\python\pyspark\rdd.py", line 1324, in saveAsTextFile
>     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File "C:\apps\spark\python\build\py4j\java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
>   File "C:\apps\spark\python\build\py4j\protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.saveAsTextFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException:
>         java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
>         org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
>         org.apache.hadoop.util.Shell.run(Shell.java:418)
>         org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
>         org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
>         org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
>         org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
>         org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
>         org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
>         org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>         org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>         org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
>         org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>         org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
>         org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)
>         org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Process finished with exit code 1