I definitely delete the file on the right HDFS; I only have one HDFS instance.
The problem seems to be in the CassandraRDD - reading always fails in some way when run on the cluster, but single-machine reads are okay.

> On Feb 20, 2015, at 4:20 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
> The stupid question is whether you're deleting the file from hdfs on the right node?
>
> On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov <pavel.velik...@gmail.com> wrote:
>
> Yeah, I do manually delete the files, but it still fails with this error.
>
>> On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:
>>
>> When writing to HDFS, Spark will not overwrite existing files or directories. You must either delete these manually or use Hadoop's Java FileSystem class to remove them.
>>
>> -----Original Message-----
>> From: Pavel Velikhov [pavel.velik...@gmail.com]
>> Sent: Thursday, February 19, 2015 11:32 AM Eastern Standard Time
>> To: user@spark.apache.org
>> Subject: Spark job fails on cluster but works fine on a single machine
>>
>> I have a simple Spark job that goes out to Cassandra, runs a pipe, and stores the results:
>>
>> val sc = new SparkContext(conf)
>> val rdd = sc.cassandraTable("keyspace", "table")
>>   .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
>>   .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
>>   .map(r => convertStr(r))
>>   .coalesce(1, true)
>>   .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
>>   //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
>>
>> When run on a single machine, everything is fine whether I save to an HDFS file or save to Cassandra.
>> When run on the cluster, neither works:
>>
>> - When saving to file, I get an exception: User class threw exception: Output directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt already exists
>> - When saving to Cassandra, only 4 rows are updated, with empty data (I test on a 4-machine Spark cluster)
>>
>> Any hints on how to debug this and where the problem could be?
>>
>> - I delete the hdfs file before running
>> - Would really like the output to hdfs to work, so I can debug
>> - Then it would be nice to save to Cassandra
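As suggested above, one way to make the save repeatable is to delete the output directory from the driver via Hadoop's FileSystem API right before saveAsTextFile. A minimal sketch, assuming the SparkContext `sc` and the output path from the snippet above (the URI and path are from that example, not a general recipe):

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Path taken from the job above; saveAsTextFile actually writes a directory,
// so the delete must be recursive.
val outputDir = "hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt"

// Resolve the FileSystem for that URI using the job's Hadoop configuration.
val fs = FileSystem.get(new URI(outputDir), sc.hadoopConfiguration)
val outPath = new Path(outputDir)

if (fs.exists(outPath)) {
  fs.delete(outPath, true) // true = recursive
}
// ...now it is safe to call rdd.saveAsTextFile(outputDir)
```

Note this runs once on the driver, not inside a transformation, so it is independent of which cluster node executes the tasks. It also avoids the manual-deletion race the thread describes: the check and delete happen in the same JVM that launches the save.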