I'm definitely deleting the file on the right HDFS; I only have one HDFS instance.

The problem seems to be in the CassandraRDD: reads always fail in some way 
when run on the cluster, but single-machine reads are fine.
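
To rule the rest of the pipeline out, a minimal read-only test to run on the 
cluster (a sketch, using the same keyspace/table placeholders as the job below; 
no pipe, no save):

import com.datastax.spark.connector._

// If a bare count already fails on the cluster, the pipe and save
// stages are not the culprit.
val n = sc.cassandraTable("keyspace", "table").count()
println(s"Read $n rows from Cassandra")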



> On Feb 20, 2015, at 4:20 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:
> 
> The stupid question: are you deleting the file from HDFS on the right node?
> On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov <pavel.velik...@gmail.com> wrote:
> Yeah, I do manually delete the files, but it still fails with this error.
> 
>> On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:
>> 
>> When writing to HDFS, Spark will not overwrite existing files or 
>> directories. You must either delete them manually or use the Hadoop 
>> FileSystem class to remove them.
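>> 
>> A minimal sketch of doing that from the driver right before saving (the 
>> output path is the one from the job below; sc is the job's SparkContext, 
>> and the calls are the standard Hadoop FileSystem API):
>> 
>> import org.apache.hadoop.fs.{FileSystem, Path}
>> 
>> // Recursively remove the old output directory if it is still around.
>> val fs = FileSystem.get(sc.hadoopConfiguration)
>> val out = new Path("/tmp/pavel/CassandraPipeTest.txt")
>> if (fs.exists(out)) fs.delete(out, true)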
>> 
>> -----Original Message-----
>> From: Pavel Velikhov [pavel.velik...@gmail.com]
>> Sent: Thursday, February 19, 2015 11:32 AM Eastern Standard Time
>> To: user@spark.apache.org
>> Subject: Spark job fails on cluster but works fine on a single machine
>> 
>> I have a simple Spark job that reads from Cassandra, pipes the data through 
>> an external script, and stores the results:
>> 
>> import com.datastax.spark.connector._  // for cassandraTable / saveToCassandra
>> 
>> val sc = new SparkContext(conf)
>> val rdd = sc.cassandraTable("keyspace", "table")
>>       .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
>>       .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
>>       .map(r => convertStr(r))
>>       .coalesce(1, true)
>>       .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
>>       //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
>> 
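>> (A side note on pipe(): the command runs on every executor, so python3 and 
>> /tmp/scripts_and_models/scripts/run.py must exist on every worker node, not 
>> just the driver. A quick diagnostic sketch that reports, per host, whether 
>> the script is present; the path is the one from the job above:)
>> 
>> // One small task per partition; each records its host and whether
>> // the piped script exists locally, then duplicates are dropped.
>> val report = sc.parallelize(1 to 100, 100).map { _ =>
>>   val host = java.net.InetAddress.getLocalHost.getHostName
>>   val ok = new java.io.File("/tmp/scripts_and_models/scripts/run.py").exists
>>   (host, ok)
>> }.distinct().collect()
>> report.foreach(println)
>> 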
>> When run on a single machine, everything is fine whether I save to an HDFS 
>> file or to Cassandra.
>> When run on the cluster, neither works:
>> 
>>  - When saving to a file, I get an exception: User class threw exception: 
>> Output directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt 
>> already exists
>>  - When saving to Cassandra, only 4 rows are updated, and with empty data 
>> (I'm testing on a 4-machine Spark cluster)
>> 
>> Any hints on how to debug this and where the problem could be?
>> 
>> - I delete the HDFS file before running (a config-level workaround is 
>> sketched below)
>> - I'd really like the output to HDFS to work, so I can debug
>> - Then it would be nice to save to Cassandra as well
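>> 
>> (A config-level workaround, as a sketch: Spark's documented 
>> spark.hadoop.validateOutputSpecs setting turns off the 
>> "output directory already exists" check entirely, at the price of silently 
>> overwriting old output. The app name below is just a placeholder.)
>> 
>> import org.apache.spark.SparkConf
>> 
>> // Disable output-spec validation for saveAsTextFile / saveAsHadoopFile.
>> val conf = new SparkConf()
>>   .setAppName("CassandraPipeTest")  // placeholder name
>>   .set("spark.hadoop.validateOutputSpecs", "false")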
>> 
> 
