I have a simple Spark job that reads from Cassandra, pipes the rows through an
external script, and stores the results:

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("keyspace", "table")
      .map(r => r.getInt("column") + "\t" +
        write(get_lemmas(r.getString("tags"))))
      .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
      .map(r => convertStr(r))
      .coalesce(1, true)
      .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
      //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
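
A common workaround for the "already exists" failure, if overwriting the same path on rerun is acceptable, is to delete the output directory from the driver before writing. A minimal sketch, assuming the same `sc` and output path as above (uses Hadoop's `FileSystem` API, which needs the Hadoop client jars on the classpath; not something I can verify against your cluster):

    // Remove the output directory before writing, so reruns don't fail with
    // "Output directory ... already exists". Assumes `sc` is the SparkContext
    // created above.
    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("/tmp/pavel/CassandraPipeTest.txt")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true) // true = recursive delete
    }

Note that `saveAsTextFile` produces a directory containing `part-*` files rather than a single file, so the existence check must target the directory itself.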

When run on a single machine, everything works, whether I save to an HDFS file
or to Cassandra.
When run on the cluster, neither works:

 - When saving to a file, I get an exception: User class threw exception: Output 
directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt already exists
 - When saving to Cassandra, only 4 rows are updated, and with empty data (I am 
testing on a 4-machine Spark cluster)

Any hints on how to debug this and where the problem could be?

- I delete the HDFS file before running
- I would really like the output to HDFS to work first, so I can debug
- After that, it would be nice to save to Cassandra
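
One way to narrow down the empty-rows symptom: since `pipe` launches the command locally on each executor, the job silently degrades if `/tmp/scripts_and_models/scripts/run.py` (or `python3`) is missing on some workers. A hedged diagnostic sketch, assuming the same `sc` as above; the partition count is just chosen high enough that every worker is likely to run a task:

    import java.io.File
    import java.net.InetAddress

    // Run a cheap task across many partitions and report hostnames where the
    // pipe script is not present on the local filesystem.
    val missingHosts = sc.parallelize(1 to 1000, 100)
      .map { _ =>
        val host = InetAddress.getLocalHost.getHostName
        val scriptExists = new File("/tmp/scripts_and_models/scripts/run.py").exists
        (host, scriptExists)
      }
      .distinct()
      .filter { case (_, exists) => !exists }
      .map(_._1)
      .collect()

    println(s"Workers missing the script: ${missingHosts.mkString(", ")}")

An alternative is to ship the script with the job via `sc.addFile(...)` instead of pre-installing it on every machine.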
