I have a simple Spark job that reads from Cassandra, pipes the rows through an
external script, and stores the results:
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("keyspace", "table")
  .map(r => r.getInt("column") + "\t" +
    write(get_lemmas(r.getString("tags"))))
  .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
  .map(r => convertStr(r))
  .coalesce(1, true)
  .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
  //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
When run on a single machine, everything is fine whether I save to an HDFS file
or to Cassandra.
When run on the cluster, neither works:
- When saving to a file, I get an exception: User class threw exception: Output
directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt already exists
- When saving to Cassandra, only 4 rows are updated, each with empty data (I am
testing on a 4-machine Spark cluster)
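A debugging sketch I'm considering, to narrow down the stage where rows are lost on the cluster (this reuses the same stages as the job above; `get_lemmas`, `write`, and the script path are from my code, and `count()`/`take()` are assumed to be cheap enough on this data set):

```scala
// Materialize intermediate results stage by stage to see where data disappears.
val input = sc.cassandraTable("keyspace", "table")
  .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
println(s"rows going into the pipe: ${input.count()}")

val piped = input.pipe("python3 /tmp/scripts_and_models/scripts/run.py")
println(s"rows coming out of the pipe: ${piped.count()}")

// Inspect a few raw lines of the script's output on the driver.
piped.take(5).foreach(println)
```

If the count drops to zero after the pipe stage, that would point at the external script rather than at the save step.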
Any hints on how to debug this and where the problem could be?
- I delete the HDFS output directory before each run
- I would really like saving to HDFS to work first, so I can debug
- After that, it would be nice to save to Cassandra as well
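In case it matters, I could also delete the output directory programmatically instead of by hand, so the delete definitely hits the same filesystem the job writes to (a manual delete against the wrong filesystem, e.g. the local one instead of HDFS, seems like a plausible cause of the "already exists" error). A sketch, assuming `sc.hadoopConfiguration` points at the cluster's HDFS:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the FileSystem from the output path's own URI, then delete
// the directory recursively before saveAsTextFile runs.
val outPath = new Path("hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt")
val fs = FileSystem.get(outPath.toUri, sc.hadoopConfiguration)
if (fs.exists(outPath)) fs.delete(outPath, true) // true = recursive
```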