I am using Spark to aggregate logs that land in HDFS throughout the day. The job kicks off 15 minutes after the hour and processes anything that landed during the previous hour.
For example, the 2:15pm job will process anything that came in from 1:00pm-2:00pm. 99.9% of that data will consist of logs actually from the 1:00pm-2:00pm timespan, but 0.1% will be data that, for one of several reasons, trickled in from the 12:00pm hour or even earlier.

What I'd like to do is split my RDD by timestamp into several RDDs, then use saveAsTextFile() to write each RDD to its proper location on disk. So 99.9% of the example data would go to /user/me/output/2014-11-29/13, a small portion would go to /user/me/output/2014-11-29/12, and if a couple of rows trickle in from the 10am hour, that aggregation would go to /user/me/output/2014-11-29/10.

But when I run the job, I get an error for the trickle-in /12 and /10 data saying those directories already exist. Is there a way I can do something like an INSERT INTO with saveAsTextFile to "append" to an existing directory?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Appending-with-saveAsTextFile-tp20031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
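To make the bucketing concrete, here is a minimal sketch in plain Python of the path logic I have in mind, assuming epoch-second timestamps; the names partition_path, bucket_lines, and run_path are mine, not from any Spark API. In the actual job this would become an rdd.keyBy(...) followed by one filter() + saveAsTextFile() per distinct target directory:

```python
from datetime import datetime, timezone
from collections import defaultdict

def partition_path(epoch_seconds, base="/user/me/output"):
    """Map a log timestamp to its hourly output directory,
    e.g. /user/me/output/2014-11-29/13 for 1pm UTC."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/{dt:%Y-%m-%d}/{dt:%H}"

def bucket_lines(timestamped_lines):
    """Group (timestamp, line) pairs by target directory.
    In Spark, each resulting key would get its own filtered
    RDD and its own saveAsTextFile() call."""
    buckets = defaultdict(list)
    for ts, line in timestamped_lines:
        buckets[partition_path(ts)].append(line)
    return dict(buckets)

def run_path(epoch_seconds, run_id, base="/user/me/output"):
    """Hypothetical workaround for 'directory already exists':
    write each job run under a unique per-run subdirectory
    instead of appending to the hour directory itself."""
    return f"{partition_path(epoch_seconds, base)}/run-{run_id}"
```

The run_path idea is one possible way around the collision, since saveAsTextFile() refuses to write into an existing directory, but I'd prefer a real append if one exists.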