Say there are some logs:

    s3://log-collections/sys1/20141212/nginx.gz
    s3://log-collections/sys1/20141213/nginx-part-1.gz
    s3://log-collections/sys1/20141213/nginx-part-2.gz
I have a function that parses the logs for later analysis, and I want to parse all the files. So I do this:

    logs = sc.textFile('s3://log-collections/sys1/')
    logs.map(parse).saveAsTextFile('s3://parsed-logs/')

BUT this destroys the per-date naming scheme, resulting in:

    s3://parsed-logs/part-0000
    s3://parsed-logs/part-0001
    ...

And the worse part is that when a new day's logs arrive, rdd.saveAsTextFile can't just append them to the existing output. So instead I create an RDD for every single file, parse it, and save it under the name I want, like this:

    one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
    one.map(parse).saveAsTextFile("s3://parsed-logs/20141213/01/")

which results in:

    s3://parsed-logs/20141212/01/part-0000
    s3://parsed-logs/20141213/01/part-0000
    s3://parsed-logs/20141213/01/part-0001
    s3://parsed-logs/20141213/02/part-0000
    s3://parsed-logs/20141213/02/part-0001
    s3://parsed-logs/20141213/02/part-0002

When a new day's logs come in, I just process that day's files and put the output under the proper directory (or "key").

THE PROBLEM is that this way I have to create a separate RDD for every single file, which can't take full advantage of Spark's automatic parallel processing. [Right now I'm trying to submit multiple applications, one for each batch of files; a rough sketch of the per-file loop is at the bottom of this post.]

Or would I be better off using Hadoop Streaming for this? Any suggestions?
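For reference, this is roughly what the per-file loop looks like when driven from a single driver program. It's a minimal sketch: the file list and the parse body are placeholders, not my real code.

    from pyspark import SparkContext

    sc = SparkContext(appName="parse-logs")

    def parse(line):
        # placeholder for my real log-parsing function
        return line

    # Placeholder list of (input file, output directory) pairs; the real
    # list would be built from an S3 listing of the new day's keys.
    batches = [
        ("s3://log-collections/sys1/20141213/nginx-part-1.gz",
         "s3://parsed-logs/20141213/01/"),
        ("s3://log-collections/sys1/20141213/nginx-part-2.gz",
         "s3://parsed-logs/20141213/02/"),
    ]

    # One RDD (and one save job) per file: each job is parallelised across
    # the cluster, but the jobs themselves run one after another.
    for src, dst in batches:
        sc.textFile(src).map(parse).saveAsTextFile(dst)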