Say there are some logs:
s3://log-collections/sys1/20141212/nginx.gz
s3://log-collections/sys1/20141213/nginx-part-1.gz
s3://log-collections/sys1/20141213/nginx-part-2.gz
I have a function that parses the logs for later analysis.
I want to parse all of the files, so I do this:
logs = sc.textFile('s3://log-collections/sys1/')
logs.map(parse).saveAsTextFile('s3://parsed-logs/')
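(For concreteness, `parse` here is just a stand-in; my real parser does more. A minimal sketch:)

```python
def parse(line):
    # Simplified stand-in for the real parser: take the first three
    # space-separated fields of an nginx access-log line and re-emit
    # them tab-separated for later analysis.
    fields = line.split(' ', 3)
    return '\t'.join(fields[:3])
```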
BUT this destroys the date-separated naming scheme, resulting in:
s3://parsed-logs/part-0000
s3://parsed-logs/part-0001
...
And the worst part is what happens when a new day's logs arrive:
rdd.saveAsTextFile apparently can't just append the new day's records
to the existing output. So I create an RDD for every single file, parse
it, and save it under the name I want, like this:
one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
one.map(parse).saveAsTextFile("s3://parsed-logs/20141213/01/")
which results in:
s3://parsed-logs/20141212/01/part-0000
s3://parsed-logs/20141213/01/part-0000
s3://parsed-logs/20141213/01/part-0001
s3://parsed-logs/20141213/02/part-0000
s3://parsed-logs/20141213/02/part-0001
s3://parsed-logs/20141213/02/part-0002
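The mapping from each source file to the output prefix I want could be pinned down with a small helper (output_prefix is hypothetical, just to make the layout above explicit; the index is the file's 1-based position within its day):

```python
def output_prefix(input_key, index, out_bucket='parsed-logs'):
    # Hypothetical helper: map a source key like
    # 's3://log-collections/sys1/20141213/nginx-part-1.gz'
    # plus its 1-based position within that day to an output prefix
    # like 's3://parsed-logs/20141213/01/'.
    date = input_key.rstrip('/').split('/')[-2]
    return 's3://{}/{}/{:02d}/'.format(out_bucket, date, index)
```

With that, the per-file save becomes one.map(parse).saveAsTextFile(output_prefix(key, 1)).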
And when a new day's logs come in, I just process that day's files and
put the output under the proper directory (or "key").
THE PROBLEM is that this way I have to create a separate RDD for every
single file, which can't take advantage of Spark's automatic parallel
processing. (I'm currently trying to submit multiple applications, one
per batch of files.)
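One alternative I'm considering instead of separate applications: Spark's scheduler is, I believe, thread-safe, so a single application could launch several per-file jobs concurrently from a driver-side thread pool. A sketch (process_file is a placeholder that would wrap the sc.textFile(...).map(parse).saveAsTextFile(...) steps):

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_parallel(keys, process_file, workers=4):
    # Run process_file over each S3 key from a pool of driver threads.
    # Each call would submit its own Spark job, so the per-file jobs
    # run concurrently within one application instead of many.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, keys))
```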
Or would I be better off using Hadoop Streaming for this?
Any suggestions?