Say there are some logs:
s3://log-collections/sys1/20141212/nginx.gz
s3://log-collections/sys1/20141213/nginx-part-1.gz
s3://log-collections/sys1/20141213/nginx-part-2.gz
I have a function that parses the logs for later analysis.
I want to parse all of the files, so I do this:
logs = sc.textFile('s3://log-collections/sys1/')
logs.map(parse).saveAsTextFile('s3://parsed-logs/')
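(For concreteness, `parse` here is just a stand-in; my real parser does more. A minimal sketch:)

```python
def parse(line):
    # Simplified stand-in for the real parser: take the first three
    # space-separated fields of an nginx access-log line and re-emit
    # them tab-separated for later analysis.
    fields = line.split(' ', 3)
    return '\t'.join(fields[:3])
```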
BUT this destroys the date-separated naming scheme, resulting in:
s3://parsed-logs/part-0000
s3://parsed-logs/part-0001
...
And the worst part is what happens when a new day's logs arrive:
rdd.saveAsTextFile apparently can't just append the new day's records
to the existing output. So I create an RDD for every single file, parse
it, and save it under the name I want, like this:
one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
one.map(parse).saveAsTextFile("s3://parsed-logs/20141213/01/")
which results in:
s3://parsed-logs/20141212/01/part-0000
s3://parsed-logs/20141213/01/part-0000
s3://parsed-logs/20141213/01/part-0001
s3://parsed-logs/20141213/02/part-0000
s3://parsed-logs/20141213/02/part-0001
s3://parsed-logs/20141213/02/part-0002
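The mapping from each source file to the output prefix I want could be pinned down with a small helper (output_prefix is hypothetical, just to make the layout above explicit; the index is the file's 1-based position within its day):

```python
def output_prefix(input_key, index, out_bucket='parsed-logs'):
    # Hypothetical helper: map a source key like
    # 's3://log-collections/sys1/20141213/nginx-part-1.gz'
    # plus its 1-based position within that day to an output prefix
    # like 's3://parsed-logs/20141213/01/'.
    date = input_key.rstrip('/').split('/')[-2]
    return 's3://{}/{}/{:02d}/'.format(out_bucket, date, index)
```

With that, the per-file save becomes one.map(parse).saveAsTextFile(output_prefix(key, 1)).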
And when a new day's logs come in, I just process that day's files and
put the output under the proper directory (or "key").
THE PROBLEM is that this way I have to create a separate RDD for every
single file, which can't take advantage of Spark's automatic parallel
processing. (I'm currently trying to submit multiple applications, one
per batch of files.)
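One alternative I'm considering instead of separate applications: Spark's scheduler is, I believe, thread-safe, so a single application could launch several per-file jobs concurrently from a driver-side thread pool. A sketch (process_file is a placeholder that would wrap the sc.textFile(...).map(parse).saveAsTextFile(...) steps):

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_parallel(keys, process_file, workers=4):
    # Run process_file over each S3 key from a pool of driver threads.
    # Each call would submit its own Spark job, so the per-file jobs
    # run concurrently within one application instead of many.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, keys))
```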
Or would I be better off using Hadoop Streaming for this?
Any suggestions?