Appending filename information to RDD initialized by sc.textFile

Femi Anthony Tue, 19 Jan 2016 10:20:12 -0800

I have a set of log files I would like to read into an RDD. These files
are all gzip-compressed (.gz), and the filenames are date-stamped. The source
of these files is the page view statistics data for Wikipedia:


http://dumps.wikimedia.org/other/pagecounts-raw/

The file names look like this:

pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz

What I would like to do is read in all such files in a directory and
prepend the date from the filename (e.g. 20090501) to each row of the
resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
*sc.textFile(..)*, since it creates a PairRDD whose key is the file name
including its path, but *sc.wholeTextFiles(..)* doesn't handle compressed
.gz files.
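One possible workaround, sketched here in PySpark, is to read each file
separately with *sc.textFile(..)* (which decompresses .gz input based on the
file extension), tag its lines with the date parsed from the file name, and
union the per-file RDDs. The helper names (date_from_filename, prepend_date,
dated_rdd) are hypothetical, the space separator is an arbitrary choice, and
the sketch assumes the files sit in a local directory that the driver can
list with glob and that *sc* is an existing SparkContext. Files on HDFS or S3
would need a different way of listing paths.

```python
import glob
import os
import re

# Matches names such as pagecounts-20090501-000000.gz and captures the date.
DATE_PATTERN = re.compile(r"pagecounts-(\d{8})-\d{6}\.gz$")


def date_from_filename(path):
    """Return the 8-digit date stamp embedded in a pagecounts file name."""
    match = DATE_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: " + path)
    return match.group(1)


def prepend_date(date, line):
    """Prefix one log row with the date, separated by a single space."""
    return date + " " + line


def dated_rdd(sc, directory):
    """Build one RDD over all pagecounts files, each row prefixed by its date.

    Assumes `directory` is a local path and `sc` is a live SparkContext.
    """
    paths = sorted(glob.glob(os.path.join(directory, "pagecounts-*.gz")))
    rdds = []
    for path in paths:
        date = date_from_filename(path)
        # Bind the date as a default argument so every lambda keeps the
        # value for its own file rather than the last one in the loop.
        rdds.append(
            sc.textFile(path).map(lambda line, d=date: prepend_date(d, line))
        )
    # Assumes at least one file matched the pattern.
    return sc.union(rdds)
```

Calling dated_rdd(sc, "/path/to/pagecounts") would then give a single RDD of
strings. With a very large number of files, building one RDD per file may add
noticeable driver overhead, so this is only a starting point.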

Any suggestions would be welcome.

-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
