I have a set of log files I would like to read into an RDD. These files
are all gzip-compressed (.gz) and the filenames are date stamped. The source
of these files is the page view statistics data for Wikipedia:

http://dumps.wikimedia.org/other/pagecounts-raw/

The file names look like this:

pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz

What I would like to do is read in all such files in a directory and
prepend the date from the filename (e.g. 20090501) to each row of the
resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
*sc.textFile(..)*, since it creates a PairRDD whose key is the file name
including its path, but *sc.wholeTextFiles()* doesn't handle compressed
.gz files.
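
One possible direction, offered only as a rough sketch: read each file as
raw bytes with *sc.binaryFiles(..)*, which yields (path, bytes) pairs, then
decompress and tag the rows in a flatMap. This assumes a PySpark version
that provides binaryFiles, that each hourly file fits in memory within a
single task, and that the rows decode as UTF-8. The names DATE_RE,
date_from_path, tag_lines and dated_rows are made up for illustration, and
the single-space separator between date and row is an arbitrary choice.

```python
import gzip
import re

# Matches names such as pagecounts-20090501-000000.gz and captures the date.
DATE_RE = re.compile(r"pagecounts-(\d{8})-\d{6}\.gz$")


def date_from_path(path):
    """Return the 8-digit date stamp from a pagecounts file path, or None."""
    match = DATE_RE.search(path)
    return match.group(1) if match else None


def tag_lines(path, raw_bytes):
    """Decompress one .gz payload and prepend the file's date to each row."""
    date = date_from_path(path)
    if date is None:
        # Skip files that do not follow the pagecounts naming scheme.
        return []
    # Assumption: rows are UTF-8; undecodable bytes are replaced, not fatal.
    text = gzip.decompress(raw_bytes).decode("utf-8", errors="replace")
    # Split on newlines only and drop empty rows (e.g. after a trailing newline).
    return [date + " " + line for line in text.split("\n") if line]


def dated_rows(sc, directory):
    """Build an RDD of date-prefixed rows from every .gz file in `directory`.

    `sc` is an existing SparkContext; binaryFiles yields (path, bytes) pairs.
    """
    pairs = sc.binaryFiles(directory + "/*.gz")
    return pairs.flatMap(lambda kv: tag_lines(kv[0], kv[1]))
```

With a live SparkContext, something like dated_rows(sc, "/path/to/dir")
would then give one RDD for the whole directory. The trade-off is that a
file is no longer split line by line by Spark itself; each file is
decompressed as a unit inside one task.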

Any suggestions would be welcome.

-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
