I have a set of log files I would like to read into an RDD. These files are all gzip-compressed (.gz) and their filenames are date-stamped. The source of these files is the page view statistics data for Wikipedia:
http://dumps.wikimedia.org/other/pagecounts-raw/

The file names look like this:

pagecounts-20090501-000000.gz
pagecounts-20090501-010000.gz
pagecounts-20090501-020000.gz

What I would like to do is read in all such files in a directory and prepend the date from the filename (e.g. 20090501) to each row of the resulting RDD.

I first thought of using *sc.wholeTextFiles(..)* instead of *sc.textFile(..)*, since it creates a PairRDD whose key is the file name including its path, but *sc.wholeTextFiles()* doesn't handle compressed .gz files.

Any suggestions would be welcome.
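
One possible approach, sketched here in Python under a few assumptions: read each file separately with sc.textFile (which decompresses .gz input transparently), tag every row with the date parsed from that file's name, and union the per-file RDDs. The helper names date_from_path and read_with_dates are hypothetical, the list of paths is assumed to be known in advance (for example from a directory listing), and a single space is assumed to be an acceptable separator between the date and the row.

```python
import re

# Matches names such as pagecounts-20090501-010000.gz and captures the date part.
DATE_RE = re.compile(r"pagecounts-(\d{8})-\d{6}\.gz$")


def date_from_path(path):
    """Return the YYYYMMDD date embedded in a pagecounts file path."""
    match = DATE_RE.search(path)
    if match is None:
        raise ValueError("not a pagecounts file: %r" % path)
    return match.group(1)


def read_with_dates(sc, paths):
    """Build one RDD of 'date row' strings from a list of .gz paths.

    `sc` is an existing SparkContext; `paths` is a list of file paths
    (both are assumptions of this sketch, not part of the original question).
    """
    rdds = [
        # Bind the date as a default argument so each lambda keeps its own value.
        sc.textFile(p).map(lambda line, d=date_from_path(p): d + " " + line)
        for p in paths
    ]
    return sc.union(rdds)
```

This creates one RDD per input file, so with a very large number of files the union may become slow to plan; in that case grouping the files by day and reading each group with a glob pattern may scale better.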