You can use sc.newAPIHadoopFile and pass in your own InputFormat and RecordReader to read the compressed .gz files the way your use case needs.
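In fact, something along these lines may already be enough without a custom RecordReader, since the new-API TextInputFormat decompresses .gz transparently and the input split carries the filename. This is an untested sketch, and the path is just a placeholder:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD

// Placeholder path - point this at wherever the dumps are stored.
val path = "hdfs:///wikistats/pagecounts-*.gz"

// TextInputFormat (new API) handles .gz via the compression codecs,
// at the cost of one non-splittable partition per file.
val raw = sc.newAPIHadoopFile(path,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

// Recover the filename from each partition's input split and
// prepend the date stamp to every line.
val datedLines = raw.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    // pagecounts-20090501-000000.gz -> 20090501
    val date = split.asInstanceOf[FileSplit].getPath.getName.split("-")(1)
    iter.map { case (_, line) => s"$date ${line.toString}" }
  }

If you do need whole-file semantics and a custom InputFormat/RecordReader, the classes below are a good template to copy from.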
For a start, you can look at:

- the wholeTextFiles implementation <https://github.com/apache/spark/blob/ad1503f92e1f6e960a24f9f5d36b1735d1f5073a/core/src/main/scala/org/apache/spark/SparkContext.scala#L839>
- WholeTextFileInputFormat <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala>
- WholeTextFileRecordReader <https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala>

Thanks
Best Regards

On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony <femib...@gmail.com> wrote:

> I have a set of log files I would like to read into an RDD. These files
> are all compressed (.gz), and the filenames are date-stamped. The source
> of these files is the page view statistics data for Wikipedia:
>
> http://dumps.wikimedia.org/other/pagecounts-raw/
>
> The file names look like this:
>
> pagecounts-20090501-000000.gz
> pagecounts-20090501-010000.gz
> pagecounts-20090501-020000.gz
>
> What I would like to do is read in all such files in a directory and
> prepend the date from the filename (e.g. 20090501) to each row of the
> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead
> of *sc.textFile(..)*, which creates a PairRDD with the key being the
> file name with a path, but *sc.wholeTextFiles()* doesn't handle
> compressed .gz files.
>
> Any suggestions would be welcome.
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.