You can use sc.newAPIHadoopFile and pass your own InputFormat and
RecordReader, which can read the compressed .gz files the way your use
case needs. For a start, you can look at the following (a rough sketch of
the approach follows the list):

- wholeTextFiles implementation
<https://github.com/apache/spark/blob/ad1503f92e1f6e960a24f9f5d36b1735d1f5073a/core/src/main/scala/org/apache/spark/SparkContext.scala#L839>
- WholeTextFileInputFormat
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala>
- WholeTextFileRecordReader
<https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala>
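
Here is a rough, untested sketch of that approach (the
GzipWholeFileInputFormat / GzipWholeFileRecordReader names are mine, not
Spark's): a new-API FileInputFormat that marks files as non-splittable and
decompresses through Hadoop's codec factory, so each .gz file comes back
as a single (path, contents) record:

import java.io.ByteArrayOutputStream
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IOUtils, Text}
import org.apache.hadoop.io.compress.CompressionCodecFactory
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Emits one (file path, decompressed file contents) pair per input file.
class GzipWholeFileInputFormat extends FileInputFormat[Text, Text] {
  // .gz is not splittable, and we want the whole file as one record anyway.
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[Text, Text] = new GzipWholeFileRecordReader
}

class GzipWholeFileRecordReader extends RecordReader[Text, Text] {
  private var split: FileSplit = _
  private var context: TaskAttemptContext = _
  private var key: Text = _
  private var value: Text = _
  private var processed = false

  override def initialize(s: InputSplit, c: TaskAttemptContext): Unit = {
    split = s.asInstanceOf[FileSplit]
    context = c
  }

  override def nextKeyValue(): Boolean = {
    if (processed) return false
    val path = split.getPath
    val conf = context.getConfiguration
    val fs = path.getFileSystem(conf)
    // Pick a codec from the file extension; getCodec returns null for plain files.
    val codec = new CompressionCodecFactory(conf).getCodec(path)
    val raw = fs.open(path)
    val in = if (codec != null) codec.createInputStream(raw) else raw
    val out = new ByteArrayOutputStream()
    try {
      IOUtils.copyBytes(in, out, conf, false)
    } finally {
      in.close()
    }
    key = new Text(path.toString)      // full path, so the date stamp is recoverable
    value = new Text(out.toByteArray)  // decompressed contents
    processed = true
    true
  }

  override def getCurrentKey: Text = key
  override def getCurrentValue: Text = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = {}
}

You could then read the files and prepend the date from the file name
roughly like this (the path and the regex are just placeholders for your
layout):

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/pagecounts/pagecounts-*.gz",
  classOf[GzipWholeFileInputFormat],
  classOf[Text],
  classOf[Text])

val withDate = rdd.flatMap { case (file, contents) =>
  // pagecounts-20090501-000000.gz -> 20090501
  val date = "pagecounts-(\\d{8})".r
    .findFirstMatchIn(file.toString).map(_.group(1)).getOrElse("unknown")
  contents.toString.split("\n").map(line => date + " " + line)
}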

Thanks
Best Regards

On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony <femib...@gmail.com> wrote:

>
>
> I have a set of log files I would like to read into an RDD. These files
> are all compressed .gz files and the filenames are date stamped. The source
> of these files is the page view statistics data for Wikipedia:
>
> http://dumps.wikimedia.org/other/pagecounts-raw/
>
> The file names look like this:
>
> pagecounts-20090501-000000.gz
> pagecounts-20090501-010000.gz
> pagecounts-20090501-020000.gz
>
> What I would like to do is read in all such files in a directory and
> prepend the date from the filename (e.g. 20090501) to each row of the
> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
> *sc.textFile(..)*, which creates a PairRDD with the key being the file
> name with its path, but *sc.wholeTextFiles()* doesn't handle compressed .gz
> files.
>
> Any suggestions would be welcome.
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>
