Thanks, I'll take a look.

On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You can use sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader, which will read the compressed .gz files to suit your use
> case. As a starting point, you can look at:
>
> - wholeTextFile implementation
> <https://github.com/apache/spark/blob/ad1503f92e1f6e960a24f9f5d36b1735d1f5073a/core/src/main/scala/org/apache/spark/SparkContext.scala#L839>
> - WholeTextFileInputFormat
> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala>
> - WholeTextFileRecordReader
> <https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala>
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony <femib...@gmail.com> wrote:
>
>>
>>
>> I have a set of log files I would like to read into an RDD. These
>> files are all gzip-compressed (.gz), and the filenames are date-stamped.
>> The source of these files is the page view statistics data for Wikipedia:
>>
>> http://dumps.wikimedia.org/other/pagecounts-raw/
>>
>> The file names look like this:
>>
>> pagecounts-20090501-000000.gz
>> pagecounts-20090501-010000.gz
>> pagecounts-20090501-020000.gz
>>
>> What I would like to do is read in all such files in a directory and
>> prepend the date from the filename (e.g. 20090501) to each row of the
>> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead
>> of *sc.textFile(..)*, since it creates a PairRDD with the key being the
>> file name with its path, but *sc.wholeTextFiles()* doesn't handle
>> compressed .gz files.
>>
>> Any suggestions would be welcome.
>>
>> --
>> http://www.femibyte.com/twiki5/bin/view/Tech/
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>

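For reference, here is a minimal PySpark sketch of one way to get the date from the filename onto each row. It is not the sc.newAPIHadoopFile route suggested above: it swaps in sc.binaryFiles, which yields (path, bytes) pairs much like sc.wholeTextFiles yields (path, text) pairs, and does the gunzip step in plain Python. The helper names (date_from_path, tag_lines, dated_rows) and the filename regex are my own assumptions based on the example filenames in the question, and it assumes Python 3 with an existing SparkContext passed in as sc.

```python
import gzip
import os
import re

# Assumed filename pattern, based on the examples in this thread:
# pagecounts-YYYYMMDD-HHMMSS.gz
_NAME_RE = re.compile(r"pagecounts-(\d{8})-\d{6}\.gz$")

# Every gzip stream starts with these two magic bytes.
_GZIP_MAGIC = b"\x1f\x8b"


def date_from_path(path):
    """Return the YYYYMMDD part of a pagecounts file path, or None."""
    match = _NAME_RE.search(os.path.basename(path))
    return match.group(1) if match else None


def tag_lines(path_and_bytes):
    """Turn one (path, file bytes) pair into a list of 'date line' strings."""
    path, content = path_and_bytes
    date = date_from_path(path)
    if date is None:
        return []  # skip files that do not match the assumed pattern
    if content[:2] == _GZIP_MAGIC:
        content = gzip.decompress(content)
    text = content.decode("utf-8", errors="replace")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    return [date + " " + line for line in lines]


def dated_rows(sc, directory):
    """Build an RDD of date-prefixed rows from every .gz file in directory.

    sc is an existing SparkContext (hypothetical here, not created in
    this sketch); binaryFiles yields one (path, bytes) record per file.
    """
    return sc.binaryFiles(directory + "/*.gz").flatMap(tag_lines)
```

Usage would be something like dated_rows(sc, "/data/pagecounts").take(3), where the directory is a placeholder. The trade-off is that each file is held in memory as a single record before it is split into rows, so this only suits files that fit comfortably in an executor's memory.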

