Re: Appending filename information to RDD initialized by sc.textFile

2016-01-20 Thread Femi Anthony
Thanks, I'll take a look.

On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das wrote:

> You can use sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader that read the compressed .gz files in the way your use case
> needs. As a starting point, you can look at:
>
> - the wholeTextFiles implementation
> - WholeTextFileInputFormat
> - WholeTextFileRecordReader
>
> Thanks
> Best Regards
>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Femi Anthony
I have a set of log files I would like to read into an RDD. These files
are all compressed .gz files, and the filenames are date-stamped. The source
of these files is the page view statistics data for Wikipedia:

http://dumps.wikimedia.org/other/pagecounts-raw/

The file names look like this:

pagecounts-20090501-00.gz
pagecounts-20090501-01.gz
pagecounts-20090501-02.gz

What I would like to do is read in all such files in a directory and
prepend the date from the filename (e.g. 20090501) to each row of the
resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
*sc.textFile(..)*, since it creates a PairRDD whose key is the file name
with its path, but *sc.wholeTextFiles()* doesn't handle compressed .gz files.
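
To make the desired transformation concrete, here is a small Python sketch
of the per-row step only (it does not read any files). The regular
expression is inferred from the three example names above; the
hdfs:///logs path, the sample row, and the space separator are made up
for illustration.

```python
import re

# Assumed naming scheme, inferred from the examples: pagecounts-YYYYMMDD-HH.gz
PAGECOUNTS_NAME = re.compile(r"pagecounts-(\d{8})-(\d{2})\.gz$")


def date_from_path(path):
    """Return the YYYYMMDD stamp from a pagecounts file name or full path."""
    match = PAGECOUNTS_NAME.search(path)
    if match is None:
        raise ValueError("not a pagecounts file: %r" % path)
    return match.group(1)


def prepend_date(path, row):
    """Prefix one row with the date of the file it came from."""
    return date_from_path(path) + " " + row


# Hypothetical path and row, for illustration only.
print(prepend_date("hdfs:///logs/pagecounts-20090501-00.gz", "en Main_Page 42 1000"))
# prints: 20090501 en Main_Page 42 1000
```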

Any suggestions would be welcome.

-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Akhil Das
You can use sc.newAPIHadoopFile and pass your own InputFormat and
RecordReader that read the compressed .gz files in the way your use case
needs. As a starting point, you can look at:

- the wholeTextFiles implementation
- WholeTextFileInputFormat
- WholeTextFileRecordReader
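
As a lighter-weight illustration of the same idea (keep the file path
next to the file contents), here is a minimal PySpark sketch. Note that it
swaps in sc.binaryFiles, which yields one (path, bytes) pair per file, plus
Python's gzip module, rather than a custom InputFormat and RecordReader
passed to sc.newAPIHadoopFile. It assumes each file fits in the memory of
a single task. The helper names (date_of, dated_lines, read_with_dates) and
the space separator are hypothetical.

```python
import gzip
import os


def date_of(path):
    """'.../pagecounts-20090501-00.gz' -> '20090501' (second dash-separated field)."""
    return os.path.basename(path).split("-")[1]


def dated_lines(path, payload):
    """Decompress one whole .gz payload and prefix each line with the file's date."""
    text = gzip.decompress(payload).decode("utf-8", errors="replace")
    date = date_of(path)
    return [date + " " + line for line in text.splitlines()]


def read_with_dates(sc, path):
    """Build an RDD of date-prefixed rows.

    sc is an existing SparkContext; path is a directory or glob holding the
    pagecounts files (hypothetical, supplied by the caller).
    """
    # binaryFiles gives (full path, file content as bytes), one pair per file.
    files = sc.binaryFiles(path)
    return files.flatMap(lambda kv: dated_lines(kv[0], kv[1]))
```

Because every file is decompressed in one piece inside a single task, this
is probably only reasonable while individual files stay modest in size; a
custom RecordReader as suggested above would stream records instead.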

Thanks
Best Regards

On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony wrote:

>
>
> I have a set of log files I would like to read into an RDD. These files
> are all compressed .gz files, and the filenames are date-stamped. The source
> of these files is the page view statistics data for Wikipedia:
>
> http://dumps.wikimedia.org/other/pagecounts-raw/
>
> The file names look like this:
>
> pagecounts-20090501-00.gz
> pagecounts-20090501-01.gz
> pagecounts-20090501-02.gz
>
> What I would like to do is read in all such files in a directory and
> prepend the date from the filename (e.g. 20090501) to each row of the
> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
> *sc.textFile(..)*, since it creates a PairRDD whose key is the file name
> with its path, but *sc.wholeTextFiles()* doesn't handle compressed .gz
> files.
>
> Any suggestions would be welcome.
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>