Re: Appending filename information to RDD initialized by sc.textFile

2016-01-20 Thread Femi Anthony
Thanks, I'll take a look.

On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das wrote:
> You can use the sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader which will read the compressed .gz files to your usecase. For
> a start, you can look at the:
>
> -

Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Femi Anthony
I have a set of log files I would like to read into an RDD. These files are all gzip-compressed (.gz), and the filenames are date-stamped. The source of these files is the page view statistics data for Wikipedia: http://dumps.wikimedia.org/other/pagecounts-raw/ The file names look like this:

Re: Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Akhil Das
You can use sc.newAPIHadoopFile and pass your own InputFormat and RecordReader, which will read the compressed .gz files in a way that fits your use case. For a start, you can look at the: - wholeTextFile implementation
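
For readers who want something simpler than a custom InputFormat, here is a minimal PySpark sketch of the same idea, built on sc.wholeTextFiles, which yields (path, content) pairs. It is not the approach described above, just an illustration of carrying the filename along with each line. The helper names tag_lines and load_with_filenames are hypothetical, and two assumptions apply: each file must fit in memory on a single executor, and whether .gz input is decompressed transparently should be verified on your Spark version.

```python
import posixpath
from urllib.parse import urlparse


def tag_lines(path_and_content):
    """Turn one (path, content) pair into a list of (filename, line) pairs.

    Hypothetical helper: the input shape matches what sc.wholeTextFiles
    yields, i.e. the full file URI and the whole file content as one string.
    """
    path, content = path_and_content
    # Strip the scheme and directories, keeping only the file name,
    # e.g. "hdfs://host/logs/a.gz" -> "a.gz".
    name = posixpath.basename(urlparse(path).path)
    return [(name, line) for line in content.splitlines()]


def load_with_filenames(sc, path_glob):
    """Return an RDD of (filename, line) pairs for every file matching path_glob.

    Assumes sc is an existing SparkContext. Each file is read whole, so this
    only suits files that fit comfortably in executor memory.
    """
    return sc.wholeTextFiles(path_glob).flatMap(tag_lines)
```

With a running SparkContext, load_with_filenames(sc, "/path/to/logs/*.gz") would give an RDD whose records keep the source filename, from which the date stamp can then be parsed. For very large files, the custom InputFormat and RecordReader route suggested above avoids loading each file in one piece.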