Thanks, I'll take a look.
On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das wrote:
> You can use sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader, which will read the compressed .gz files in the way your
> use case requires. For a start, you can look at the:
>
> - wholeTextFiles implementation
I have a set of log files that I would like to read into an RDD. These files
are all gzip-compressed (.gz), and the filenames are date-stamped. The source
of these files is the page view statistics data for Wikipedia:
http://dumps.wikimedia.org/other/pagecounts-raw/
The file names look like this:
You can use sc.newAPIHadoopFile and pass your own InputFormat and
RecordReader, which will read the compressed .gz files in the way your
use case requires. For a start, you can look at the:
- wholeTextFiles implementation
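
If writing a custom InputFormat and RecordReader is more than you need, here is a minimal sketch of a simpler route, assuming PySpark. It is not the newAPIHadoopFile approach described above: it relies on sc.binaryFiles, which returns (path, bytes) pairs much as wholeTextFiles returns (path, text) pairs, so every line can stay tagged with its date-stamped file name. The helper names read_gz_lines and build_rdd are hypothetical, and the sketch assumes each compressed file fits in executor memory, since a file is handled by a single task.

```python
import gzip
import io


def read_gz_lines(name, data):
    """Decompress one .gz file held in memory and yield (name, line) pairs.

    `name` is the file path as reported by Spark; `data` is the raw
    compressed content of that file as bytes.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
        for raw in fh:
            # Undecodable bytes are replaced rather than raising an error.
            yield name, raw.decode("utf-8", errors="replace").rstrip("\n")


def build_rdd(sc, path_glob):
    """Return an RDD of (file name, line) pairs for all files matching path_glob.

    `sc` is an existing SparkContext (assumption: created by the caller).
    """
    # binaryFiles yields one (path, bytes) record per file.
    return sc.binaryFiles(path_glob).flatMap(
        lambda kv: read_gz_lines(kv[0], kv[1])
    )
```

With a live SparkContext this could be called as build_rdd(sc, "/path/to/logs/*.gz"), where the path is a placeholder. The date can then be parsed out of the file name in a later map step, once the actual naming pattern is known.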