Thanks, I'll take a look.

On Wed, Jan 20, 2016 at 1:38 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> You can use sc.newAPIHadoopFile and pass your own InputFormat and
> RecordReader that read the compressed .gz files for your use case. For
> a start, you can look at:
>
> - wholeTextFiles implementation
> <https://github.com/apache/spark/blob/ad1503f92e1f6e960a24f9f5d36b1735d1f5073a/core/src/main/scala/org/apache/spark/SparkContext.scala#L839>
> - WholeTextFileInputFormat
> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala>
> - WholeTextFileRecordReader
> <https://github.com/apache/spark/blob/7a375bb87a8df56d9dde0c484e725e5c497a9876/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala>
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony <femib...@gmail.com> wrote:
>
>> I have a set of log files I would like to read into an RDD. These
>> files are all compressed .gz and the filenames are date-stamped. The
>> source of these files is the page view statistics data for Wikipedia:
>>
>> http://dumps.wikimedia.org/other/pagecounts-raw/
>>
>> The file names look like this:
>>
>> pagecounts-20090501-000000.gz
>> pagecounts-20090501-010000.gz
>> pagecounts-20090501-020000.gz
>>
>> What I would like to do is read in all such files in a directory and
>> prepend the date from the filename (e.g. 20090501) to each row of the
>> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead
>> of *sc.textFile(..)*, which creates a PairRDD with the key being the
>> file name with a path, but *sc.wholeTextFiles()* doesn't handle
>> compressed .gz files.
>>
>> Any suggestions would be welcome.
>>
>> --
>> http://www.femibyte.com/twiki5/bin/view/Tech/
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
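For anyone who finds this thread later: below is a rough PySpark sketch of one way to get the date onto each row. It is not the custom InputFormat/RecordReader route suggested above. Instead it swaps in sc.binaryFiles, which returns (path, bytes) pairs, and decompresses each file with Python's gzip module. The helper names (date_from_path, prefix_lines, pagecounts_with_date) and the space used as the separator are my own assumptions, and since each file is read whole into memory it is only reasonable when the individual files are small enough for that.

```python
import gzip
import re

# Matches the date stamp in names like pagecounts-20090501-000000.gz
_DATE_RE = re.compile(r"pagecounts-(\d{8})-\d{6}\.gz$")


def date_from_path(path):
    """Return the YYYYMMDD stamp from a pagecounts file path, or None."""
    match = _DATE_RE.search(path)
    return match.group(1) if match else None


def prefix_lines(path, compressed):
    """Decompress one .gz payload and prepend the file's date to each row."""
    date = date_from_path(path)
    if date is None:
        # Not a pagecounts file: contribute no rows (an assumption of this sketch)
        return []
    text = gzip.decompress(compressed).decode("utf-8", errors="replace")
    return [date + " " + line for line in text.splitlines()]


def pagecounts_with_date(sc, path_glob):
    """Build an RDD of date-prefixed rows; `sc` is an existing SparkContext."""
    # binaryFiles yields one (path, bytes) pair per file
    return sc.binaryFiles(path_glob).flatMap(
        lambda kv: prefix_lines(kv[0], kv[1])
    )
```

With a running context this would be called as something like pagecounts_with_date(sc, "/data/pagecounts/pagecounts-*.gz"), where the directory is a made-up example. The two helpers are plain Python, so they can be checked without Spark.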