Anwar,
Will try this as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.
I am now thinking that I could start with an RDD of file paths and map it
into (path, content) pairs, provided I can read a file from within a task.
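That idea might look something like the following untested sketch. It assumes the workers can reach HDFS, and builds a fresh Hadoop `Configuration` inside each task (the one on `SparkContext` is not serializable):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Sketch: start from an RDD of path strings and read each file
// inside the task that processes it, yielding (path, content) pairs.
def pathsToContents(sc: SparkContext, paths: Seq[String]) =
  sc.parallelize(paths).map { p =>
    // Fresh Configuration per task; SparkContext's conf is not serializable.
    val fs = new Path(p).getFileSystem(new Configuration())
    val in = fs.open(new Path(p))
    val content =
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString
      finally in.close()
    (p, content)
  }
```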
Thank you,
Oleg
On 1 June 2014 18:41, Anwar Rizal anriza...@gmail.com wrote:
I presume that you need to have access to the path of each file you are
reading.
I don't know whether there is a good way to do that for HDFS; I read the
files myself, with something like:
import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

// Pair every line of every file under inputPath with its file's URI.
def openWithPath(inputPath: String, sc: SparkContext) = {
  val path = new Path(inputPath)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)   // non-recursive listing
  val paths = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One RDD per file, each record tagged with that file's URI ...
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  // ... then union them into a single RDD[(URI, String)].
  withPaths.reduce { _ ++ _ }
}
...
I would be interested if there is a better way to do the same thing ...
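(One possibility, for what it's worth: Spark 1.0 added SparkContext.wholeTextFiles, which produces (path, content) pairs directly, without listing the directory by hand. A minimal sketch, with a hypothetical directory; note it reads each file whole, so it suits many small files rather than a few huge ones:)

```scala
// Each element is (fully-qualified file path, entire file contents).
val byPath: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("hdfs:///data/input")   // hypothetical directory
```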
Cheers,
a:
On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Could you provide an example of what you mean?
I know it's possible to create an RDD from a path with wildcards, like in
the subject.
For example, sc.textFile("s3n://bucket/2014-??-??/*.gz"). You can also
provide a comma-delimited list of paths.
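For instance, combining wildcards and a comma-delimited list in one call (bucket name and dates are made up for illustration):

```scala
// One RDD over several globbed S3 prefixes at once.
val logs = sc.textFile(
  "s3n://bucket/2014-01-*/*.gz,s3n://bucket/2014-02-*/*.gz")
```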
Nick
On Sunday, 1 June 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:
Hi All,
Is it possible to create an RDD from a directory tree of the following
form?
RDD[(PATH, Seq[TEXT])]
Thank you,
Oleg
--
Kind regards,
Oleg