Re: sc.textFileGroupByPath(*/*.txt)

2014-06-01 Thread Anwar Rizal
I presume that you need access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS; I list and
read the files myself, something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def openWithPath(inputPath: String, sc: SparkContext): RDD[(URI, String)] = {
  val path = new Path(inputPath)
  val fs   = path.getFileSystem(sc.hadoopConfiguration)
  // List the files directly under inputPath (non-recursive) and collect their URIs.
  val filesIt = fs.listFiles(path, false)
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // Read each file as its own RDD so every line can be tagged with its file's URI.
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce(_ ++ _)
}
...
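Called on a directory (the path below is illustrative), it yields an RDD of
(file URI, line) pairs:

val linesWithOrigin = openWithPath("hdfs:///data/input", sc)
linesWithOrigin.take(3).foreach(println)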

I would be interested if there is a better way to do the same thing ...
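One candidate is SparkContext.wholeTextFiles, added in Spark 1.0, which
returns (path, content) pairs directly. A minimal sketch, assuming the
wildcard path below and that each file is small enough to hold in memory
as a single string:

// Each element is (fullFilePath, entireFileContent).
val byFile = sc.wholeTextFiles("hdfs:///data/*/*.txt")

// For the RDD[(PATH, Seq[TEXT])] shape from the original question,
// split each file's content into its lines:
val grouped = byFile.mapValues(_.split("\n").toSeq)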

Cheers,
a:


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Could you provide an example of what you mean?

 I know it's possible to create an RDD from a path with wildcards, like in
 the subject.

 For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
 provide a comma-delimited list of paths, as in the example below.
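 For instance (bucket and dates illustrative):

   sc.textFile('s3n://bucket/2014-01-??/*.gz,s3n://bucket/2014-02-??/*.gz')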

 Nick

 On Sunday, June 1, 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 Is it possible to create an RDD from a directory tree of the following
 form?

 RDD[(PATH, Seq[TEXT])]

 Thank you,
 Oleg




Re: sc.textFileGroupByPath(*/*.txt)

2014-06-01 Thread Oleg Proudnikov
Anwar,

I will try this, as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.

I am now thinking that I could start with an RDD of file paths and map it
into (path, content) pairs, provided I could read a file on the server.
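
A rough sketch of that idea, assuming each worker can reach the filesystem
with a default Hadoop configuration (the helper and paths below are
illustrative, not tested):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative helper: read one file's entire contents on a worker.
def readFully(pathStr: String): String = {
  val path = new Path(pathStr)
  val fs   = path.getFileSystem(new Configuration())
  val in   = fs.open(path)
  try scala.io.Source.fromInputStream(in).mkString
  finally in.close()
}

val paths = sc.parallelize(Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt"))
val pathAndContent = paths.map(p => (p, readFully(p)))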

Thank you,
Oleg




-- 
Kind regards,

Oleg