Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Exactly: the fields between "!!" form a customized (key, value) data structure, so newAPIHadoopFile may be the best practice for now. For this specific format, changing the delimiter from the default "\n" to "!!\n" is the cheapest option, but that is only possible in hadoop2.x; in hadoop1.x, this can be done b
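
A minimal sketch of that hadoop2.x delimiter trick (the path and variable names are illustrative; textinputformat.record.delimiter is honored by the hadoop2.x LineRecordReader, which is why, as noted above, this needs hadoop2.x):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // Split records on "!!\n" instead of the default "\n"
    conf.set("textinputformat.record.delimiter", "!!\n")

    val records = sc.newAPIHadoopFile("hdfs:///path/to/input",  // illustrative path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(pair => pair._2.toString.trim)  // copy the reused Text into a String
      .filter(_.nonEmpty)                  // drop the empty record before the first "!!"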

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Nope. My input file's format is:

!!
string1
string2
!!
string3
string4

sc.textFile("path") will return RDD("!!", "string1", "string2", "!!", "string3", "string4"). What we need now is to transform this RDD into RDD("string1", "string2", "string3", "string4"). Your solution may not handle this.
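
For this flat example, a line-level filter over the textFile RDD would already yield the desired elements, though it also flattens away the grouping between consecutive "!!" markers, which matters once each block is a (key, value) structure; a sketch:

    // Drops the "!!" marker lines; note the record boundaries are lost,
    // so fields from different "!!" blocks end up indistinguishable.
    val tokens = sc.textFile("path")
      .map(_.trim)
      .filter(line => line.nonEmpty && line != "!!")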

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Yes, I can implement it like: sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0) But the reduce operation is expensive! I tested these two methods on a 6G file; the only operation on the created RDD is take(10).foreach(println), and the method using newAPIHadoopFile only take
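
Spelled out, that workaround looks like the sketch below. Note that reduce(_ + _) funnels the entire file to the driver as a single String (hence the cost), and because textFile strips newlines, adjacent fields would run together unless a separator is re-inserted, an assumption added here:

    // Result is a local Array[String] on the driver, not a distributed RDD.
    val fields: Array[String] = sc.textFile("path")
      .reduce(_ + "\n" + _)  // concatenate every line on the driver (expensive)
      .split("!!")
      .map(_.trim)
      .filter(_.length > 0)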

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
The value in the (key, value) pair returned by textFile is exactly one line of the input, but what I want is the field between the two "!!" markers; hope this makes sense. - Senior in Tsinghua Univ. github: http://www.github.com/uronce-cc

Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Hi, all: I have a hadoop file containing fields separated by "!!", like below: !! field1 key1 value1 key2 value2 !! field2 key3 value3 key4 value4 !! I want to read the file into (key, value) pairs with TextInputFormat, specifying the delimiter as "!!". First, I tried the following code: val hadoopConf = new
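
The quoted code is cut off in the archive; a plausible reconstruction (names illustrative, not the poster's exact code), together with the usual source of confusion around newAPIHadoopFile — the RecordReader reuses one Text instance across records, so materializing the pairs without copying can show a single repeated value:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("textinputformat.record.delimiter", "!!")

    val raw = sc.newAPIHadoopFile("path", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)

    // Pitfall: raw.collect() may print one record repeated, because Hadoop
    // recycles the same Text object. Convert to String before materializing:
    val records = raw.map(_._2.toString.trim).filter(_.nonEmpty)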

Re: Hadoop Input Format - newAPIHadoopFile

2014-07-28 Thread chang cheng
Here is a tutorial on how to write your own file format in hadoop: https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat Once you have your own file format, you can use it the same way as TextInputFormat in Spark, as you have done in this post.
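
A skeleton of that approach (class and variable names are hypothetical, not from the post): a custom format extending the new-API FileInputFormat plugs into newAPIHadoopFile exactly like TextInputFormat does.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

    // Hypothetical custom format; a real one would return a RecordReader
    // that splits on whatever record structure the file actually uses.
    class MyInputFormat extends FileInputFormat[LongWritable, Text] {
      override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
          : RecordReader[LongWritable, Text] = new LineRecordReader()
    }

    val rdd = sc.newAPIHadoopFile("path", classOf[MyInputFormat],
      classOf[LongWritable], classOf[Text])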