Exactly, the fields between the "!!" markers form a customized (key, value) data
structure. So, newAPIHadoopFile may be the best practice now. For this specific
format, changing the delimiter from the default "\n" to "!!\n" is the cheapest
fix, but that is only supported out of the box in Hadoop 2.x; in Hadoop 1.x it
can be done by writing a custom InputFormat/RecordReader.
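Roughly like this (a minimal, untested sketch; it assumes Hadoop 2.x, where the
new-API TextInputFormat honors the textinputformat.record.delimiter key):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the existing Hadoop config and override the record delimiter, so each
// (key, value) pair carries one whole "!!"-delimited record instead of one line.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "!!\n")

val records = sc.newAPIHadoopFile("path", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString.trim } // the key is just the byte offset
  .filter(_.nonEmpty) // drops the empty record before the leading "!!"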
Nope.
My input file's format is:
!!
string1
string2
!!
string3
string4
sc.textFile("path) will return RDD("!!", "string1", "string2", "!!",
"string3", "string4")
what we need now is to transform this rdd to RDD("string1", "string2",
"string3", "string4")
your solution may not handle this.
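(If that flat list is literally all that is needed, a minimal sketch, assuming
the "!!" lines carry no data themselves, would be:

sc.textFile("path").filter(_.trim != "!!")

though a plain filter loses the grouping of lines into records, and preserving
that grouping is exactly what the custom record delimiter buys you.)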
--
Yes, I can implement it like:
sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)
But the reduce operation is expensive! (reduce(_ + _) concatenates the whole
file into a single String on the driver, so the result is a local Array, not an
RDD.) I tested these two methods on a 6G file, where the only operation on the
created RDD is take(10).foreach(println); the method using newAPIHadoopFile only
take...
The value in each (key, value) pair that textFile reads via TextInputFormat is
exactly one line of the input.
But what I want is the fields between two "!!" markers; hope this makes sense.
--
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc
Hi, all:
I have a Hadoop file containing fields separated by "!!", like below:
!!
field1
key1 value1
key2 value2
!!
field2
key3 value3
key4 value4
!!
I want to read the file into (key, value) pairs with TextInputFormat, specifying
the delimiter as "!!".
First, I tried the following code:
val hadoopConf = new Configuration()
hadoopConf.set("textinputformat.record.delimiter", "!!")
Here is a tutorial on how to customize your own file format in Hadoop:
https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Once you have your own file format, you can use it the same way as
TextInputFormat in Spark, as you have done in this post.
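For example (just a sketch; MyRecordInputFormat is a placeholder for whatever
InputFormat you end up building from the tutorial):

import org.apache.hadoop.io.{LongWritable, Text}

// The tutorial builds an old-API (org.apache.hadoop.mapred) InputFormat, which
// plugs into sc.hadoopFile; a new-API (org.apache.hadoop.mapreduce) one would
// go through sc.newAPIHadoopFile instead.
val rdd = sc.hadoopFile("path", classOf[MyRecordInputFormat],
  classOf[LongWritable], classOf[Text])
val values = rdd.map { case (_, text) => text.toString }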