Hi all,

I have a Hadoop file containing fields separated by "!!", like below:

    !! field1 key1 value1 key2 value2 !! field2 key3 value3 key4 value4 !!
I want to read the file into (key, value) pairs with TextInputFormat, specifying "!!" as the record delimiter. First, I tried the following code:

    val hadoopConf = new Configuration()
    hadoopConf.set("textinputformat.record.delimiter", "!!\n")
    val path = args(0)
    val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)
    rdd.take(3).foreach(println)

Far from what I expected, the result is:

    (120,)
    (120,)
    (120,)

According to my experiments, 120 is the byte offset of the last field separated by "!!". After digging into the Spark source code, I found that "textFile" is implemented as:

    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
      classOf[Text], minPartitions).map(pair => pair._2.toString).setName(path)

So I modified my initial code accordingly (the change is the added map at the end):

    val hadoopConf = new Configuration()
    hadoopConf.set("textinputformat.record.delimiter", "!!\n")
    val path = args(0)
    val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)
      .map(pair => pair._2.toString)
    rdd.take(3).foreach(println)

Then the results are:

    field1 key1 value1 key2 value2
    field2 ....

as expected. I'm confused by the first snippet's behavior and hope you can offer an explanation. Thanks!

-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
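P.S. One plausible cause (my own guess, not confirmed anywhere above): Hadoop RecordReaders reuse a single mutable Text instance for every record, so collecting the raw Writables with take() yields several references to the same object, all showing whatever was read last. Copying the value out with .map(pair => pair._2.toString) avoids this. The plain-Scala sketch below mimics that reuse with a hypothetical FakeText class; no Spark or Hadoop is involved.

```scala
// Hypothetical sketch (plain Scala, no Spark/Hadoop): mimics how a
// Hadoop RecordReader reuses one mutable Text buffer for all records.
class FakeText(var value: String = "") {
  def set(v: String): Unit = { value = v }
  override def toString: String = value
}

object ReuseDemo {
  val records = Seq("field1", "field2", "field3")

  // Collecting the reused object itself: every element of the result
  // is the SAME instance, so all of them show the last record read.
  def refsDemo(): Seq[String] = {
    val reused = new FakeText()
    val refs = records.map { r => reused.set(r); reused }
    refs.map(_.toString)
  }

  // Copying the value out (like .map(pair => pair._2.toString))
  // materializes each record before the buffer is overwritten.
  def copiesDemo(): Seq[String] = {
    val reused = new FakeText()
    records.map { r => reused.set(r); reused.toString }
  }
}
```

Under this guess, refsDemo() corresponds to my first snippet (every element shows the last record) and copiesDemo() to the fixed one.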