Yes, I can implement it like this:

sc.textFile("path").reduce(_ + _).split("!!").filter(x => x.trim.length > 0)

But the reduce operation is expensive! I tested the two methods on a 6 GB file, 
where the only operation performed on the resulting RDD was 
take(10).foreach(println). The method using newAPIHadoopFile takes only about 
2 s, while the code above blocks for more than a minute, presumably because the 
reduce concatenates the entire file into a single string on the driver before 
anything can be split.
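
For reference, here is a minimal sketch of one way the newAPIHadoopFile 
variant can be written, assuming Hadoop's textinputformat.record.delimiter 
setting is used to split records on "!!" (the "path" argument is a 
placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Treat "!!" as the record delimiter instead of the newline, so each
// (key, value) pair already holds one whole field between two "!!".
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!!")

val fields = sc
  .newAPIHadoopFile("path", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString } // keep only the record text
  .filter(_.trim.nonEmpty)

Because the delimiting happens inside each input split, no data is pulled back 
to the driver, which would explain the large speed difference.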

Could you post a code snippet illustrating your idea? I haven't come up with a 
simple sequence of map/filter operations on the RDD returned by textFile. Thanks!
================================
常铖 cheng chang
Computer Science Dept. Tsinghua Univ.
Mobile Phone: 13681572414
WeChat ID: cccjcl
================================

On July 28, 2014, at 5:40:21 PM, chang cheng (myai...@gmail.com) wrote:

The value in each (key, value) pair returned by textFile is exactly one line 
of the input.

But what I want is the field between two "!!" delimiters; hope this makes sense.
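
(For concreteness, a made-up toy example of the splitting I want:)

// Made-up sample: fields separated by "!!" may span line breaks.
val raw = "first field\ncontinued!!second field!!  !!third"
val wanted = raw.split("!!").filter(_.trim.nonEmpty)
// wanted: Array("first field\ncontinued", "second field", "third")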



-----  
Senior at Tsinghua Univ.
github: http://www.github.com/uronce-cc  
--  
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764p10768.html