Read a TextFile (1 record contains 4 lines) into an RDD

2014-10-25 Thread Parthus
Hi,

It might be a naive question, but I still wish that somebody could help me
handle it.

I have a text file in which every 4 lines represent one record. Since the
SparkContext.textFile() API treats each line as a record, it does not fit my
case. I know that the SparkContext.hadoopFile and newAPIHadoopFile APIs can
read a file in an arbitrary format, but I do not know how to use them. I think
there must be some API that can easily solve this problem, but I have not been
able to find it online.
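
One possible sketch of grouping every 4 consecutive lines into a record without
hadoopFile, assuming sc.textFile preserves line order and using a hypothetical
input path:

// Sketch: group every 4 consecutive lines into one record without a custom InputFormat.
// "hdfs:///path/to/records.txt" is a placeholder path.
val lines = sc.textFile("hdfs:///path/to/records.txt")
val records = lines
  .zipWithIndex()                                          // (line, global line index)
  .map { case (line, idx) => (idx / 4, (idx % 4, line)) }  // key = record number
  .groupByKey()                                            // collect the 4 lines of each record
  .map { case (_, parts) =>
    parts.toSeq.sortBy(_._1).map(_._2).mkString("\n")      // restore line order within the record
  }

This trades a shuffle for not having to write an InputFormat; if the line count
is not a multiple of 4, the last record simply ends up shorter.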

Would it be possible for somebody to show me how to use the API? I run Spark
on Hadoop 1.2.1 rather than Hadoop 2.x. Ideally I would like a few lines of
code that actually work.

Thanks very much.







Re: Read a TextFile (1 record contains 4 lines) into an RDD

2014-10-25 Thread Xiangrui Meng
If your file is not very large, try

sc.wholeTextFiles(...).values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n")))

-Xiangrui
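
Expanded with comments, the same approach looks roughly like this (a sketch;
the directory path is a placeholder, and it assumes each file fits in memory,
since wholeTextFiles loads a whole file as a single string):

// Sketch: one file = one string, split back into lines, then group every 4 lines.
val records =
  sc.wholeTextFiles("hdfs:///path/to/input")   // RDD[(fileName, fileContent)]; placeholder path
    .values                                    // keep only the file contents
    .flatMap { content =>
      content.split("\n")                      // back to individual lines
        .grouped(4)                            // every 4 consecutive lines
        .map(_.mkString("\n"))                 // re-join them into one record
    }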



