There are multiple ways to achieve this:

1. Read the N lines on the driver and then call sc.parallelize(nlines) to create an RDD out of them.
2. Create an RDD with all N+M lines, do a take(N), and then broadcast or parallelize the returned list.
3. Something like this if the file is in HDFS:

    val n_f = (5, file_name)
    val n_lines = sc.parallelize(Array(n_f))
    val n_linesRDD = n_lines.map(n => {
      // Read and return 5 lines (n._1) from the file (n._2)
    })

Thanks
Best Regards

On Thu, Oct 29, 2015 at 9:51 PM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote:

> Hi All,
>
> There is a file with N + M lines, and I need to read the first N
> lines into one RDD.
>
> 1. i) Read all N + M lines as one RDD, ii) select the RDD's top N
> rows; this may be one solution.
> 2. Introduce a broadcast variable set to N and use it while mapping
> the file RDD, so that only its first N rows are mapped; this may not
> work, however.
>
> Is there some better solution?
>
> Thank you,
> Zhiliang
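
For reference, options 1 and 2 from the reply could be sketched roughly as below. This is only a minimal sketch, not a tested recipe: the object name FirstNLines and the helper names are hypothetical, viaDriver assumes the file is readable from the driver's local file system, and viaTake assumes a single input file so that take(n) returns its leading lines. The third helper, viaIndex, is an alternative that was not proposed in the thread: it keeps the data distributed by filtering on zipWithIndex, which may suit a large N better.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.io.Source

// Hypothetical helper object; names are for illustration only.
object FirstNLines {

  // Option 1: read the first n lines on the driver, then parallelize them.
  // Assumes `path` is on the driver's local file system.
  def viaDriver(sc: SparkContext, path: String, n: Int): RDD[String] = {
    val src = Source.fromFile(path)
    try sc.parallelize(src.getLines().take(n).toList)
    finally src.close()
  }

  // Option 2: load the file as an RDD, take(n) back to the driver,
  // then turn the returned array into an RDD again.
  def viaTake(sc: SparkContext, path: String, n: Int): RDD[String] =
    sc.parallelize(sc.textFile(path).take(n).toSeq)

  // Alternative (not from the thread): keep everything distributed by
  // indexing each line and keeping only indices below n.
  def viaIndex(sc: SparkContext, path: String, n: Long): RDD[String] =
    sc.textFile(path)
      .zipWithIndex()
      .filter { case (_, idx) => idx < n }
      .map { case (line, _) => line }

  def main(args: Array[String]): Unit = {
    // Usage: FirstNLines <path> <n>
    val sc = new SparkContext(
      new SparkConf().setAppName("FirstNLines").setMaster("local[*]"))
    try viaTake(sc, args(0), args(1).toInt).collect().foreach(println)
    finally sc.stop()
  }
}
```

Options 1 and 2 both pull the N lines through the driver, so they fit best when N is small; viaIndex avoids that at the cost of scanning the whole file.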