There are multiple ways to achieve this:

1. Read the first N lines on the driver, then call sc.parallelize(nlines) to
create an RDD from them.
2. Create an RDD from all N+M lines, call take(N) on it, and then broadcast
or parallelize the returned list.
3. If the file is in HDFS, something like this:

    val n_f = (5, file_name)
    val n_lines = sc.parallelize(Array(n_f))
    val n_linesRDD = n_lines.map(n => {
      // Read and return 5 lines (n._1) from the file (n._2)
    })
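
To make the three options a bit more concrete, here is a rough sketch of each. It is not tested against your cluster. The object and method names (FirstNLines, viaDriver, viaTake, viaTask) are made up for illustration, and I am assuming UTF-8 text, a file small enough that N lines fit in driver memory, and (for option 3) that the executors can see the same Hadoop configuration as the driver.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

import scala.io.Source

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names come from Spark itself.
object FirstNLines {

  // Option 1: read the first n lines of a local file on the driver,
  // then turn that small list into an RDD.
  def viaDriver(sc: SparkContext, localPath: String, n: Int): RDD[String] = {
    val source = Source.fromFile(localPath, "UTF-8")
    val lines = try source.getLines().take(n).toList finally source.close()
    sc.parallelize(lines)
  }

  // Option 2: build an RDD over the whole file, take(n) back to the driver,
  // then parallelize the returned array.
  def viaTake(sc: SparkContext, path: String, n: Int): RDD[String] =
    sc.parallelize(sc.textFile(path).take(n).toSeq)

  // Option 3: ship (n, path) to a single task, which opens the file through
  // the Hadoop FileSystem API and reads only the first n lines.
  // Assumption: a default Configuration on the executor resolves the path.
  def viaTask(sc: SparkContext, path: String, n: Int): RDD[String] =
    sc.parallelize(Seq((n, path)), numSlices = 1).flatMap { case (count, file) =>
      val p = new Path(file)
      val fs = p.getFileSystem(new Configuration())
      val reader = new BufferedReader(
        new InputStreamReader(fs.open(p), StandardCharsets.UTF_8))
      try Iterator.continually(reader.readLine()).takeWhile(_ != null).take(count).toList
      finally reader.close()
    }
}
```

For example, FirstNLines.viaTake(sc, file_name, 5) should give an RDD holding the first 5 lines. Options 1 and 2 both pass the N lines through the driver, so they only make sense when N is modest. Option 3 keeps the read on an executor, but the whole read happens in one task.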

Thanks
Best Regards

On Thu, Oct 29, 2015 at 9:51 PM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid>
wrote:

> Hi All,
>
> I have a file with N + M lines, and I need to read the first N
> lines into one RDD.
>
> 1. i) Read all N + M lines as one RDD, then ii) select the RDD's top N
> rows. This may be one solution.
> 2. Introduce a broadcast variable holding N and use it to decide, while
> mapping the file RDD, to map only its first N rows. This may not
> work, however.
>
> Is there some better solution?
>
> Thank you,
> Zhiliang
>
>
