Re: skip lines in spark

Xiangrui Meng Wed, 23 Apr 2014 09:52:01 -0700

If the first partition has fewer than n records, that approach will
drop fewer than n lines. Try this instead (shown with n = 10):


rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)

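Note that zipWithIndex triggers a Spark job when the RDD has more than
one partition, because it needs the record count of each partition to
assign the indices.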
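
To illustrate the difference, here is a rough sketch that simulates
partitions with plain Scala collections rather than Spark itself. Each
inner Seq stands in for one partition, and the names SkipLinesSketch,
dropFromFirstPartition and dropViaGlobalIndex are made up for this
example. It only shows the logic, not how Spark executes it.

```scala
// Plain-Scala simulation, no Spark dependency. All names are hypothetical.
object SkipLinesSketch {
  // Each inner Seq stands in for one RDD partition.
  type Partitions = Seq[Seq[String]]

  // Mirrors the per-partition idea: drop n records from partition 0 only.
  def dropFromFirstPartition(parts: Partitions, n: Int): Seq[String] =
    parts.zipWithIndex.flatMap { case (lines, idx) =>
      if (idx == 0) lines.drop(n) else lines
    }

  // Mirrors the zipWithIndex idea: index records globally, then filter.
  def dropViaGlobalIndex(parts: Partitions, n: Long): Seq[String] =
    parts.flatten.zipWithIndex
      .filter { case (_, i) => i >= n }
      .map { case (line, _) => line }

  def main(args: Array[String]): Unit = {
    // The first "partition" holds only 2 records, but we want to skip 3.
    val parts: Partitions = Seq(Seq("a", "b"), Seq("c", "d", "e"))
    println(dropFromFirstPartition(parts, 3)) // List(c, d, e): only 2 skipped
    println(dropViaGlobalIndex(parts, 3L))    // List(d, e)
  }
}
```

With a short first partition, the per-partition version skips only as
many lines as that partition holds, while the global index skips
exactly n.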

Best,
Xiangrui

On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi Chengi,
>
> If you just want to skip the first n lines in an RDD, you can do
>
> rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String])
> => {
>   // drop returns a new iterator, so its result must be the return value
>   if (partitionIdx == 0) lines.drop(n) else lines
> })
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>>
>> Hi,
>>   What is the easiest way to skip the first n lines in an RDD?
>> I am not able to figure this one out.
>> Thanks
>
>
