Re: skip lines in spark

Chengi Liu Wed, 23 Apr 2014 09:56:45 -0700

Xiangrui,
  So, is it that full code suggestion is :
val trigger = rddData.zipWithIndex().filter(
_._2 >= 10L).map(_._1)


and then what DB Tsai recommended
trigger.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String])
=> {
  if (partitionIdx == 0) {
    lines.drop(n)
  }
  lines
})

Is that the full operation..

What happens, if I have to drop so many records that the number exceeds
partition 0.. ??
How do i handle that case?




On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng <[email protected]> wrote:

> If the first partition doesn't have enough records, then it may not
> drop enough lines. Try
>
> rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
>
> It might trigger a job.
>
> Best,
> Xiangrui
>
> On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai <[email protected]> wrote:
> > Hi Chengi,
> >
> > If you just want to skip first n lines in RDD, you can do
> >
> > rddData.mapPartitionsWithIndex((partitionIdx: Int, lines:
> Iterator[String])
> > => {
> >   if (partitionIdx == 0) {
> >     lines.drop(n)
> >   }
> >   lines
> > }
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu <[email protected]>
> wrote:
> >>
> >> Hi,
> >>   What is the easiest way to skip first n lines in rdd??
> >> I am not able to figure this one out?
> >> Thanks
> >
> >
>

Re: skip lines in spark

Reply via email to