Re: skip lines in spark

Chengi Liu Wed, 23 Apr 2014 09:57:21 -0700

Also, zipWithIndex() is not valid. Did you mean zipPartitions?
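
If zipWithIndex() is unavailable in the Spark build in use, one possible workaround is to count the records in each partition first and then let every partition drop its own share of the first n records. This also covers the case where n exceeds the size of partition 0. The following is only a minimal sketch under those assumptions: `dropCounts` and `skipFirst` are hypothetical helper names, not part of Spark, and the counting pass triggers an extra job.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: given the record count of each partition (in partition
// order), return how many leading records each partition must drop so that
// exactly the first n records of the whole data set are skipped.
def dropCounts(counts: Array[Long], n: Long): Array[Long] = {
  // offsets(i) = number of records in partitions 0 until i
  val offsets = counts.scanLeft(0L)(_ + _)
  counts.indices
    .map(i => math.max(0L, math.min(n - offsets(i), counts(i))))
    .toArray
}

// Hypothetical helper: skip the first n records of an RDD without zipWithIndex.
def skipFirst[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] = {
  // First pass (runs a job): count the records in each partition.
  val counts: Array[Long] = rdd
    .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size.toLong)))
    .collect()
    .sortBy(_._1)
    .map(_._2)

  val toDrop = dropCounts(counts, n)

  // Second pass (lazy): each partition returns its iterator minus its share.
  rdd.mapPartitionsWithIndex((idx, it) => it.drop(toDrop(idx).toInt))
}
```

Usage would look like `val trigger = skipFirst(rddData, 10L)`. Because the data is read twice, caching rddData beforehand may be worthwhile if it is expensive to recompute.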


On Wed, Apr 23, 2014 at 9:55 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:

> Xiangrui,
>   So is the full suggested code:
> val trigger = rddData.zipWithIndex().filter(
> _._2 >= 10L).map(_._1)
>
> and then what DB Tsai recommended:
> trigger.mapPartitionsWithIndex(
>   (partitionIdx: Int, lines: Iterator[String]) => {
>     // drop returns a new iterator, so return it from the branch
>     if (partitionIdx == 0) lines.drop(n) else lines
>   })
>
> Is that the full operation?
>
> What happens if I have to drop more records than partition 0 contains?
> How do I handle that case?
>
>
>
>
> On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> If the first partition doesn't have enough records, then it may not
>> drop enough lines. Try
>>
>> rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
>>
>> It might trigger a job.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai <dbt...@stanford.edu> wrote:
>> > Hi Chengi,
>> >
>> > If you just want to skip the first n lines in an RDD, you can do
>> >
>> > rddData.mapPartitionsWithIndex(
>> >   (partitionIdx: Int, lines: Iterator[String]) => {
>> >     // drop returns a new iterator, so return it from the branch
>> >     if (partitionIdx == 0) lines.drop(n) else lines
>> >   })
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>   What is the easiest way to skip the first n lines in an RDD?
>> >> I have not been able to figure this one out.
>> >> Thanks
>> >
>> >
>>
>
>
