Oops, I didn't catch the suggestion to just use RDD.zipWithIndex, which I
forgot existed (and which, I've discovered, I actually used in another
project!). I will use that instead of the mapPartitionsWithIndex/zipWithIndex
solution that I posted originally.
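
For reference, a rough, untested sketch of the zipWithIndex version (keyOf is
a hypothetical key extractor, standing in for however you derive the key from
a row):

    val deduped = rdd
      .zipWithIndex                                   // (row, global line index)
      .map { case (row, idx) => (keyOf(row), (idx, row)) }
      .reduceByKey { (a, b) => if (a._1 <= b._1) a else b }  // keep lowest index
      .map { case (_, (_, row)) => row }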

On Tue, Sep 22, 2015 at 9:07 AM, Philip Weaver <philip.wea...@gmail.com>
wrote:

> The indices are definitely necessary. My first solution was just
> reduceByKey { case (v, _) => v } and that didn't work. I needed to look at
> both values and see which had the lower index.
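>
> Roughly, the working version was something like this (sketch; the values
> are ((partitionIndex, localIndex), row) pairs):
>
> keyed.reduceByKey { (v1, v2) =>
>   if (Ordering[(Int, Int)].lteq(v1._1, v2._1)) v1 else v2
> }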
>
> On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> The point is that this only works if you already knew the file was
>> presented in order within and across partitions, which was the original
>> problem anyway. I don't think that ordering is guaranteed in general,
>> but in practice, I do imagine it's already in the expected order from
>> textFile. Maybe under the hood this ends up being ensured by
>> TextInputFormat.
>>
>> So, adding the index and sorting on it doesn't add anything.
>>
>> On Tue, Sep 22, 2015 at 4:38 PM, Adrian Tanase <atan...@adobe.com> wrote:
>> > Just give zipWithIndex a shot, use it early in the pipeline. I think it
>> > provides exactly the info you need, as the index is the original line
>> > number in the file, not the index in the partition.
>> >
>> > Sent from my iPhone
>> >
>> > On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com>
>> > wrote:
>> >
>> > Thanks. If textFile can be used in a way that preserves order, then
>> > both the partition index and the index within each partition should be
>> > consistent, right?
>> >
>> > I overcomplicated the question by asking about removing duplicates.
>> > Fundamentally, I think my question is: how does one sort the lines in
>> > a file by line number?
>> >
>> > On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com>
>> > wrote:
>> >>
>> >> By looking through the docs and source code, I think you can get away
>> >> with rdd.zipWithIndex to get the index of each line in the file, as
>> >> long as you define the parallelism upfront:
>> >> sc.textFile("README.md", 4)
>> >>
>> >> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm
>> >> glossing over some of the tuple plumbing here, but hopefully the idea
>> >> is clear enough.
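>> >>
>> >> Putting it together, a rough, untested sketch (keyOf is a placeholder
>> >> for whatever key extractor you use):
>> >>
>> >> sc.textFile("README.md", 4)
>> >>   .zipWithIndex                                  // (line, index)
>> >>   .map { case (line, idx) => (keyOf(line), (idx, line)) }
>> >>   .groupByKey()
>> >>   .mapValues(_.toSeq.sortBy(_._1).head._2)       // first occurrence wins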
>> >>
>> >> -adrian
>> >>
>> >> From: Philip Weaver
>> >> Date: Tuesday, September 22, 2015 at 3:26 AM
>> >> To: user
>> >> Subject: Remove duplicate keys by always choosing first in file.
>> >>
>> >> I am processing a single file and want to remove duplicate rows by
>> >> some key, always choosing the first row in the file for that key.
>> >>
>> >> The best solution I could come up with is to zip each row with the
>> >> partition index and local index, like this:
>> >>
>> >> rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>> >>   rows.zipWithIndex.map { case (row, localIndex) =>
>> >>     (row.key, ((partitionIndex, localIndex), row))
>> >>   }
>> >> }
>> >>
>> >>
>> >> And then using reduceByKey with a min ordering on the (partitionIndex,
>> >> localIndex) pair.
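>> >>
>> >> That reduce step would look roughly like this (untested sketch; keyed
>> >> stands for the result of the snippet above, and the tuple comparison
>> >> comes from scala.math.Ordering.Implicits):
>> >>
>> >> import scala.math.Ordering.Implicits._
>> >>
>> >> keyed
>> >>   .reduceByKey { (a, b) => if (a._1 <= b._1) a else b }  // min index wins
>> >>   .map { case (_, (_, row)) => row }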
>> >>
>> >> First, can I count on SparkContext.textFile to read the lines in such
>> >> a way that the partition indexes are always increasing, so that the
>> >> above works?
>> >>
>> >> And, is there a better way to accomplish the same effect?
>> >>
>> >> Thanks!
>> >>
>> >> - Philip
>> >>
>> >
>>
>
>
