Yes, that's right, though "in order" depends on the RDD having an
ordering in the first place; the zip-based solution has the same
dependency.

Actually, I'm going to walk that back a bit, since I don't see a
documented guarantee that foldByKey behaves like foldLeft. The
underlying implementation in combineByKey does appear to act this way
in practice, though.
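
Concretely, the sketch I had in mind is something like this (untested,
and assuming an RDD[(K, Row)]; note the caveat above that left-to-right
combining isn't documented):

rdd.mapValues(v => Option(v))   // lift values so the zero can be None
  .foldByKey(None: Option[Row]) { (acc, v) =>
    acc.orElse(v)               // first Some wins; later values are ignored
  }
  .mapValues(_.get)             // safe: every key saw at least one row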

On Tue, Sep 22, 2015 at 4:45 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> Hmm, ok, but I'm not seeing why foldByKey is more appropriate than
> reduceByKey. Specifically, is foldByKey guaranteed to walk the RDD in
> order while reduceByKey is not?
>
> On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The zero value here is None. Combining None with any row should yield
>> Some(row). After that, combining with any further row is a no-op.
>>
>> On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>> > Hmm, I don't think that's what I want. There's no "zero value" in my use
>> > case.
>> >
>> > On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think foldByKey is much more what you want, as it has more of a
>> >> notion of building up some result per key by encountering values
>> >> serially.
>> >> You would take the first and ignore the rest. Note that "first"
>> >> depends on your RDD having an ordering to begin with, or else you rely
>> >> on however it happens to be ordered after whatever operations give you
>> >> a key-value RDD.
>> >>
>> >> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver
>> >> <philip.wea...@gmail.com>
>> >> wrote:
>> >> > I am processing a single file and want to remove duplicate rows by
>> >> > some key, by always choosing the first row in the file for that key.
>> >> >
>> >> > The best solution I could come up with is to zip each row with the
>> >> > partition index and local index, like this:
>> >> >
>> >> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>> >> >   rows.zipWithIndex.map { case (row, localIndex) =>
>> >> >     (row.key, ((partitionIndex, localIndex), row))
>> >> >   }
>> >> > }
>> >> >
>> >> > And then using reduceByKey with a min ordering on the
>> >> > (partitionIndex, localIndex) pair.
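>> >> >
>> >> > Roughly (untested; writing "keyed" for the keyed RDD built above):
>> >> >
>> >> > val deduped = keyed
>> >> >   .reduceByKey { (a, b) =>
>> >> >     // keep whichever value carries the smaller
>> >> >     // (partitionIndex, localIndex) tag
>> >> >     if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b
>> >> >   }
>> >> >   .values      // drop the key
>> >> >   .map(_._2)   // drop the index tag, keeping only the row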
>> >> >
>> >> > First, can I count on SparkContext.textFile to read the lines such
>> >> > that the partition indexes are always increasing, so that the above
>> >> > works?
>> >> >
>> >> > And, is there a better way to accomplish the same effect?
>> >> >
>> >> > Thanks!
>> >> >
>> >> > - Philip
>> >> >
>> >
>> >
>
>
