Re: Remove duplicate keys by always choosing first in file.

2015-09-24 Thread Philip Weaver
> … you define the parallelism upfront: sc.textFile("README.md", 4). You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm skimming through some tuples, hopefully this is clear enough. …

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
> … rdd.zipWithIndex to get the index of each line in the file, as long as you define the parallelism upfront: sc.textFile("README.md", 4). You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm skimming through some tuples …

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
DME.md", 4) >> >> You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m >> skimming through some tuples, hopefully this is clear enough. >> >> -adrian >> >> From: Philip Weaver >> Date: Tuesday, September 22, 2015 at 3:26 AM >>

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
> … sc.textFile("README.md", 4). You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm skimming through some tuples, hopefully this is clear enough. -adrian > From: Philip Weaver > Date: Tuesday, September 22, 2015 at 3:26 AM > To: user …

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
… rdd.zipWithIndex to get the index of each line in the file, as long as you define the parallelism upfront: sc.textFile("README.md", 4). You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I'm skimming through some tuples, hopefully this is clear enough. -adrian > From: Philip Weaver > Date: Tuesday, September 22, 2015 at 3:26 AM > To: user > Subject: Remove duplicate keys by always choosing first in file. > I am processing a single file and want to remove duplicate rows by some key by always choosing the first row in the file for that key. The best solution I could come up with is to zip each row with the part…
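
For reference, a minimal sketch of the zipWithIndex approach Adrian describes. The key extractor (first comma-separated field) and the local SparkContext are assumptions for illustration, not from the thread, and minBy on the grouped values stands in for Adrian's sortBy(…).head:

    import org.apache.spark.SparkContext

    object DedupFirstByGroup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "dedup-first")

        // Fixing the parallelism upfront keeps line order within partitions stable.
        val lines = sc.textFile("README.md", 4)

        val firstPerKey = lines
          .zipWithIndex()                                   // (line, global index in file order)
          .groupBy { case (line, _) => line.split(",")(0) } // assumed key: first CSV field
          .mapValues(_.minBy { case (_, idx) => idx }._1)   // lowest index = first occurrence

        firstPerKey.values.take(10).foreach(println)
        sc.stop()
      }
    }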

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
I don't know of a way to do this, out of the box, without maybe digging into custom InputFormats. The RDD from textFile doesn't have an ordering. I can't imagine a world in which partitions weren't iterated in line order, of course, but there's also no real guarantee about ordering among …

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, I don't think that's what I want. There's no "zero value" in my use case. On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen wrote: > I think foldByKey is much more what you want, as it has more of a notion of building up some result per key by encountering values serially. …

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
The zero value here is None. Combining None with any row should yield Some(row). After that, combining is a no-op for other rows. On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver wrote: > Hmm, I don't think that's what I want. There's no "zero value" in my use case. …
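
Concretely, Sean's Option-based zero might look like the sketch below. The pair contents are hypothetical, and (per the later messages in this thread) "first" still depends on an encounter order that Spark does not guarantee:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object FoldByKeyFirst {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "foldbykey-first")

        // Hypothetical (key, row) pairs with a duplicated key.
        val pairs: RDD[(String, String)] =
          sc.parallelize(Seq("a" -> "row1", "a" -> "row2", "b" -> "row3"))

        val firstPerKey = pairs
          .mapValues(Option(_))                        // lift rows into Option
          .foldByKey(Option.empty[String])(_ orElse _) // None orElse row => Some(row); later rows are no-ops
          .mapValues(_.get)                            // safe: every key contributed at least one row

        firstPerKey.collect().foreach(println)
        sc.stop()
      }
    }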

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
Yes, that's right, though "in order" depends on the RDD having an ordering - but so does the zip-based solution. Actually, I'm going to walk that back a bit, since I don't see a guarantee that foldByKey behaves like foldLeft. The implementation underneath, in combineByKey, appears that it will act …
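
One way to sidestep fold order entirely is to make the merge function commutative by carrying an explicit index - a sketch of that variant of the zip-based solution, with an assumed key extractor (first comma-separated field):

    import org.apache.spark.SparkContext

    object ReduceByKeyFirst {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "reduce-first")
        val lines = sc.textFile("README.md", 4)

        val firstPerKey = lines
          .zipWithIndex()
          .map { case (line, idx) => (line.split(",")(0), (idx, line)) } // assumed key: first CSV field
          .reduceByKey { (a, b) => if (a._1 < b._1) a else b }           // commutative and associative
          .mapValues(_._2)

        firstPerKey.values.take(10).foreach(println)
        sc.stop()
      }
    }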

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, ok, but I'm not seeing why foldByKey is more appropriate than reduceByKey? Specifically, is foldByKey guaranteed to walk the RDD in order, but reduceByKey is not? On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen wrote: > The zero value here is None. Combining None with any …

Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
I am processing a single file and want to remove duplicate rows by some key by always choosing the first row in the file for that key. The best solution I could come up with is to zip each row with the partition index and local index, like this: rdd.mapPartitionsWithIndex { case (partitionIndex, …
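
The snippet above is cut off by the archive. A hedged reconstruction of the shape it describes - tag rows with (partitionIndex, localIndex), then keep the smallest tag per key - might look like this; the key extractor and input path are assumptions:

    import org.apache.spark.SparkContext

    object PartitionIndexDedup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "partition-index-dedup")
        val rdd = sc.textFile("input.txt", 4) // hypothetical input path

        // Tag each row with (partitionIndex, localIndex) - its position in the file.
        val tagged = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
          rows.zipWithIndex.map { case (row, localIndex) =>
            (row.split(",")(0), ((partitionIndex, localIndex), row)) // assumed key: first CSV field
          }
        }

        // Keep the row with the smallest tag, i.e. the first occurrence of each key.
        val firstPerKey = tagged
          .reduceByKey { (a, b) => if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b }
          .mapValues(_._2)

        firstPerKey.values.take(10).foreach(println)
        sc.stop()
      }
    }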

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
I think foldByKey is much more what you want, as it has more of a notion of building up some result per key by encountering values serially. You would take the first and ignore the rest. Note that "first" depends on your RDD having an ordering to begin with, or else you rely on however it happens to …