Yes, that's right, though "in order" depends on the RDD having an ordering. But so does the zip-based solution.
Actually, I'm going to walk that back a bit, since I don't see a guarantee that foldByKey behaves like foldLeft. The implementation underneath, in combineByKey, appears to act this way in practice, though.

On Tue, Sep 22, 2015 at 4:45 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> Hmm, ok, but I'm not seeing why foldByKey is more appropriate than
> reduceByKey? Specifically, is foldByKey guaranteed to walk the RDD in
> order, but reduceByKey is not?
>
> On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The zero value here is None. Combining None with any row should yield
>> Some(row). After that, combining is a no-op for other rows.
>>
>> On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver <philip.wea...@gmail.com>
>> wrote:
>> > Hmm, I don't think that's what I want. There's no "zero value" in my
>> > use case.
>> >
>> > On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think foldByKey is much more what you want, as it has more of a
>> >> notion of building up some result per key by encountering values
>> >> serially. You would take the first and ignore the rest. Note that
>> >> "first" depends on your RDD having an ordering to begin with, or else
>> >> you rely on however it happens to be ordered after whatever operations
>> >> give you a key-value RDD.
>> >>
>> >> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver
>> >> <philip.wea...@gmail.com> wrote:
>> >> > I am processing a single file and want to remove duplicate rows by
>> >> > some key, by always choosing the first row in the file for that key.
>> >> >
>> >> > The best solution I could come up with is to zip each row with the
>> >> > partition index and local index, like this:
>> >> >
>> >> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
>> >> >   rows.zipWithIndex.map { case (row, localIndex) =>
>> >> >     (row.key, ((partitionIndex, localIndex), row))
>> >> >   }
>> >> > }
>> >> >
>> >> > And then using reduceByKey with a min ordering on the
>> >> > (partitionIndex, localIndex) pair.
>> >> >
>> >> > First, can I count on SparkContext.textFile to read the lines in
>> >> > such a way that the partition indexes are always increasing, so that
>> >> > the above works?
>> >> >
>> >> > And, is there a better way to accomplish the same effect?
>> >> >
>> >> > Thanks!
>> >> >
>> >> > - Philip
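
For reference, here is a minimal, self-contained sketch of both approaches discussed above. It is written against the Spark RDD API in Scala; the input path ("input.txt"), the local[*] master, and the comma-separated key extraction are all hypothetical placeholders, not anything from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object FirstRowPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("first-row-per-key").setMaster("local[*]"))

    // Hypothetical input: one record per line, with the dedup key in the
    // first comma-separated field.
    val lines = sc.textFile("input.txt")
    def key(line: String): String = line.split(",")(0)

    // Sean's suggestion: foldByKey with a zero of None. Combining None with
    // a value yields that value wrapped in Some; after that, combining is a
    // no-op, so the first value the fold encounters for each key survives.
    // Caveat from the thread: there is no documented guarantee that
    // foldByKey combines values in encounter order the way foldLeft would.
    val firstByFold = lines
      .map(line => (key(line), Option(line)))
      .foldByKey(Option.empty[String])((a, b) => a.orElse(b))
      .mapValues(_.get)  // safe: every key saw at least one Some

    // Philip's approach: tag each row with its (partitionIndex, localIndex),
    // then keep the row with the smallest tag per key. This relies on
    // textFile assigning partition indexes in file order.
    val ord = Ordering.Tuple2[Int, Int]
    val firstByIndex = lines
      .mapPartitionsWithIndex { case (partitionIndex, rows) =>
        rows.zipWithIndex.map { case (line, localIndex) =>
          (key(line), ((partitionIndex, localIndex), line))
        }
      }
      .reduceByKey((a, b) => if (ord.lteq(a._1, b._1)) a else b)
      .mapValues(_._2)

    firstByIndex.collect().foreach(println)
    sc.stop()
  }
}

The index-based version is deterministic given the file layout, at the cost of carrying an extra tag per row; the foldByKey version is simpler but, per the walk-back above, leans on unguaranteed combine order.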