Just give zipWithIndex a shot, and use it early in the pipeline. I think it provides exactly the info you need: the index is the original line number in the file, not the index within the partition.
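The advice above can be sketched as follows. This is a minimal, hedged example, not code from the thread: it assumes each line is comma-separated with the key in the first field, and that keeping the row with the smallest global index is the goal. The file name and key extraction are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DedupByFirstLine {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dedup-sketch").setMaster("local[4]"))

    // Reading with explicit parallelism, as suggested later in the thread.
    val lines = sc.textFile("README.md", 4)

    val firstPerKey = lines
      .zipWithIndex()                                   // (line, global line number)
      .map { case (line, idx) =>
        (line.split(",")(0), (idx, line))               // hypothetical key: first CSV field
      }
      .reduceByKey { (a, b) => if (a._1 < b._1) a else b } // keep the smallest line number
      .values
      .map(_._2)

    firstPerKey.collect().foreach(println)
    sc.stop()
  }
}
```

Using reduceByKey on the global index avoids the groupBy-then-sort step entirely, since only the minimum index per key needs to survive.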
On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com> wrote:

    Thanks. If textFile can be used in a way that preserves order, then both the
    partition index and the index within each partition should be consistent, right?

    I overcomplicated the question by asking about removing duplicates.
    Fundamentally, I think my question is: how does one sort the lines in a file
    by line number?

On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com> wrote:

    Looking through the docs and source code, I think you can get away with
    rdd.zipWithIndex to get the index of each line in the file, as long as you
    define the parallelism upfront:

        sc.textFile("README.md", 4)

    You can then just do .groupBy(...).mapValues(_.sortBy(...).head); I'm
    skimming over some tuples, hopefully this is clear enough.

    -adrian

From: Philip Weaver
Date: Tuesday, September 22, 2015 at 3:26 AM
To: user
Subject: Remove duplicate keys by always choosing first in file.

    I am processing a single file and want to remove duplicate rows by some key,
    always choosing the first row in the file for that key. The best solution I
    could come up with is to zip each row with the partition index and local
    index, like this:

        rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
          rows.zipWithIndex.map { case (row, localIndex) =>
            (row.key, ((partitionIndex, localIndex), row))
          }
        }

    and then using reduceByKey with a min ordering on the
    (partitionIndex, localIndex) pair.

    First, can I count on SparkContext.textFile to read the lines in such that
    the partition indexes are always increasing, so that the above works? And is
    there a better way to accomplish the same effect?

    Thanks!
    - Philip
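The original mapPartitionsWithIndex approach, completed with the reduceByKey step it describes, might look like the sketch below. This is an assumption-laden illustration, not the poster's actual code: `Row` and its `key` field are hypothetical stand-ins for whatever record type the pipeline uses, and the RDD source is schematic.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record type; the thread only refers to "row.key" abstractly.
case class Row(key: String, payload: String)

object FirstRowPerKey {
  def firstPerKey(rdd: RDD[Row]): RDD[(String, Row)] = {
    // Tag every row with (partitionIndex, localIndex), its position in the file,
    // assuming textFile assigns partition indexes in increasing file order.
    val keyed = rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }

    // Reduce with a min ordering on the (partitionIndex, localIndex) pair,
    // so the earliest occurrence in the file wins for each key.
    val ord = Ordering.Tuple2[Int, Int]
    keyed
      .reduceByKey { (a, b) => if (ord.lteq(a._1, b._1)) a else b }
      .mapValues(_._2)
  }
}
```

reduceByKey combines values map-side before shuffling, so for heavily duplicated keys this should be cheaper than the groupBy-and-sort alternative, which materializes every duplicate for a key in memory.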