Just give zipWithIndex a shot, and use it early in the pipeline. I think it 
provides exactly the info you need, as the index it assigns is the original line 
number in the file, not the index within the partition.
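
For example, a rough sketch - extractKey here is just a placeholder for however 
you derive the dedup key from a line:

val lines = sc.textFile("README.md")

val firstPerKey = lines
  .zipWithIndex()                                        // (line, original line number)
  .map { case (line, idx) => (extractKey(line), (idx, line)) }
  .reduceByKey { (a, b) => if (a._1 <= b._1) a else b }  // keep the lowest line number per key
  .map { case (_, (_, line)) => line }                   // drop key and index, keep the line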

Sent from my iPhone

On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com> wrote:

Thanks. If textFile can be used in a way that preserves order, then both the 
partition index and the index within each partition should be consistent, right?

I overcomplicated the question by asking about removing duplicates. 
Fundamentally, I think my question is: how does one sort the lines in a file by 
line number?

On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com> wrote:
By looking through the docs and source code, I think you can get away with 
rdd.zipWithIndex to get the index of each line in the file, as long as you 
define the parallelism upfront:
sc.textFile("README.md", 4)

You can then just do .groupBy(...).mapValues(_.sortBy(...).head) - I'm glossing 
over some of the tuple handling, but hopefully this is clear enough.
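
Filled in, a rough sketch might look like this - extractKey is again a 
placeholder for the dedup key, and minBy stands in for sortBy(...).head (same 
result, without a full sort):

val firstPerKey = sc.textFile("README.md", 4)
  .zipWithIndex()                                   // (line, index in file)
  .groupBy { case (line, _) => extractKey(line) }
  .mapValues(_.minBy(_._2))                         // first occurrence = smallest index
  .map { case (_, (line, _)) => line }              // keep just the line

Note that groupBy materializes all rows for a key on one node, so a reduceByKey 
is cheaper if the groups can get large.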

-adrian

From: Philip Weaver
Date: Tuesday, September 22, 2015 at 3:26 AM
To: user
Subject: Remove duplicate keys by always choosing first in file.

I am processing a single file and want to remove duplicate rows by some key by 
always choosing the first row in the file for that key.

The best solution I could come up with is to zip each row with the partition 
index and local index, like this:

rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
  rows.zipWithIndex.map { case (row, localIndex) =>
    (row.key, ((partitionIndex, localIndex), row))
  }
}

And then using reduceByKey with a min ordering on the (partitionIndex, 
localIndex) pair, so that only the first row in the file survives for each key.
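
Concretely, that last step might look something like this, assuming the snippet 
above is assigned to a val called keyed:

import scala.math.Ordering.Implicits._   // provides <= on (partitionIndex, localIndex) tuples

val deduped = keyed
  .reduceByKey { (a, b) => if (a._1 <= b._1) a else b }  // min on (partitionIndex, localIndex)
  .map { case (_, (_, row)) => row }                     // drop key and index pair, keep the row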

First, can I count on SparkContext.textFile to read the lines in order, such 
that the partition indexes are always increasing and the above works?

And, is there a better way to accomplish the same effect?

Thanks!

- Philip

