---------- Forwarded message ----------
From: Steve Lewis <[email protected]>
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: "Daniel, Ronald (ELS-SDG)" <[email protected]>
Perfect, exactly what I was looking for. I am not quite sure why it is
called zipWithIndex, since no zipping appears to be involved.
My code does something like this, where IMeasuredSpectrum is a large class
that we want to set an index on:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
    // zipWithIndex pairs each element with its 0-based position in the RDD
    JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
    return indexed.map(new AddIndexToSpectrum());
}

public class AddIndexToSpectrum implements
        Function<Tuple2<IMeasuredSpectrum, Long>, IMeasuredSpectrum> {
    // Spark's Function interface declares call(), so that is the method to override
    @Override
    public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, Long> v1) throws Exception {
        IMeasuredSpectrum spec = v1._1();
        long index = v1._2();
        spec.setIndex(index + 1); // store a 1-based index
        return spec;
    }
}
}
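
On the question of whether the numbering can be done in parallel: the Spark
API docs note that zipWithIndex triggers a job when the RDD has more than one
partition, which fits a two-pass scheme (count each partition, then number
each partition from its own starting offset). Below is a minimal plain-Java
sketch of that idea, not Spark code. The class and method names are
hypothetical, in-memory lists stand in for partitions, and the 8-digit
zero-padded, 1-based key format is an assumption taken from the "00000001"
example in the original question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: lists stand in for RDD partitions.
public class ZipWithIndexSketch {

    // Pass 1: count the elements of each partition; the running total
    // before a partition is the index of that partition's first element.
    static long[] startOffsets(List<List<String>> partitions) {
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        return offsets;
    }

    // Pass 2: number one partition on its own, starting from its offset.
    // Partitions do not depend on each other here, so this step could run
    // in parallel.
    static List<String> numberPartition(List<String> partition, long offset) {
        List<String> out = new ArrayList<>();
        long index = offset;
        for (String value : partition) {
            // Assumed key format: 1-based, zero-padded to 8 digits.
            out.add(String.format("%08d", index + 1) + "\t" + value);
            index++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"),
                Arrays.asList("e", "f"));
        long[] offsets = startOffsets(partitions);
        for (int p = 0; p < partitions.size(); p++) {
            for (String line : numberPartition(partitions.get(p), offsets[p])) {
                System.out.println(line);
            }
        }
    }
}
```

With partitions of sizes 3, 1, and 2, the offsets come out as 0, 3, and 4, so
the records of the last partition get keys 00000005 and 00000006. Only the
per-partition counts need to be gathered in one place; the records themselves
never have to be collected on a single machine.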
On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <
[email protected]> wrote:
> Have you looked at zipWithIndex?
>
>
>
> *From:* Steve Lewis [mailto:[email protected]]
> *Sent:* Tuesday, March 10, 2015 5:31 PM
> *To:* [email protected]
> *Subject:* Numbering RDD members Sequentially
>
>
>
> I have a Hadoop InputFormat which reads records and produces
>
>
>
> JavaPairRDD<String,String> locatedData where
>
> _1() is a formatted version of the file location - like
>
> "000012690", "000024386", "000027523" ...
>
> _2() is data to be processed
>
>
>
> For historical reasons I want to convert _1() into an integer
> representing the record number.
>
> so keys become "00000001", "00000002" ...
>
>
>
> (Yes, I know this cannot be done in parallel.) The PairRDD may be too large
> to collect and work on in memory, but it is small enough to be handled by a
> single machine.
> I could use toLocalIterator to guarantee execution on one machine, but the
> last time I tried this, all kinds of jobs were launched to get the next
> element of the iterator, and I was not convinced this approach was efficient.
>
>
>