Fwd: Numbering RDD members Sequentially

Steve Lewis Wed, 11 Mar 2015 17:19:10 -0700

---------- Forwarded message ----------
From: Steve Lewis <lordjoe2...@gmail.com>
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: "Daniel, Ronald (ELS-SDG)" <r.dan...@elsevier.com>



Perfect, exactly what I was looking for. I'm not quite sure why it is
called zipWithIndex, since zipping is not involved.
My code does something like this, where IMeasuredSpectrum is a large class
that we want to set an index on:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
    // zipWithIndex pairs every element with its 0-based position in the RDD
    JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
    return indexed.map(new AddIndexToSpectrum());
}

public class AddIndexToSpectrum implements Function<Tuple2<IMeasuredSpectrum, Long>, IMeasuredSpectrum> {
    @Override
    public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, Long> v1) throws Exception {
        IMeasuredSpectrum spec = v1._1();
        long index = v1._2();
        spec.setIndex(index + 1); // store a 1-based index
        return spec;
    }
}
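
On the name, and on the worry that numbering cannot be done in parallel: as far as I understand it, zipWithIndex pairs each element with its position, and it gets away with a parallel pass by first counting the elements in each partition and then letting every partition number its own elements starting from the total size of the partitions before it. The following is only a plain-Java model of that idea, not Spark's actual code; the class and method names (ZipWithIndexSketch, startOffsets, indexPartition) are made up for illustration, and the nested lists stand in for partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ZipWithIndexSketch {

    // Step 1: the start offset of a partition is the total size of all earlier partitions.
    static long[] startOffsets(List<List<String>> partitions) {
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        return offsets;
    }

    // Step 2: number one partition on its own; only its offset is needed,
    // so every partition could be processed independently of the others.
    static List<String> indexPartition(List<String> partition, long offset) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            out.add((offset + i) + ":" + partition.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three hypothetical partitions of sizes 2, 1 and 3.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        long[] offsets = startOffsets(partitions);
        for (int p = 0; p < partitions.size(); p++) {
            System.out.println(indexPartition(partitions.get(p), offsets[p]));
        }
    }
}
```

The counting step is why zipWithIndex has to launch an extra job when the RDD has more than one partition; the numbering itself then needs no coordination between partitions.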


On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

>  Have you looked at zipWithIndex?
>
>
>
> *From:* Steve Lewis [mailto:lordjoe2...@gmail.com]
> *Sent:* Tuesday, March 10, 2015 5:31 PM
> *To:* user@spark.apache.org
> *Subject:* Numbering RDD members Sequentially
>
>
>
> I have a Hadoop InputFormat which reads records and produces
>
>
>
> JavaPairRDD<String,String> locatedData  where
>
> _1() is a formatted version of the file location - like
>
> "000012690", "000024386", "000027523" ...
>
> _2() is data to be processed
>
>
>
> For historical reasons I want to convert _1() into an integer
> representing the record number,
>
> so keys become "00000001", "00000002" ...
>
>
>
> (Yes, I know this cannot be done in parallel.) The PairRDD may be too large
> to collect into memory, but it is small enough to process sequentially on a
> single machine.
> I could use toLocalIterator to guarantee execution on one machine, but the
> last time I tried this, all kinds of jobs were launched to get the next
> element of the iterator, and I was not convinced this approach was efficient.
>
>
>
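
For the original request (keys such as "00000001"), the index that zipWithIndex hands back is 0-based, so turning it into a record-number key is a matter of adding one and zero-padding. A minimal sketch in plain Java follows; the RecordKeys class and recordKey method are hypothetical names, and the width of 8 digits is an assumption read off the example key in the question.

```java
import java.util.Locale;

final class RecordKeys {

    // Converts a 0-based index into a 1-based, zero-padded record key.
    // The width of 8 is assumed from the "00000001" example; adjust as needed.
    static String recordKey(long zeroBasedIndex) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%08d", zeroBasedIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println(recordKey(0));   // prints 00000001
        System.out.println(recordKey(11));  // prints 00000012
    }
}
```

In a Spark job, something like this would presumably be called from the function passed to mapToPair on the result of zipWithIndex, replacing the old location key with recordKey(index).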
