I have a Hadoop InputFormat which reads records and produces

JavaPairRDD<String,String> locatedData, where
_1() is a formatted version of the file location, like
"000012690", "000024386", "000027523", ...
_2() is the data to be processed

For historical reasons I want to convert _1() into an integer representing
the record number,
so the keys become "00000001", "00000002", ...
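
To make the intended mapping concrete, here is a minimal plain-Java sketch
(no Spark involved) of the renumbering applied to an in-memory list. The
class and method names are made up for illustration, and the 8-digit width
and 1-based numbering are assumptions read off the example keys above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; not part of Spark or Hadoop.
final class RecordKeys {

    // Format a record number as a zero-padded 8-digit key (width is an assumption).
    static String recordKey(long recordNumber) {
        return String.format(Locale.ROOT, "%08d", recordNumber);
    }

    // Replace each location key with the 1-based position of the record,
    // keeping the values and their order unchanged.
    static List<Map.Entry<String, String>> renumber(List<Map.Entry<String, String>> located) {
        List<Map.Entry<String, String>> out = new ArrayList<>(located.size());
        long n = 1;
        for (Map.Entry<String, String> e : located) {
            out.add(new SimpleEntry<>(recordKey(n++), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> located = new ArrayList<>();
        located.add(new SimpleEntry<>("000012690", "first record"));
        located.add(new SimpleEntry<>("000024386", "second record"));
        located.add(new SimpleEntry<>("000027523", "third record"));
        for (Map.Entry<String, String> e : renumber(located)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

This only shows the key format and ordering I am after; the open question is
how to get the same effect on the RDD without pulling everything into memory.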

(Yes, I know this cannot be done in parallel.) The PairRDD may be too large
to collect into memory on one machine, but it is small enough to be
processed on a single machine.
I could use toLocalIterator to guarantee execution on one machine, but the
last time I tried this, all kinds of jobs were launched to get the next
element of the iterator, and I was not convinced this approach was efficient.
Any bright ideas?
