I have a Hadoop InputFormat which reads records and produces a JavaPairRDD<String,String> locatedData, where _1() is a formatted version of the file location, like "000012690", "000024386", "000027523", ... and _2() is the data to be processed.
For historical reasons I want to convert _1() into an integer representing the record number, so the keys become "00000001", "00000002", ... (Yes, I know this cannot be done in parallel.) The PairRDD may be too large to collect in memory, but it is small enough for a single machine to process. I could use toLocalIterator to guarantee execution on one machine, but the last time I tried this, a separate job seemed to be launched each time the iterator needed more elements, and I was not convinced the approach was efficient. Any bright ideas?
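
To make the goal concrete, here is a minimal sketch of the renumbering on a small in-memory list. It is plain Java, not Spark, and it only illustrates the result I want, not a way to get it efficiently from an RDD. The class name RenumberSketch, the method renumber and the sample values are hypothetical, and it assumes the records already arrive in file-location order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: shows the desired key rewrite on a local list.
public class RenumberSketch {

    // Replace each location key with a 1-based, zero-padded 8-digit record
    // number, keeping the records in the order they were given.
    public static List<Map.Entry<String, String>> renumber(
            List<Map.Entry<String, String>> located) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        long recordNumber = 1;
        for (Map.Entry<String, String> e : located) {
            String newKey = String.format("%08d", recordNumber++);
            out.add(new SimpleEntry<>(newKey, e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample (location, data) pairs; the data strings are made up.
        List<Map.Entry<String, String>> located = new ArrayList<>();
        located.add(new SimpleEntry<>("000012690", "first record"));
        located.add(new SimpleEntry<>("000024386", "second record"));
        located.add(new SimpleEntry<>("000027523", "third record"));

        for (Map.Entry<String, String> e : renumber(located)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The question is how to get the same effect on locatedData without pulling the whole RDD into memory at once.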