I am trying to sort some data. The data contains names, and I was trying to sort it in the following manner.

ORIGINAL DATA                 SORTED DATA
Rahul                         shekhar
rahul                         Sameer
RAHUL           =====         rahul
shekar          =====         Rahul
hans                          RAHul
kasper                        kasper
Sameer                        hans

This is a somewhat customized sort: I wanted to sort the names lexicographically first and then also take capitalization into account. Initially I tried the Sort API but was unsuccessful with it. I then tried a couple of approaches, explained below:

In the first solution, I emitted each name keyed by its starting character into a PTable, then collected all the values for each key. After that I took the values and used a Comparator to sort the data in each collection.

    PTable<String, String> classifiedData = count.parallelDo(
        new NamesClassification(),
        Writables.tableOf(Writables.strings(), Writables.strings()));
    PTable<String, Collection<String>> collectedValues = classifiedData.collectValues();
    PCollection<Collection<String>> names = collectedValues.values();
    PCollection<Collection<String>> sortedNames = names.parallelDo(
        "names Sorting", new NamesSorting(),
        Writables.collections(Writables.strings()));
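
The comparator used inside NamesSorting is not shown here. As a rough sketch, assuming the intended order is descending and case-insensitive, with names that differ only in case ordered by their raw character values, a plain Java comparator along the following lines would produce the order in the table above. NameOrdering and sortNames are hypothetical names used only for illustration; they are not part of Crunch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Crunch: one possible reading of the
// ordering described above.
final class NameOrdering {

    // Compare case-insensitively first; names that differ only in case are
    // then ordered by their raw character values. Reversing the whole chain
    // gives descending order, which puts lower case ahead of upper case.
    static final Comparator<String> DESCENDING_CASE_AWARE =
            String.CASE_INSENSITIVE_ORDER
                    .thenComparing(Comparator.<String>naturalOrder())
                    .reversed();

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortNames(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(DESCENDING_CASE_AWARE);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Rahul", "rahul", "RAHUL", "shekar", "hans", "kasper", "Sameer");
        System.out.println(sortNames(input));
    }
}
```

A comparator like this could be applied to each collected group; since every group holds names with the same starting character, the groups themselves would still need to be emitted in descending key order.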


I was not completely convinced by this approach, so I spent some more time on it and found another way of doing the same thing. In the second solution, I created my own writable type that implements WritableComparable. I also implemented the mapping functions for it, so that it can be used with Crunch WritableTypes.

    class NamesComparable implements WritableComparable<NamesComparable> { ...... }

    MapFn<String, NamesComparable> string_to_names = .........
    MapFn<NamesComparable, String> names_to_string = .........

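
The body of NamesComparable is elided above. Purely as an illustration of what such a key type might contain, here is a sketch under a hypothetical name, NameKey. To keep it compilable with the JDK alone, it implements plain Comparable and only mirrors the write/readFields method shapes of Hadoop's Writable; in a real job the class would be declared as implementing WritableComparable instead. The ascending compareTo is an assumption, chosen so that sorting with Order.DESCENDING would yield the order in the table.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. In a real pipeline it would implement
// org.apache.hadoop.io.WritableComparable<NameKey>; plain Comparable is
// used here so the sketch needs nothing beyond the JDK.
final class NameKey implements Comparable<NameKey> {
    private String name = "";

    NameKey() { }                          // no-arg constructor, as Writables require
    NameKey(String name) { this.name = name; }

    String get() { return name; }

    // Same shape as Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
    }

    // Same shape as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
    }

    // Ascending order: case-insensitive first, raw character order as tie-break.
    @Override
    public int compareTo(NameKey other) {
        int byLetters = name.compareToIgnoreCase(other.name);
        return byLetters != 0 ? byLetters : name.compareTo(other.name);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NameKey && name.equals(((NameKey) o).name);
    }

    @Override
    public int hashCode() { return name.hashCode(); }

    // Serializes and deserializes a key, roughly as a shuffle would.
    static NameKey roundTrip(NameKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            NameKey copy = new NameKey();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the ordering placed inside the key type, the sort itself needs no custom comparator; the cost is the extra conversion between String and the key type on the way in and out.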
Then I used this type when converting the data that was read, and sorted the result.

    PCollection<String> readLines = pipeline.readTextFile(fileLoc);
    PCollection<String> lines = readLines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input);
      }
    }, stringToNames());

    PCollection<String> sortedData = Sort.sort(lines, Order.DESCENDING);


I found both of these methods quite tricky; they feel like a roundabout way of doing it. Is there a better way of accomplishing the same thing? Have I missed something? If not, then I believe there is scope for a Sorting API that supports some customization.

regards
Rahul
