I have a large-scale set-ordering task and am trying to determine whether Pig would be a good tool for it. I've read through the basic documentation and played with some simple examples, so I have a general idea of how Pig works.
My task is this: I have a set of ordered sets S = {s1, s2, ..., sn}. Each set in S has its elements drawn from some vocabulary V. For example:

    V  = {a, b, c, d}
    s1 = {d, b, c}
    s2 = {d, c}
    s3 = {d}
    s4 = {d, c, b, a}

I want to build a frequency table of how often each vocabulary element appears across the sets, and then reorder each set by increasing element frequency. In this example the frequency table is:

    a -> 1
    b -> 2
    c -> 3
    d -> 4

so the sets would be reordered like so:

    s1 = {b, c, d}
    s2 = {c, d}
    s3 = {d}
    s4 = {a, b, c, d}

Is there an easy way to do this in Pig? I can see how to build the frequency table, but I'm not sure how to use it to reorder the sets. In particular, I'm not sure how to fit variable-length ordered sets into the Pig framework, since its core concept appears to be a database row whose number of columns is known in advance.
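To pin down the result I'm after, here is a minimal in-memory sketch in plain Python (not Pig, and not something that would scale to my data). The function name `reorder_by_frequency` is just made up for illustration, and breaking frequency ties by element value is my own assumption, since the example above has no ties.

```python
from collections import Counter


def reorder_by_frequency(sets):
    """Return (frequency table, sets reordered by increasing element frequency)."""
    # Count every occurrence of each element across all sets. Elements are
    # unique within a set, so this equals the number of sets containing it.
    freq = Counter(element for s in sets for element in s)
    # Sort each set by (frequency, element). The second key component is an
    # assumed tie-break so that the output is deterministic.
    reordered = [sorted(s, key=lambda e: (freq[e], e)) for s in sets]
    return freq, reordered


# The example from the question, with each ordered set stored as a list.
sets = [
    ["d", "b", "c"],       # s1
    ["d", "c"],            # s2
    ["d"],                 # s3
    ["d", "c", "b", "a"],  # s4
]

freq, reordered = reorder_by_frequency(sets)
for name, s in zip(["s1", "s2", "s3", "s4"], reordered):
    print(name, s)
```

What I'm looking for is the Pig equivalent of these two steps: a global count over all elements, followed by a per-set sort keyed on that count.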