On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati <av...@gluster.org> wrote:
> I see that this key'ing is an artifact of the sequencefile format (reading
> more about it just now).

I view it differently. Having to have ordinal keys on columns is an artifact of the SequenceFile format, or of Mahout legacy, whichever. Row keys are not constrained to anything; one could require int keys (and a lot of operations do). A SequenceFile record indeed has two payload slots, but it doesn't constrain you to not having keys, or to any particular key type. The only essential function of SequenceFile is sync-able splittability and a payload-compression abstraction. People use plain text files with MapReduce for the same reason, but those have no clear key-value structure.

> As I'm reading it also feels like sequencefile is
> really designed with the map/reduce framework in mind,

Again, not true: it is designed with data affinity in mind. Spark requires (or, rather, benefits from) data affinity just as much as MapReduce does, and so do Stratosphere and, to a much smaller degree, HBase. Any parallel system that sends code to the data, and not the other way around, needs some notion of data partitioning, both in persistent state and in memory.

It seems to me you hold a lot of misconceptions about why and what exists in Hadoop (not that everything there exists for a good reason; and what does exist for a good reason could usually be tons better).
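To make the data-affinity point concrete, here is a toy sketch (my own illustration, not Hadoop's or Spark's actual code) of the idea: the same deterministic key-to-partition rule is applied when laying out persistent state and when bucketing records in memory, so code shipped to a partition always finds all the data for its keys locally. The helper names are hypothetical.

```python
def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping (hypothetical helper)."""
    return hash(key) % num_partitions

def group_by_partition(records, num_partitions):
    """Bucket (key, value) records so per-partition code can run locally.

    Because partition_for is deterministic, every record with the same key
    lands in the same bucket -- which is exactly the property a
    send-code-to-the-data system relies on.
    """
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets

records = [(i, i * i) for i in range(10)]
buckets = group_by_partition(records, 4)
```

The same rule works whether the "partition" is an in-memory shard or a file split on disk; that shared notion of partitioning, rather than MapReduce specifically, is what the format serves.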