On Wed, Jun 18, 2014 at 9:10 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati <av...@gluster.org> wrote:
> > I see that this keying is an artifact of the sequencefile format
> > (reading more about it just now).
>
> I view it differently. Having to have ordinal keys on columns is an
> artifact of the sequence file format. Or Mahout legacy, whatever. Row keys
> are not constrained to anything. One could require int keys (and a lot of
> operations do).
>
> A sequence file indeed has two payload spots in a record, but it doesn't
> constrain you to not having keys, or to having 333 keys. The only
> essential function of a sequence file is sync-able splittability and
> payload compression abstraction. People use plain text files with
> MapReduce for the same reason, but those don't have a clear key-value
> structure.
>
> > As I'm reading, it also feels like sequencefile is really designed with
> > the map/reduce framework in mind,
>
> Again, not true; it is designed with data affinity in mind. Spark requires
> (or rather, benefits from) data affinity just as much as MapReduce, and so
> does Stratosphere, and, to a much smaller degree, HBase. Any parallel
> system that sends code to the data, and not the other way around, requires
> some notion of data partitioning, both in persistent state and in memory.
>
> It would seem you hold a lot of misconceptions about why and what exists
> in Hadoop (not that everything there exists for a good reason; and what
> does exist for a good reason usually could be vastly better).

I'm only learning about Hadoop now. I'm very new to it. I wouldn't be
surprised if I have misconceptions about a few things!
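To make Dmitriy's point about the format concrete, here is a minimal
sketch of writing and reading a sequence file with int row keys and Mahout
vectors as values, in the spirit of the DRM persistence convention he
alludes to. It assumes Hadoop 2.x and Mahout's math module on the
classpath; the object name and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile}
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.mahout.math.{DenseVector, VectorWritable}

    object SeqFileSketch extends App {
      val conf = new Configuration()
      val path = new Path("/tmp/drm-demo.seq") // hypothetical location

      // The only structural commitment is a key class and a value class:
      // the "two payload spots" of a record. Compression and the sync
      // markers that make the file splittable are handled by the format.
      val writer = SequenceFile.createWriter(
        conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(classOf[IntWritable]),
        SequenceFile.Writer.valueClass(classOf[VectorWritable]),
        SequenceFile.Writer.compression(CompressionType.BLOCK))

      for (row <- 0 until 3)
        writer.append(
          new IntWritable(row),
          new VectorWritable(new DenseVector(Array(row.toDouble, row * 2.0))))
      writer.close()

      // Reading back: the key slot carries the row key, nothing more.
      val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))
      val key = new IntWritable()
      val value = new VectorWritable()
      while (reader.next(key, value))
        println(s"row ${key.get}: ${value.get}")
      reader.close()
    }

Nothing in the format ties the key to an ordinal or to columns; any
Writable would do in either slot.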
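And a sketch of the data-affinity side in Spark, under the same
assumptions (plus Spark on the classpath): each sync-delimited split of
the file becomes one RDD partition, and the scheduler prefers to run each
task on a node holding that split's HDFS blocks, i.e. the code moves to
the data.

    import org.apache.hadoop.io.IntWritable
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object AffinitySketch extends App {
      // local[*] is only for trying this out; on a real cluster the
      // locality preference is what ships tasks to the data.
      val sc = new SparkContext(
        new SparkConf().setAppName("affinity-demo").setMaster("local[*]"))

      // One partition per input split; split boundaries are found via the
      // sequence file's sync markers.
      val drm = sc.sequenceFile(
        "/tmp/drm-demo.seq", // hypothetical path from the sketch above
        classOf[IntWritable], classOf[VectorWritable])

      println(s"partitions (~ input splits): ${drm.partitions.length}")

      // count() runs one task per split, each preferably on a node
      // holding that split's blocks.
      println(s"rows: ${drm.count()}")

      sc.stop()
    }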