On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati <av...@gluster.org> wrote:
> I see that this key'ing is an artifact of the sequencefile format (reading
> more about it just now).

I view it differently. Having to have ordinal keys on columns is an artifact of the SequenceFile format, or of Mahout legacy, whichever. Row keys are not constrained to anything; one could require int keys (and a lot of operations do). A SequenceFile record indeed has two payload slots, but it doesn't constrain you to not having keys, or to any particular key type. The only essential function of SequenceFile is sync-able splittability and a payload-compression abstraction. People use plain text files with MapReduce for the same reason, but those have no clear key-value structure.

> As I'm reading it also feels like sequencefile is
> really designed with the map/reduce framework in mind,

Again, not true: it is designed with data affinity in mind. Spark requires (or, rather, benefits from) data affinity just as much as MapReduce does, and so do Stratosphere and, to a much smaller degree, HBase. Any parallel system that sends code to the data, and not the other way around, needs some notion of data partitioning, both in persistent state and in memory.

It seems to me you hold a lot of misconceptions about why and what exists in Hadoop (not that everything there exists for a good reason; and what does exist for a good reason could usually be tons better).
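To make the data-affinity point concrete, here is a toy sketch (my own illustration, not Hadoop's or Spark's actual code) of the idea: the same deterministic key-to-partition rule is applied when laying out persistent state and when bucketing records in memory, so code shipped to a partition always finds all the data for its keys locally. The helper names are hypothetical.

```python
def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping (hypothetical helper)."""
    return hash(key) % num_partitions

def group_by_partition(records, num_partitions):
    """Bucket (key, value) records so per-partition code can run locally.

    Because partition_for is deterministic, every record with the same key
    lands in the same bucket -- which is exactly the property a
    send-code-to-the-data system relies on.
    """
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets

records = [(i, i * i) for i in range(10)]
buckets = group_by_partition(records, 4)
```

The same rule works whether the "partition" is an in-memory shard or a file split on disk; that shared notion of partitioning, rather than MapReduce specifically, is what the format serves.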