On Wed, Jun 18, 2014 at 9:10 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati <av...@gluster.org> wrote:
> > I see that this keying is an artifact of the sequencefile format
> > (reading more about it just now).
>
> I view it differently. Having to have ordinal keys on columns is an
> artifact of the sequence file format. Or Mahout legacy, whatever. Row keys
> are not constrained to anything. One could require int keys (and a lot of
> operations do).
>
> A sequence file indeed has two payload spots in a record, but it doesn't
> constrain you to not having keys, or to having 333 keys. The only
> essential function of a sequence file is sync-able splittability and
> payload compression abstraction. People use plain text files with
> MapReduce for the same reason, but those don't have a clear key-value
> structure.
>
> > As I'm reading, it also feels like sequencefile is really designed with
> > the map/reduce framework in mind,
>
> Again, not true; it is designed with data affinity in mind. Spark requires
> (or rather, benefits from) data affinity just as much as MapReduce, and so
> does Stratosphere, and, to a much smaller degree, HBase. Any parallel
> system that sends code to the data, and not the other way around, requires
> some notion of data partitioning, both in persistent state and in memory.
>
> It would seem you hold a lot of misconceptions about why and what exists
> in Hadoop (not that everything there exists for a good reason; and what
> does exist for a good reason usually could be vastly better).

I'm only learning about Hadoop now. I'm very new to it. I wouldn't be
surprised if I have misconceptions about a few things!
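To make Dmitriy's point about the format concrete, here is a minimal
sketch of writing and reading a sequence file with int row keys and Mahout
vectors as values, in the spirit of the DRM persistence convention he
alludes to. It assumes Hadoop 2.x and Mahout's math module on the
classpath; the object name and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, SequenceFile}
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.mahout.math.{DenseVector, VectorWritable}

    object SeqFileSketch extends App {
      val conf = new Configuration()
      val path = new Path("/tmp/drm-demo.seq") // hypothetical location

      // The only structural commitment is a key class and a value class:
      // the "two payload spots" of a record. Compression and the sync
      // markers that make the file splittable are handled by the format.
      val writer = SequenceFile.createWriter(
        conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(classOf[IntWritable]),
        SequenceFile.Writer.valueClass(classOf[VectorWritable]),
        SequenceFile.Writer.compression(CompressionType.BLOCK))

      for (row <- 0 until 3)
        writer.append(
          new IntWritable(row),
          new VectorWritable(new DenseVector(Array(row.toDouble, row * 2.0))))
      writer.close()

      // Reading back: the key slot carries the row key, nothing more.
      val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))
      val key = new IntWritable()
      val value = new VectorWritable()
      while (reader.next(key, value))
        println(s"row ${key.get}: ${value.get}")
      reader.close()
    }

Nothing in the format ties the key to an ordinal or to columns; any
Writable would do in either slot.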
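And a sketch of the data-affinity side in Spark, under the same
assumptions (plus Spark on the classpath): each sync-delimited split of
the file becomes one RDD partition, and the scheduler prefers to run each
task on a node holding that split's HDFS blocks, i.e. the code moves to
the data.

    import org.apache.hadoop.io.IntWritable
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object AffinitySketch extends App {
      // local[*] is only for trying this out; on a real cluster the
      // locality preference is what ships tasks to the data.
      val sc = new SparkContext(
        new SparkConf().setAppName("affinity-demo").setMaster("local[*]"))

      // One partition per input split; split boundaries are found via the
      // sequence file's sync markers.
      val drm = sc.sequenceFile(
        "/tmp/drm-demo.seq", // hypothetical path from the sketch above
        classOf[IntWritable], classOf[VectorWritable])

      println(s"partitions (~ input splits): ${drm.partitions.length}")

      // count() runs one task per split, each preferably on a node
      // holding that split's blocks.
      println(s"rows: ${drm.count()}")

      sc.stop()
    }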