Re: H2O integration - intermediate progress update

Anand Avati Wed, 18 Jun 2014 20:14:28 -0700

I see that this keying is an artifact of the sequencefile format (I am
reading more about it just now). As I read, it also feels like sequencefile
is really designed with the map/reduce framework in mind and is well suited
to the mapper API. It also feels like, in the real world, data is
generated/available in different, "more natural" formats, and an ingestion
phase converts the more "natural" file into a sequencefile just for
mapreduce processing. Naive question: is it still relevant to support this
format, given the move away from MR within Mahout? Why design the core data
structure around a format from a framework we moved away from? Why not work
off just CSV files etc.? Also, if we did not have keys in DRM, most of the
code in the DSL would not need a type parameter, making it much simpler for
a first-timer to read.
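
A rough sketch of the key-free idea, in plain Scala rather than the Mahout
DSL. Every name here (KeylessDrmSketch, Rows, scaleRows, attachKeys) is
hypothetical and not a Mahout API. It also assumes the operator keeps rows in
their original order, so that keys stored separately can be paired back by
position afterwards.

```scala
// Hypothetical sketch only: none of these names exist in Mahout.
object KeylessDrmSketch {
  // A "pure numbers" matrix: rows of doubles, with no key type parameter.
  type Rows = Vector[Vector[Double]]

  // A row-wise operator that never looks at keys
  // (a stand-in for a mapBlock-style operation).
  def scaleRows(m: Rows, factor: Double): Rows =
    m.map(row => row.map(_ * factor))

  // Re-attach keys that were kept separately, pairing them by row position.
  // Only valid under the assumption that row order was preserved.
  def attachKeys[K](keys: Vector[K], m: Rows): Vector[(K, Vector[Double])] = {
    require(keys.length == m.length, "expected exactly one key per row")
    keys.zip(m)
  }

  def main(args: Array[String]): Unit = {
    val keys = Vector("alice", "bob")                 // the "key file"
    val numbers: Rows = Vector(Vector(1.0, 2.0),      // the "number file"
                               Vector(3.0, 4.0))
    val result = attachKeys(keys, scaleRows(numbers, 2.0))
    result.foreach { case (k, row) => println(s"$k -> ${row.mkString(",")}") }
  }
}
```

Only attachKeys carries a type parameter here; the numeric operator itself
needs none.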


thanks!

On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati <[email protected]> wrote:

> Would it not be possible (or even a good idea) to keep row keys completely
> separate from DRM, and let DRMs be pure nRow x nCol numbers? None of the
> operators (so far) care about the keys. At least none of the existing
> mapBlock() users do anything with the key. I'm not sure if we can do
> anything meaningful with the key in a mapBlock. It feels like they are tightly
> coupled when they need not be. I must admit I'm new to this, but it
> feels like - keys could be stored in a separate file, and matrix numbers in
> another. Mahout (should) only care about and operate on Matrix numbers,
> reads from the "number" file, writes output to a new "number" file, and the
> user can use the new number file with the old/original "key file" -
> effectively the same result as loading keys and moving them around through
> all the operations and writing back. Am I missing something fundamental?
>
> Thanks
>
>
> On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> Looking at the code, I am still not sure without trying.
>>
>> But I am more inclined to think now that this specific combination, A'B
>> with A and B having non-int row keys, is not supported.
>>
>> As a general principle, we followed where our guinea pigs led us and were
>> not trying to fill all possible gaps and holes, in the belief that this
>> would get us 80/20 caps in the shortest time.
>>
>> As for the rest, we wait for somebody to ask for it because they need it.
>>
>> But that example is legal, and a patch to handle this case should be
>> fundamentally possible and easy enough within this architecture.
>>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>
>> > Also, if something is not supported, such as your example (if it is
>> > indeed not supported), the optimizer would simply say so by rejecting
>> > it. But if it takes it in, then I am pretty sure it will do the right
>> > job (or at least there's a unit test for that case that is asserted on
>> > a trivial example).
>> >
>> > Here, by trivial I mean local pipelines for 2-split inputs; that's the
>> > general rule I used.
>> >
>> >
>> > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov <[email protected]>
>> > wrote:
>> >
>> >> A little additional information: for the rewriting-rules stage, the
>> >> optimizer does 3 passes over the semantic tree, each pass matching a tree
>> >> fragment using Scala case class matching and rewriting it. This allows us
>> >> to match and rewrite pretty elaborate tree-structure fragments, although
>> >> at the moment I don't think we dig farther than immediate children, and
>> >> perhaps some of their known attributes, in most cases.
>> >>
>> >> A more detailed description than that is, I think, only available by
>> >> reading the source.
>> >>
>> >>
>> >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov <[email protected]>
>> >> wrote:
>> >>
>> >>> E.g. I know for sure A %.% B is legal where A is string-keyed and B is
>> >>> int-keyed.
>> >>>
>> >>> This is kind of not the point. The point is that you can easily modify
>> >>> rewriting rules and operators to cover misses. (There shouldn't be many,
>> >>> since we've already written quite a few expressions out there.)
>> >>>
>> >>>
>> >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov <[email protected]>
>> >>> wrote:
>> >>>
>> >>>> I am not sure. There are more rewriting rules than I can remember, and
>> >>>> I did not write an algorithm (I think) that would involve this
>> >>>> combination. I guess the best thing is to try it in a shell or a unit
>> >>>> test. If it falls through, perhaps a new plan element needs to be added
>> >>>> (although I am not very sure there isn't one already). I know that there
>> >>>> are join-based multiplicative operators there.
>> >>>>
>> >>>>
>> >>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>> > In simple terms, if non-integer row keying is used anywhere, it tries
>> >>>>> > to rewrite pipelines so that product orientations never require
>> >>>>> > non-int keys to denote columns. In case the pipeline makes that
>> >>>>> > impossible, the optimizer will refuse to produce a plan.
>> >>>>> >
>> >>>>> > e.g. suppose A is distributed string-keyed.
>> >>>>> >
>> >>>>> > (A.t %.% A) collect  // ok
>> >>>>> >
>> >>>>>
>> >>>>> What happens with the important case of B.t %.% A where both A and B
>> >>>>> are string-keyed?
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
