Re: H2O integration - intermediate progress update

2014-06-19 Thread Ted Dunning
On Thu, Jun 19, 2014 at 9:36 AM, Dmitriy Lyubimov  wrote:

> i have an impression that mr. Dunning has masterfully concealed a very
> targeted insult in his carefully worded statement with the sole purpose of
> forcing certain participants to go into defensive and to turn a technical
> discussion into trading insults, in which he has obviously partially
> succeeded.
>
> I have an impression this has not been an isolated incident on mr.
> Dunning's part in the past, and i have strong suspicion that it was the
> wrong balance of technical merit and posturing in the project that drove
> more than one accomplished committer or candidate out in the past.
>
> I also have been receiving an impression that I am the next such target on
> mr. Dunning's part just because my arguments are not technically favorable
> where he needs them to be favorable for whatever other-than-technical
> reason. I love the code in the project, that's in part why i am candid in
> its discussions, but it is repeated argumentum ad hominem from mr. Dunning
> that is very close to driving me out. And I don't think beers can smooth
> that.
>

I can say that my only intent was to try to help get the tone on the
mailing list back to a more gentle and encouraging path.  I did not intend
any insult and purposely tried to refer to all of us together as having the
problem.

In general, the only target I have is building up the Mahout community.

I don't want to encourage a negative thread to continue very far, but I do
feel that there is a difference between technical discussions about
technical merit, technical discussions that descend into personal attacks
and discussions about the form, tone and manner of discussions.

Only the second is, in my opinion, ad hominem.  I think we all agree that
it is a bad thing.

The first form of discussion is what we should mostly have, but
occasionally there needs to be a bit of discussion of the third kind.  In
particular, occasional feedback such as the impassioned comment by Yash
just now that things are not working right can be very, very helpful.
 Whenever anybody gives this feedback it is important to step back a bit
and think about what it means.  For instance, even though I disagree with
Dmitriy's assessment of my motives, I am going to think carefully about how
to improve his impressions.

This third kind of discussion can be delicate and difficult.  It can be
distasteful to have in public, but I think that we all owe it to the
project to try to make things work better if we possibly can.


RE: H2O integration - intermediate progress update

2014-06-19 Thread Saikat Kanjilal
I would agree, and I second not wanting to hijack the discussion. I'm not a
newbie to Mahout, but to be frank I've seen this tone from committers when
evaluating or describing ideas, or when judging new work that someone wants to
contribute. I would also add that code commits can and should come in from
anyone and be judged fairly, without immediate and early dismissal of ideas.
Frankly, I'm interested in committing just for the purposes of learning, and the
general tone of this discussion is not encouraging to folks interested in
shaping/using/adding to Mahout for the future.

My 2 cents. 

> From: yash...@gmail.com
> Date: Thu, 19 Jun 2014 22:39:49 +0530
> Subject: Re: H2O integration - intermediate progress update
> To: dev@mahout.apache.org
> 
> Hi All,
> Sorry to hijack the thread.
> I am a newbie in mahout community - please pardon my words if anyone finds
> them unsuitable.
> 
> Its really strange to see such heated discussions between the Big Shots on
> the mailing lists.
> I am absolute beginner in this space and it does not leave a very good
> impression about the open source community itself.
> 
> What I believe is - Open Source is about love of code and all awesome
> coders coming together - designing some of the coolest code projects on the
> planet as one *strong team*. Lets not break this belief of newbies. We are
> learning from you guys.
> 
> The avengers should not fight.
> 
> Best Regards,
> Yash

Re: H2O integration - intermediate progress update

2014-06-19 Thread Ted Dunning
On Thu, Jun 19, 2014 at 10:09 AM, Yash Sharma  wrote:

> What I believe is - Open Source is about love of code and all awesome
> coders coming together - designing some of the coolest code projects on the
> planet as one *strong team*. Lets not break this belief of newbies. We are
> learning from you guys.
>
> The avengers should not fight.
>

Thanks Yash.

Sounds like you have a very good start here.  Community building is very
important and your calming words are a good way to encourage it.


Re: H2O integration - intermediate progress update

2014-06-19 Thread Yash Sharma
Hi All,
Sorry to hijack the thread.
I am a newbie in mahout community - please pardon my words if anyone finds
them unsuitable.

It's really strange to see such heated discussions between the Big Shots on
the mailing lists.
I am an absolute beginner in this space and it does not leave a very good
impression about the open source community itself.

What I believe is - Open Source is about love of code and all awesome
coders coming together - designing some of the coolest code projects on the
planet as one *strong team*. Let's not break this belief of newbies. We are
learning from you guys.

The avengers should not fight.

Best Regards,
Yash




On Thu, Jun 19, 2014 at 10:06 PM, Dmitriy Lyubimov 
wrote:

> Well, let me tell my impression.
>
> Remember, we started  talking impressions here all over the place, not
> facts. So *don't ask me to prove anything.*
>
> i have an impression that mr. Dunning has masterfully concealed a very
> targeted insult in his carefully worded statement with the sole purpose of
> forcing certain participants to go into defensive and to turn a technical
> discussion into trading insults, in which he has obviously partially
> succeeded.
>
> I have an impression this has not been an isolated incident on mr.
> Dunning's part in the past, and i have strong suspicion that it was the
> wrong balance of technical merit and posturing in the project that drove
> more than one accomplished committer or candidate out in the past.
>
> I also have been receiving an impression that I am the next such target on
> mr. Dunning's part just because my arguments are not technically favorable
> where he needs them to be favorable for whatever other-than-technical
> reason. I love the code in the project, that's in part why i am candid in
> its discussions, but it is repeated argumentum ad hominem from mr. Dunning
> that is very close to driving me out. And I don't think beers can smooth
> that.
>
> As for welcoming, well, h2o  is not exactly new topic here. I also think we
> need to have some bar for proposals to meet regardless of being welcoming.
>
> Finally, I have an impression everybody has areas where they possess  less
> than brilliant expertise; i actually like to say about myself that "it
> pains me how little i know". I have no problem identifying areas of
> weaknesses in myself publicly and don't consider this to be offensive,
> since i know that the only way to improve  knowledge is to first know where
> it is lacking. I am very perceptive to strong logical argument regardless
> if it fits my current world view or not. But I am particularly not fond of
> rhetorical fallacies, informal ones in particular. I am not very fond of
> marketing bluff or empty PR.  It is a personal choice whether you accept
> that mindset or not, but grading areas of weakness is not an insult. That's
> what they do in universities all the time, after all.
>


Re: H2O integration - intermediate progress update

2014-06-19 Thread Dmitriy Lyubimov
Well, let me tell my impression.

Remember, we started  talking impressions here all over the place, not
facts. So *don't ask me to prove anything.*

i have an impression that mr. Dunning has masterfully concealed a very
targeted insult in his carefully worded statement with the sole purpose of
forcing certain participants to go into defensive and to turn a technical
discussion into trading insults, in which he has obviously partially
succeeded.

I have an impression this has not been an isolated incident on mr.
Dunning's part in the past, and i have strong suspicion that it was the
wrong balance of technical merit and posturing in the project that drove
more than one accomplished committer or candidate out in the past.

I also have been receiving an impression that I am the next such target on
mr. Dunning's part just because my arguments are not technically favorable
where he needs them to be favorable for whatever other-than-technical
reason. I love the code in the project, that's in part why i am candid in
its discussions, but it is repeated argumentum ad hominem from mr. Dunning
that is very close to driving me out. And I don't think beers can smooth
that.

As for welcoming, well, h2o  is not exactly new topic here. I also think we
need to have some bar for proposals to meet regardless of being welcoming.

Finally, I have an impression everybody has areas where they possess  less
than brilliant expertise; i actually like to say about myself that "it
pains me how little i know". I have no problem identifying areas of
weaknesses in myself publicly and don't consider this to be offensive,
since i know that the only way to improve  knowledge is to first know where
it is lacking. I am very perceptive to strong logical argument regardless
if it fits my current world view or not. But I am particularly not fond of
rhetorical fallacies, informal ones in particular. I am not very fond of
marketing bluff or empty PR.  It is a personal choice whether you accept
that mindset or not, but grading areas of weakness is not an insult. That's
what they do in universities all the time, after all.


Re: H2O integration - intermediate progress update

2014-06-19 Thread Sebastian Schelter
I share the impression that the tone of conversation has not been very 
welcoming lately, be it intentional or not. I think that we should 
remind ourselves why we are working on open source and try to improve 
our ways of communication.


I think we should try to get as many people as possible together to sit
at a table and have a face-to-face discussion over a beer or coffee.


--sebastian

On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote:
> On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning  wrote:
>
> > On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov
> > wrote:
> >
> > > I did not mean to discourage
> > > sincere search for answers.
> > >
> >
> > The tone of answers has lately been very discouraging for those sincerely
> > searching for answers.  I think we as a community have a responsibility to
> > do better about this.  There is no need to be insulting to people asking
> > honest questions in a civil tone.
>
> Ted, we've been at this already. There have been more arguments than
> questions. I am just providing my counter arguments. Do you insist on terms
> "insulting"? Cause this, you know, insulting. You are heading ad hominem
> direction again.





Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning  wrote:

> On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov 
> wrote:
>
> > I did not mean to discourage
> > sincere search for answers.
> >
>
> The tone of answers has lately been very discouraging for those sincerely
> searching for answers.  I think we as a community have a responsibility to
> do better about this.  There is no need to be insulting to people asking
> honest questions in a civil tone.


Ted, we've been at this already. There have been more arguments than
questions. I am just providing my counter arguments. Do you insist on the term
"insulting"? 'Cause this is, you know, insulting. You are heading in the ad
hominem direction again.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov  wrote:

> I did not mean to discourage
> sincere search for answers.
>

The tone of answers has lately been very discouraging for those sincerely
searching for answers.  I think we as a community have a responsibility to
do better about this.  There is no need to be insulting to people asking
honest questions in a civil tone.  They may or may not be well informed.
 They may or may not have some reason for asking.  And they may well be
pointing out something in our blind spot that we have retained just because
it was that way before.

I congratulate Anand for sticking with it and I strongly appreciate his
questions.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov  wrote:

>
>
>
>  BTW Spark rdd type is RDD[K:ClassTag] as well. nobody yet complained. not
> a single time. For all their list activity.
>
>
>>
Actually it is even "scarier" in Spark. Consider this type system:

To enable groupBy, for example, RDD needs to match
RDD[(K:ClassTag,V:ClassTag)].

To enable sort, RDD needs to match RDD[(K<%Comparable:ClassTag,
V:ClassTag)].

And to enable persisting something to a sequence file, it has to match
RDD[(K<%WritableComparable:ClassTag,V<%Writable :ClassTag)].

And probably even something else i don't immediately remember.

Compared to these, we are just simplicity itself.

Still, nobody yet thought of complaining about those.
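
For readers unfamiliar with how such bounds gate operations, here is a minimal sketch. It assumes a toy `Coll` class standing in for a distributed collection; it is not Spark's RDD, and the names `Coll`, `PairOps` and `OrderedPairOps` are made up for illustration. It also uses context bounds rather than the `<%` view bounds written above. The idea it shows is the same one described in this message: an operation only exists when the element type matches a particular shape.

```scala
import scala.reflect.ClassTag

object BoundsSketch {
  // Toy stand-in for a distributed collection; NOT Spark's RDD.
  final class Coll[T: ClassTag](val items: Seq[T]) {
    // Building an Array[T] is what needs the ClassTag.
    def toArray: Array[T] = items.toArray
  }

  // Pair-only operations: available only when T is a (K, V) tuple.
  implicit class PairOps[K: ClassTag, V: ClassTag](c: Coll[(K, V)]) {
    def groupByKey: Map[K, Seq[V]] =
      c.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  }

  // Sorting additionally demands an Ordering for the key type.
  implicit class OrderedPairOps[K: Ordering: ClassTag, V: ClassTag](c: Coll[(K, V)]) {
    def sortByKey: Seq[(K, V)] = c.items.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val pairs = new Coll(Seq("b" -> 2, "a" -> 1, "b" -> 3))
    println(pairs.groupByKey)
    println(pairs.sortByKey)
    // new Coll(Seq(1, 2, 3)).groupByKey  // would not compile: elements are not pairs
  }
}
```

The user of `Coll` never spells out any of these bounds; they only constrain which methods the compiler offers.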


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 9:32 PM, Anand Avati  wrote:

> On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov 
> wrote:
>
> > On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati  wrote:
> >
> > >  Also, if we did not have Keys in DRM, most of
> > > the code in the DSL need not have a type parameter, making it so much
> > > simpler for a first timer to read..
> > >
> >
> > This is also something i absolutely not sure where it is coming from.
> >
> > Let's see:
> >
> > R expression | Mahout expression
> >
> > A %*% B      | A %*% B
> > A[, 5]       | A(::,5)
> > cbind(A,B)   | A cbind B
> > A * B        | A * B
> > 1 / x        | 1 /: x
> > t(A)         | A.t
> > norm(A)      | A.norm
> > colSums(A)   | A.colSums
> >
> > Where is the "struggle" here ?
> >
>
> Not in this at all, but all over the place in sparkbindings (the backend of
> the DSL).
>

User doesn't write spark bindings. Users write scripts. I.e. exactly what
i've shown.

And we (I am confident) are ok with some generics passed around in Mahout's
guts. We probably should expect to be ok with much bigger complexity than
this, in fact.

 BTW Spark rdd type is RDD[K:ClassTag] as well. nobody yet complained. not
a single time. For all their list activity.


>
>
> > I suspect the real reason for all these questions is not architectural,
> but
> > rather simplification of H20 bindings.
> >
> > That is, probably, a really worthy question: are we ready to screw legacy
> > algorithm compatibility and existing bindings' merits just to make h2o
> > integration easier? This is a good question, but i am far from sure i
> would
> > vote "yes" here.
> >
>
> Well, sure. I would like to simplify H2O bindings to the extent I can (or
> simplify any task I do in any project). I expect not all questions might
> make sense for those who have a bigger context, but I still ask without
> hesitation.
>

Ok. I apologize. i just assumed you were at different level of familiarity
with both Mahout and distributed stacks. I did not mean to discourage
sincere search for answers.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov  wrote:

> On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati  wrote:
>
> >  Also, if we did not have Keys in DRM, most of
> > the code in the DSL need not have a type parameter, making it so much
> > simpler for a first timer to read..
> >
>
> This is also something i absolutely not sure where it is coming from.
>
> Let's see:
>
> R expression | Mahout expression
>
> A %*% B      | A %*% B
> A[, 5]       | A(::,5)
> cbind(A,B)   | A cbind B
> A * B        | A * B
> 1 / x        | 1 /: x
> t(A)         | A.t
> norm(A)      | A.norm
> colSums(A)   | A.colSums
>
> Where is the "struggle" here ?
>

Not in this at all, but all over the place in sparkbindings (the backend of
the DSL).



> I suspect the real reason for all these questions is not architectural, but
> rather simplification of H20 bindings.
>
> That is, probably, a really worthy question: are we ready to screw legacy
> algorithm compatibility and existing bindings' merits just to make h2o
> integration easier? This is a good question, but i am far from sure i would
> vote "yes" here.
>

Well, sure. I would like to simplify H2O bindings to the extent I can (or
simplify any task I do in any project). I expect not all questions might
make sense for those who have a bigger context, but I still ask without
hesitation.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
On Wed, Jun 18, 2014 at 9:10 PM, Dmitriy Lyubimov  wrote:

> On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati  wrote:
>
> > I see that this key'ing is an artifact of the sequencefile format
> (reading
> > more about it just now).
>
>
> I view it differently. Having to have ordinal keys on columns is an
> artifact of sequence file format. Or Mahout legacy, whatever. row keys are
> not constrained to anything. One could require int keys (and a lot of
> operations do).
>
> Sequence file indeed has two payload spots in a record, but it doesn't
> constrain you to not having keys, or having 333 keys.  The only essential
> function of sequence file is sync-able splittability and payload
> compression abstraction. People use plain text files with mapreduce for the
> same reason, but they don't have clear key-value structure.
>
>
>
>
> > As I'm reading it also feels like sequencefile is
> > really designed with the map/reduce framework in mind,
>
>
> again, not true, it is designed with data affinity in mind. Spark requires
> (or, rather, benefits from) data affinity just as much as map reduce, and
> so does Stratoshpere, and, to much smaller degree, HBase. Any parallel
> system that sends code to the data, and not the other way around, would
> require some notion of data partitioning, both in persistent state and
> in-memory.
>
> It would seem to me you hold a lot of misconceptions about why and what
> exists in Hadoop (not that all that exists there, exists for a good reason
> though; and what exists for a good reason, usually could be tons times
> better).
>

I'm only learning about Hadoop now. I'm very new to it. Wouldn't be
surprised if I have misconceptions of a few things!


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati  wrote:

>  Also, if we did not have Keys in DRM, most of
> the code in the DSL need not have a type parameter, making it so much
> simpler for a first timer to read..
>

This is also something i am absolutely not sure where it is coming from.

Let's see:

R expression | Mahout expression

A %*% B      | A %*% B
A[, 5]       | A(::,5)
cbind(A,B)   | A cbind B
A * B        | A * B
1 / x        | 1 /: x
t(A)         | A.t
norm(A)      | A.norm
colSums(A)   | A.colSums

Where is the "struggle" here ?
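
As a rough illustration of why the script-level syntax can stay this small, here is a sketch of how R-like operators can be declared as ordinary Scala methods. `Mat` is a hypothetical toy dense matrix written for this example; it is not Mahout's Matrix or DRM type and makes no claim about how the real DSL is implemented.

```scala
object DslSketch {
  // Toy dense matrix, for illustration only.
  final case class Mat(rows: Vector[Vector[Double]]) {
    def nrow: Int = rows.length
    def ncol: Int = if (rows.isEmpty) 0 else rows.head.length

    // Transpose, written A.t in a script.
    def t: Mat = Mat(Vector.tabulate(ncol, nrow)((c, r) => rows(r)(c)))

    // Matrix product, written A %*% B.
    def %*%(that: Mat): Mat = {
      require(ncol == that.nrow, "inner dimensions must agree")
      Mat(Vector.tabulate(nrow, that.ncol) { (i, j) =>
        (0 until ncol).map(k => rows(i)(k) * that.rows(k)(j)).sum
      })
    }

    // Elementwise product, written A * B.
    def *(that: Mat): Mat = {
      require(nrow == that.nrow && ncol == that.ncol, "shapes must agree")
      Mat(rows.zip(that.rows).map { case (a, b) => a.zip(b).map { case (x, y) => x * y } })
    }

    // A method name ending in ':' is right-associative,
    // so `1.0 /: m` is parsed as `m./:(1.0)`.
    def /:(x: Double): Mat = Mat(rows.map(_.map(x / _)))

    def colSums: Vector[Double] = rows.transpose.map(_.sum)
  }
}
```

With this in scope, `a.t`, `a %*% a`, `1.0 /: a` and `a.colSums` all read much like the right-hand column of the table, and none of them mentions a type parameter.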

I suspect the real reason for all these questions is not architectural, but
rather simplification of H2O bindings.

That is, probably, a really worthy question: are we ready to screw legacy
algorithm compatibility and existing bindings' merits just to make h2o
integration easier? This is a good question, but i am far from sure i would
vote "yes" here.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati  wrote:

> I see that this key'ing is an artifact of the sequencefile format (reading
> more about it just now).


I view it differently. Having to have ordinal keys on columns is an
artifact of sequence file format. Or Mahout legacy, whatever. row keys are
not constrained to anything. One could require int keys (and a lot of
operations do).

Sequence file indeed has two payload spots in a record, but it doesn't
constrain you to not having keys, or having 333 keys.  The only essential
function of sequence file is sync-able splittability and payload
compression abstraction. People use plain text files with mapreduce for the
same reason, but they don't have clear key-value structure.




> As I'm reading it also feels like sequencefile is
> really designed with the map/reduce framework in mind,


again, not true, it is designed with data affinity in mind. Spark requires
(or, rather, benefits from) data affinity just as much as map reduce, and
so does Stratosphere, and, to a much smaller degree, HBase. Any parallel
system that sends code to the data, and not the other way around, would
require some notion of data partitioning, both in persistent state and
in-memory.
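
A minimal sketch of that notion, under the assumption of plain in-memory collections and hypothetical helper names (`partition`, `runPerPartition`); no real framework works exactly like this. Records are split by key hash, and a function is then run once per partition so that only small results need to move.

```scala
object PartitionSketch {
  // Assign each keyed record to one of n partitions by key hash.
  def partition[K, V](records: Seq[(K, V)], n: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => Math.floorMod(k.##, n) }

  // "Send code to the data": run f once per partition and keep
  // only the (small) per-partition results.
  def runPerPartition[K, V, R](parts: Map[Int, Seq[(K, V)]])(f: Seq[(K, V)] => R): Map[Int, R] =
    parts.map { case (p, recs) => p -> f(recs) }
}
```

For example, `runPerPartition(partition(records, 2))(_.size)` counts records per partition without gathering the records themselves in one place.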

It would seem to me you hold a lot of misconceptions about why and what
exists in Hadoop (not that all that exists there exists for a good reason,
though; and what exists for a good reason usually could be many times
better).


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati  wrote:

> Would it not be possible (or even a good idea) to keep row keys completely
> separate from DRM, and let DRMs be pure nRow x nCol numbers?


Consider that this comes only at the cost of breaking compatibility with all
the MR stuff that's been done in Mahout since 2008. Not an option.
But even supposing legacy was not a problem, I see significant benefits in
allowing non-ordinal keys.

One thing, data almost never comes out of ETL pipelines with
ordinality-enforced keys. Normalizing ordinality would be a pain. There's
a normalization issue for dense data, and there's a uniqueness requirement for
sparse data (in which case it really is no different from any key with only
requirements for hash/equals contracts).

Second, having to map to integral keys creates problems with relating the
stuff back to its origins and maintaining those relations.

Given it's already there, being in the position of an architect, I'd never
give it back.
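
To make the two options concrete, here is a small sketch. It assumes a toy row-keyed matrix represented as a `Seq[(K, Row)]`; the names `mapRows` and `toOrdinal` are hypothetical and are not Mahout API. The first function carries arbitrary keys through a row-wise operation; the second shows the extra dictionary that has to be built and kept around once keys are forced to be ordinal integers.

```scala
object KeySketch {
  type Row = Vector[Double]

  // Rows keyed by an arbitrary type K: a row-wise map carries
  // the keys along untouched.
  def mapRows[K](m: Seq[(K, Row)])(f: Row => Row): Seq[(K, Row)] =
    m.map { case (k, row) => k -> f(row) }

  // The alternative: force ordinal Int keys and keep a dictionary
  // on the side to relate results back to their origin.
  def toOrdinal[K](m: Seq[(K, Row)]): (Seq[(Int, Row)], Map[Int, K]) = {
    val indexed = m.zipWithIndex
    val rows = indexed.map { case ((_, row), i) => i -> row }
    val dict = indexed.map { case ((k, _), i) => i -> k }.toMap
    (rows, dict)
  }
}
```

With `toOrdinal`, every downstream consumer has to have the dictionary at hand to recover, say, a document id from a row index, which is the bookkeeping burden described above.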




> None of the
> operators (so far) care about the keys.


Simply not true. LSA does, clustering does, and so do about a dozen other
cases in and outside Mahout. Assuming we are still to support algorithms we
have not deprecated to date.


> At least none of the existing
> mapBlock() users do anything with the key.


not true. Not all examples in Mahout, but not true.


> I'm not sure if we can do
> anything meaningful with the key in a mapBlock.


You not being sure is not a sufficient condition. The sufficient condition is
that everyone has to be sure of the contrary. It is always hard to argue
non-existence of a counter example from positions of probabilities or
intuition.


> It feels they are tightly
> coupled while they need not have been. I must admit I'm new to this, but it
> feels like - keys could be stored in a separate file, and matrix numbers in
> another. Mahout (should) only care about and operate on Matrix numbers,
> reads from the "number" file, writes output to a new "number" file, and the
> user can use the new number file with the old/original "key file" -
> effectively the same result as loading keys and moving them around through
> all the operations and writing back. Am I missing something fundamental?
>

All i said. legacy, ordinality enforcement etc. etc.


>
> Thanks
>
>
> On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov 
> wrote:
>
> > Looking at the code, i am still not sure without trying.
> >
> > but i am more inclined to think now that this specific combination, A'B
> > with A and B non-int row keys, is not supported.
> >
> > As a general principle, we followed where our guinea pigs get us, and
> were
> > not trying to fill all possible gaps and holes, with the belief that will
> > get us 80/20 caps in shortest time.
> >
> > As for the rest, we wait for somebody to ask for it because they need it.
> >
> > But that example is legal and patch should be fundamentally possible and
> > easy enough to handle this case within this architecture.
> >
> >
> >
> >
> > On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov 
> > wrote:
> >
> > > also, if something is not supported, such as your example, (if it is
> not
> > > supported), optimizer would simply state so with rejection. But if it
> > takes
> > > it in, then I am pretty sure it will do the right job (or at least
> > there's
> > > a unit test for that case that is asserted on a trivial example).
> > >
> > > Here, by trivial i mean local pipelines for 2-split inputs, that's the
> > > general rule i used.
> > >
> > >
> > > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov 
> > > wrote:
> > >
> > >> a little bit of additional information is that for rewriting rules
> stage
> > >> optimizer does 3 passes over semantic tree, each pass matching a tree
> > >> fragment using Scala case class matching and rewriting. This allows to
> > >> match and rewrite pretty elaborate tree structure fragments, although
> at
> > >> the moment i don't think we dig farther than immediate children, and
> > >> perhaps some their known attributes, in most cases.
> > >>
> > >> More detailed description that that i think is only in reading the
> > source.
> > >>
> > >>
> > >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov 
> > >> wrote:
> > >>
> > >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b
> is
> > >>> int-keyed.
> > >>>
> > >>> This is kind of not the point. the point is that you can easily
> modify
> > >>> rewriting rules and operators to cover misses. (there shouldn't be
> > many,
> > >>> since we've already written quite a bit of expressions out there).
> > >>>
> > >>>
> > >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov
> > >>> wrote:
> > >>>
> > >>>> I am not sure. There are more rewriting rules than i can remember, and
> > >>>> i did not write an algorithm ( i think) that would involve this
> > >>>> combination. I guess the best thing is to try in a shell or a unit test. if
> > >>>> it falls thru, perhaps a new plan element needs to be added (although I am
> > >>>> not v
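
The quoted description of the optimizer (passes over a semantic tree, each matching a tree fragment with Scala case class matching and rewriting it) can be sketched as follows. This is an assumption-laden toy: the node types `Leaf`, `Transpose`, `Times` and the fused `AtB` operator are hypothetical names chosen for this example, and the real rule set lives only in the source, as noted in the quoted text.

```scala
object RewriteSketch {
  sealed trait Expr
  final case class Leaf(name: String) extends Expr
  final case class Transpose(a: Expr) extends Expr
  final case class Times(a: Expr, b: Expr) extends Expr
  // Hypothetical fused operator: computes A' * B without forming A'.
  final case class AtB(a: Expr, b: Expr) extends Expr

  // One top-down pass: match the shape of this node and its
  // immediate children, then recurse into the operands.
  def rewrite(e: Expr): Expr = e match {
    case Transpose(Transpose(a)) => rewrite(a)                 // (A')' => A
    case Times(Transpose(a), b)  => AtB(rewrite(a), rewrite(b)) // A' * B => AtB
    case Times(a, b)             => Times(rewrite(a), rewrite(b))
    case Transpose(a)            => Transpose(rewrite(a))
    case AtB(a, b)               => AtB(rewrite(a), rewrite(b))
    case leaf: Leaf              => leaf
  }
}
```

Adding coverage for a missed combination then amounts to adding another `case` line, which is the kind of easy patch the thread refers to.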

Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
I see that this key'ing is an artifact of the sequencefile format (reading
more about it just now). As I'm reading, it also feels like sequencefile is
really designed with the map/reduce framework in mind, suited well for the
mapper API. It also feels like, in the real world, data is
generated/available in different and "more natural" formats, and an
ingestion phase converts the more "natural" file into a sequencefile just
for mapreduce processing. Naive question - Is it still relevant to support
this format, given the move away from MR within Mahout? Why design the core
data structure around a format from the framework we moved away from? Why not
work off just CSV files etc.? Also, if we did not have Keys in DRM, most of
the code in the DSL need not have a type parameter, making it so much
simpler for a first timer to read.

thanks!

On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati  wrote:

> Would it not be possible (or even a good idea) to keep row keys completely
> separate from DRM, and let DRMs be pure nRow x nCol numbers? None of the
> operators (so far) care about the keys. At least none of the existing
> mapBlock() users do anything with the key. I'm not sure if we can do
> anything meaningful with the key in a mapBlock. It feels they are tightly
> coupled while they need not have been. I must admit I'm new to this, but it
> feels like - keys could be stored in a separate file, and matrix numbers in
> another. Mahout (should) only care about and operate on Matrix numbers,
> reads from the "number" file, writes output to a new "number" file, and the
> user can use the new number file with the old/original "key file" -
> effectively the same result as loading keys and moving them around through
> all the operations and writing back. Am I missing something fundamental?
>
> Thanks
>
>
> On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov 
> wrote:
>
>> Looking at the code, i am still not sure without trying.
>>
>> but i am more inclined to think now that this specific combination, A'B
>> with A and B non-int row keys, is not supported.
>>
>> As a general principle, we followed where our guinea pigs get us, and were
>> not trying to fill all possible gaps and holes, with the belief that will
>> get us 80/20 caps in shortest time.
>>
>> As for the rest, we wait for somebody to ask for it because they need it.
>>
>> But that example is legal and patch should be fundamentally possible and
>> easy enough to handle this case within this architecture.
>>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov 
>> wrote:
>>
>> > also, if something is not supported, such as your example, (if it is not
>> > supported), optimizer would simply state so with rejection. But if it
>> takes
>> > it in, then I am pretty sure it will do the right job (or at least
>> there's
>> > a unit test for that case that is asserted on a trivial example).
>> >
>> > Here, by trivial i mean local pipelines for 2-split inputs, that's the
>> > general rule i used.
>> >
>> >
>> > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov 
>> > wrote:
>> >
>> >> a little bit of additional information is that for rewriting rules
>> stage
>> >> optimizer does 3 passes over semantic tree, each pass matching a tree
>> >> fragment using Scala case class matching and rewriting. This allows to
>> >> match and rewrite pretty elaborate tree structure fragments, although
>> at
>> >> the moment i don't think we dig farther than immediate children, and
>> >> perhaps some of their known attributes, in most cases.
>> >>
>> >> More detailed description than that i think is only in reading the source.
>> >>
>> >>
>> >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov 
>> >> wrote:
>> >>
>> >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
>> >>> int-keyed.
>> >>>
>> >>> This is kind of not the point. the point is that you can easily modify
>> >>> rewriting rules and operators to cover misses. (there shouldn't be
>> many,
>> >>> since we've already written quite a bit of expressions out there).
>> >>>
>> >>>
>> >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov 
>> >>> wrote:
>> >>>
>> >>>> I am not sure. There are more rewriting rules than i can remember, and
>> >>>> i did not write an algorithm ( i think) that would involve this
>> >>>> combination. I guess the best thing is to try in a shell or a unit test. if
>> >>>> it falls thru, perhaps a new plan element needs to be added (although I am
>> >>>> not very sure there isn't already). I know that there are join-based
>> >>>> multiplicative operators there.
>> >>>>
>> >>>>
>> >>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning
>> >>>> wrote:
>> >>>>
>> >>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov
>> >>>>> wrote:
>> >>>>>
>> >>>>> > in simple terms, if non-integer row keying is used anywhere, it tries to
>> >>>>> > rewrite pipelines so that product orientations never require non-int keys
>> >>>>> > to denote columns. In ca

Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
Would it not be possible (or even a good idea) to keep row keys completely
separate from DRM, and let DRMs be pure nRow x nCol numbers? None of the
operators (so far) care about the keys. At least none of the existing
mapBlock() users do anything with the key. I'm not sure if we can do
anything meaningful with the key in a mapBlock. It feels like they are tightly
coupled when they need not be. I must admit I'm new to this, but it
feels like - keys could be stored in a separate file, and matrix numbers in
another. Mahout (should) only care about and operate on Matrix numbers,
reads from the "number" file, writes output to a new "number" file, and the
user can use the new number file with the old/original "key file" -
effectively the same result as loading keys and moving them around through
all the operations and writing back. Am I missing something fundamental?
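
The proposal above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: `Keyed`, `split`, `scaleRows` and `reattach` are hypothetical names, not Mahout or H2O APIs, and the sketch only covers operations that preserve row order and row count. Operations that change what a row means (a transpose, or a product such as A'B) are exactly where re-attaching keys by position stops being valid.

```scala
object SeparateKeys {
  // Hypothetical keyed matrix: one String key per row plus the numeric rows.
  final case class Keyed(keys: Vector[String], rows: Vector[Vector[Double]])

  // Split into the "key file" and the "number file" of the proposal.
  def split(m: Keyed): (Vector[String], Vector[Vector[Double]]) = (m.keys, m.rows)

  // A row-wise operation that never looks at the keys.
  def scaleRows(rows: Vector[Vector[Double]], factor: Double): Vector[Vector[Double]] =
    rows.map(_.map(_ * factor))

  // Re-attach keys by ordinal position. This is only correct if the operation
  // kept the rows in the same order and did not add or drop any.
  def reattach(keys: Vector[String], rows: Vector[Vector[Double]]): Keyed = {
    require(keys.length == rows.length, "row count changed; keys no longer line up")
    Keyed(keys, rows)
  }
}
```

The `require` guard catches a changed row count, but it cannot detect a reordering, which is one reason the keys may need to travel with the data.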

Thanks


On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov  wrote:

> Looking at the code, i am still not sure without trying.
>
> but i am more inclined to think now that this specific combination, A'B
> with A and B non-int row keys, is not supported.
>
> As a general principle, we followed where our guinea pigs get us, and were
> not trying to fill all possible gaps and holes, with the belief that will
> get us 80/20 caps in shortest time.
>
> As for the rest, we wait for somebody to ask for it because they need it.
>
> But that example is legal and patch should be fundamentally possible and
> easy enough to handle this case within this architecture.
>
>
>
>
> On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov 
> wrote:
>
> > also, if something is not supported, such as your example, (if it is not
> > supported), optimizer would simply state so with rejection. But if it
> takes
> > it in, then I am pretty sure it will do the right job (or at least
> there's
> > a unit test for that case that is asserted on a trivial example).
> >
> > Here, by trivial i mean local pipelines for 2-split inputs, that's the
> > general rule i used.
> >
> >
> > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov 
> > wrote:
> >
> >> a little bit of additional information is that for rewriting rules stage
> >> optimizer does 3 passes over semantic tree, each pass matching a tree
> >> fragment using Scala case class matching and rewriting. This allows to
> >> match and rewrite pretty elaborate tree structure fragments, although at
> >> the moment i don't think we dig farther than immediate children, and
> >> perhaps some of their known attributes, in most cases.
> >>
> >> More detailed description than that i think is only in reading the source.
> >>
> >>
> >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov 
> >> wrote:
> >>
> >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
> >>> int-keyed.
> >>>
> >>> This is kind of not the point. the point is that you can easily modify
> >>> rewriting rules and operators to cover misses. (there shouldn't be
> many,
> >>> since we've already written quite a bit of expressions out there).
> >>>
> >>>
> >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov 
> >>> wrote:
> >>>
> >>>> I am not sure. There are more rewriting rules than i can remember, and
> >>>> i did not write an algorithm ( i think) that would involve this
> >>>> combination. I guess the best thing is to try in a shell or a unit test. if
> >>>> it falls thru, perhaps a new plan element needs to be added (although I am
> >>>> not very sure there isn't already). I know that there are join-based
> >>>> multiplicative operators there.
> >>>>
> >>>>
> >>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning
> >>>> wrote:
> >>>>
> >>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov
> >>>>> wrote:
> >>>>>
> >>>>> > in simple terms, if non-integer row keying is used anywhere, it tries to
> >>>>> > rewrite pipelines so that product orientations never require non-int keys
> >>>>> > to denote columns. In case pipeline makes it impossible, optimizer will
> >>>>> > refuse to produce a plan.
> >>>>> >
> >>>>> > e.g. suppose A is distributed string-keyed.
> >>>>> >
> >>>>> > (A.t %.% A) collect  // ok
> >>>>> >
> >>>>>
> >>>>> What happens with the important case of B.t %.% A where both A and B are
> >>>>> string keyed?
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
Looking at the code, i am still not sure without trying.

but i am more inclined to think now that this specific combination, A'B
with A and B non-int row keys, is not supported.

As a general principle, we followed where our guinea pigs took us, and were
not trying to fill all possible gaps and holes, in the belief that this would
get us 80/20 caps in the shortest time.

As for the rest, we wait for somebody to ask for it because they need it.

But that example is legal, and a patch should be fundamentally possible and
easy enough to handle this case within this architecture.




On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov  wrote:

> also, if something is not supported, such as your example, (if it is not
> supported), optimizer would simply state so with rejection. But if it takes
> it in, then I am pretty sure it will do the right job (or at least there's
> a unit test for that case that is asserted on a trivial example).
>
> Here, by trivial i mean local pipelines for 2-split inputs, that's the
> general rule i used.
>
>
> On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov 
> wrote:
>
>> a little bit of additional information is that for rewriting rules stage
>> optimizer does 3 passes over semantic tree, each pass matching a tree
>> fragment using Scala case class matching and rewriting. This allows to
>> match and rewrite pretty elaborate tree structure fragments, although at
>> the moment i don't think we dig farther than immediate children, and
>> perhaps some of their known attributes, in most cases.
>>
>> More detailed description than that i think is only in reading the source.
>>
>>
>> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov 
>> wrote:
>>
>>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
>>> int-keyed.
>>>
>>> This is kind of not the point. the point is that you can easily modify
>>> rewriting rules and operators to cover misses. (there shouldn't be many,
>>> since we've already written quite a bit of expressions out there).
>>>
>>>
>>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov 
>>> wrote:
>>>
>>>> I am not sure. There are more rewriting rules than i can remember, and
>>>> i did not write an algorithm ( i think) that would involve this
>>>> combination. I guess the best thing is to try in a shell or a unit test. if
>>>> it falls thru, perhaps a new plan element needs to be added (although I am
>>>> not very sure there isn't already). I know that there are join-based
>>>> multiplicative operators there.
>>>>
>>>>
>>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning
>>>> wrote:
>>>>
>>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov
>>>>> wrote:
>>>>>
>>>>> > in simple terms, if non-integer row keying is used anywhere, it tries to
>>>>> > rewrite pipelines so that product orientations never require non-int keys
>>>>> > to denote columns. In case pipeline makes it impossible, optimizer will
>>>>> > refuse to produce a plan.
>>>>> >
>>>>> > e.g. suppose A is distributed string-keyed.
>>>>> >
>>>>> > (A.t %.% A) collect  // ok
>>>>> >
>>>>>
>>>>> What happens with the important case of B.t %.% A where both A and B are
>>>>> string keyed?
>>>>>
>>>>
>>>>
>>>
>>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
Also, if something is not supported, such as your example (if it is indeed not
supported), the optimizer would simply say so by rejecting it. But if it accepts
the expression, then I am pretty sure it will do the right job (or at least
there's a unit test for that case that is asserted on a trivial example).

Here, by trivial i mean local pipelines for 2-split inputs; that's the
general rule i used.


On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov  wrote:

> a little bit of additional information is that for rewriting rules stage
> optimizer does 3 passes over semantic tree, each pass matching a tree
> fragment using Scala case class matching and rewriting. This allows to
> match and rewrite pretty elaborate tree structure fragments, although at
> the moment i don't think we dig farther than immediate children, and
> perhaps some of their known attributes, in most cases.
>
> More detailed description than that i think is only in reading the source.
>
>
> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov 
> wrote:
>
>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
>> int-keyed.
>>
>> This is kind of not the point. the point is that you can easily modify
>> rewriting rules and operators to cover misses. (there shouldn't be many,
>> since we've already written quite a bit of expressions out there).
>>
>>
>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov 
>> wrote:
>>
>>> I am not sure. There are more rewriting rules than i can remember, and i
>>> did not write an algorithm ( i think) that would involve this combination.
>>> I guess the best thing is to try in a shell or a unit test. if it falls
>>> thru, perhaps a new plan element needs to be added (although I am not very
>>> sure there isn't already). I know that there are join-based multiplicative
>>> operators there.
>>>
>>>
>>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning 
>>> wrote:
>>>
>>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov
>>>> wrote:
>>>>
>>>> > in simple terms, if non-integer row keying is used anywhere, it tries to
>>>> > rewrite pipelines so that product orientations never require non-int keys
>>>> > to denote columns. In case pipeline makes it impossible, optimizer will
>>>> > refuse to produce a plan.
>>>> >
>>>> > e.g. suppose A is distributed string-keyed.
>>>> >
>>>> > (A.t %.% A) collect  // ok
>>>> >
>>>>
>>>> What happens with the important case of B.t %.% A where both A and B are
>>>> string keyed?
>>>>
>>>
>>>
>>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
A little bit of additional information: for the rewriting-rules stage, the
optimizer does 3 passes over the semantic tree, each pass matching a tree
fragment using Scala case class matching and rewriting it. This allows it to
match and rewrite pretty elaborate tree structure fragments, although at
the moment i don't think we dig farther than immediate children, and
perhaps some of their known attributes, in most cases.

A more detailed description than that is, i think, only available by reading
the source.
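
A toy version of this kind of rewriting is sketched below. It is not Mahout's optimizer: the operator classes (`Leaf`, `Transpose`, `Times`, `AtA`, `AtB`) and the rules are hypothetical, chosen only to show how case class matching on a node and its immediate children can rewrite a tree, and how repeating the pass a fixed number of times is wired up. The real passes may well apply different rule sets per pass.

```scala
object RewriteSketch {
  // Hypothetical logical operators; these are not Mahout's actual plan classes.
  sealed trait Expr
  final case class Leaf(name: String) extends Expr
  final case class Transpose(a: Expr) extends Expr
  final case class Times(a: Expr, b: Expr) extends Expr
  final case class AtA(a: Expr) extends Expr          // fused A'A
  final case class AtB(a: Expr, b: Expr) extends Expr // fused A'B

  // One pass: rewrite the children first, then match this node against its
  // immediate children only.
  def pass(e: Expr): Expr = e match {
    case Leaf(_) => e
    case Transpose(a) =>
      pass(a) match {
        case Transpose(inner) => inner            // (A')' => A
        case other            => Transpose(other)
      }
    case Times(a, b) =>
      (pass(a), pass(b)) match {
        case (Transpose(x), y) if x == y => AtA(x)    // A'A
        case (Transpose(x), y)           => AtB(x, y) // A'B
        case (x, y)                      => Times(x, y)
      }
    case AtA(a)    => AtA(pass(a))
    case AtB(a, b) => AtB(pass(a), pass(b))
  }

  // Apply the pass a fixed number of times (3 here, echoing the message above).
  def optimize(e: Expr, passes: Int = 3): Expr =
    (1 to passes).foldLeft(e)((acc, _) => pass(acc))
}
```

Because every rule is an ordinary `case`, covering a missed combination amounts to adding one more pattern.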


On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov  wrote:

> E.g. i know for sure A %.% B is legal where A is string-keyed and b is
> int-keyed.
>
> This is kind of not the point. the point is that you can easily modify
> rewriting rules and operators to cover misses. (there shouldn't be many,
> since we've already written quite a bit of expressions out there).
>
>
> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov 
> wrote:
>
>> I am not sure. There are more rewriting rules than i can remember, and i
>> did not write an algorithm ( i think) that would involve this combination.
>> I guess the best thing is to try in a shell or a unit test. if it falls
>> thru, perhaps a new plan element needs to be added (although I am not very
>> sure there isn't already). I know that there are join-based multiplicative
>> operators there.
>>
>>
>> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning 
>> wrote:
>>
>>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov 
>>> wrote:
>>>
>>> > in simple terms, if non-integer row keying is used anywhere, it tries
>>> to
>>> > rewrite pipelines so that product orientations never require non-int
>>> keys
>>> > to denote columns. In case pipeline makes it impossible, optimizer will
>>> > refuse to produce a plan.
>>> >
>>> > e.g. suppose A is distributed string-keyed.
>>> >
>>> > (A.t %.% A) collect  // ok
>>> >
>>>
>>> What happens with the important case of  B.t %.% A where both A and B are
>>> string keyed?
>>>
>>
>>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
E.g. i know for sure A %.% B is legal where A is string-keyed and B is
int-keyed.

This is kind of not the point. The point is that you can easily modify
rewriting rules and operators to cover misses. (There shouldn't be many,
since we've already written quite a few expressions out there.)


On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov  wrote:

> I am not sure. There are more rewriting rules than i can remember, and i
> did not write an algorithm ( i think) that would involve this combination.
> I guess the best thing is to try in a shell or a unit test. if it falls
> thru, perhaps a new plan element needs to be added (although I am not very
> sure there isn't already). I know that there are join-based multiplicative
> operators there.
>
>
> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning 
> wrote:
>
>> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov 
>> wrote:
>>
>> > in simple terms, if non-integer row keying is used anywhere, it tries to
>> > rewrite pipelines so that product orientations never require non-int
>> keys
>> > to denote columns. In case pipeline makes it impossible, optimizer will
>> > refuse to produce a plan.
>> >
>> > e.g. suppose A is distributed string-keyed.
>> >
>> > (A.t %.% A) collect  // ok
>> >
>>
>> What happens with the important case of  B.t %.% A where both A and B are
>> string keyed?
>>
>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
I am not sure. There are more rewriting rules than i can remember, and i
did not write an algorithm (i think) that would involve this combination.
I guess the best thing is to try in a shell or a unit test. If it falls
through, perhaps a new plan element needs to be added (although I am not very
sure there isn't one already). I know that there are join-based multiplicative
operators there.


On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning  wrote:

> On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov 
> wrote:
>
> > in simple terms, if non-integer row keying is used anywhere, it tries to
> > rewrite pipelines so that product orientations never require non-int keys
> > to denote columns. In case pipeline makes it impossible, optimizer will
> > refuse to produce a plan.
> >
> > e.g. suppose A is distributed string-keyed.
> >
> > (A.t %.% A) collect  // ok
> >
>
> What happens with the important case of  B.t %.% A where both A and B are
> string keyed?
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov  wrote:

> in simple terms, if non-integer row keying is used anywhere, it tries to
> rewrite pipelines so that product orientations never require non-int keys
> to denote columns. In case pipeline makes it impossible, optimizer will
> refuse to produce a plan.
>
> e.g. suppose A is distributed string-keyed.
>
> (A.t %.% A) collect  // ok
>

What happens with the important case of  B.t %.% A where both A and B are
string keyed?


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
In simple terms, if non-integer row keying is used anywhere, the optimizer
tries to rewrite pipelines so that product orientations never require non-int
keys to denote columns. If the pipeline makes that impossible, the optimizer
will refuse to produce a plan.

e.g. suppose A is distributed string-keyed.

(A.t %.% A) collect  // ok

A.t collect // optimizer error

val (U, V, s) = dssvd(A) // OK, U keyed same way as A

val (U,V) = dals (A) // OK too

etc. etc.
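
The accept/reject behaviour in these examples can be mimicked with a small key-type check. This is a sketch under loud assumptions: `KeyType`, `Drm`, `T`, `Mul` and `plan` are hypothetical names, and the rules encode only what is stated in this thread (A'A is fine, a bare transpose of a string-keyed matrix is rejected, A times an int-keyed B keeps A's keys). Anything else is rejected here, which need not match what the real optimizer does.

```scala
object KeyCheck {
  sealed trait KeyType
  case object IntKey extends KeyType
  case object StringKey extends KeyType

  // Hypothetical expression tree; not Mahout's plan classes.
  sealed trait Expr
  final case class Drm(name: String, key: KeyType) extends Expr
  final case class T(a: Expr) extends Expr
  final case class Mul(a: Expr, b: Expr) extends Expr

  private def check(cond: Boolean, msg: String): Either[String, Unit] =
    if (cond) Right(()) else Left(msg)

  // Returns the row key type of the result, or a rejection message.
  def plan(e: Expr): Either[String, KeyType] = e match {
    case Drm(_, k) => Right(k)
    // A'A: the row keys are summed away, so the result is int-keyed.
    case Mul(T(a), b) if a == b => plan(a).map(_ => IntKey)
    // A bare transpose would need the row keys to label columns.
    case T(a) =>
      plan(a) match {
        case Right(StringKey) => Left("transpose would need non-int column keys")
        case other            => other
      }
    // A B: the product keeps A's row keys; B's rows must be int-keyed.
    case Mul(a, b) =>
      for {
        ka <- plan(a)
        kb <- plan(b)
        _  <- check(kb == IntKey, "right operand must be int-keyed")
      } yield ka
  }
}
```

In this sketch `B.t` times `A` with both operands string-keyed falls through to the bare-transpose rule and is rejected.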




On Wed, Jun 18, 2014 at 6:02 PM, Dmitriy Lyubimov  wrote:

>
>
>
> On Wed, Jun 18, 2014 at 5:58 PM, Ted Dunning 
> wrote:
>
>> On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov 
>> wrote:
>>
>> > >
>> > > Or simply that rows and columns are labeled?
>> > >
>> > rows are labeled. but they have algebraic significance.
>> >
>>
>> Do they really?
>>
>> For the in-core system, if I add two matrices with different row labels,
>> the row labels are ignored.
>
>
> The in-core system always has hard ordinal indexing. The out-of-core system
> has hard ordinal indexing only for columns, or for rows when they are
> int-keyed.
>
>> If I multiply two matrices where the column
>> labels of the first matrix are in a different order than the row labels of
>> the second, the labels are again ignored.  If I do the transpose
>> multiplication where the row labels aren't in the same order, again, no
>> effect.
>>
>> Does the DSL actually permute the rows to make operations work correctly?
>>
>
> You'd be surprised  :)
>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 6:02 PM, Dmitriy Lyubimov  wrote:

> > Does the DSL actually permute the rows to make operations work correctly?
> >
>
> You'd be surprised  :)
>

I might be or not, but I am not surprised by this answer.

What does the DSL actually do?


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 5:58 PM, Ted Dunning  wrote:

> On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov 
> wrote:
>
> > >
> > > Or simply that rows and columns are labeled?
> > >
> > rows are labeled. but they have algebraic significance.
> >
>
> Do they really?
>
> For the in-core system, if I add two matrices with different row labels,
> the row labels are ignored.


The in-core system always has hard ordinal indexing. The out-of-core system has
hard ordinal indexing only for columns, or for rows when they are int-keyed.

> If I multiply two matrices where the column
> labels of the first matrix are in a different order than the row labels of
> the second, the labels are again ignored.  If I do the transpose
> multiplication where the row labels aren't in the same order, again, no
> effect.
>
> Does the DSL actually permute the rows to make operations work correctly?
>

You'd be surprised  :)


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov  wrote:

> >
> > Or simply that rows and columns are labeled?
> >
> rows are labeled. but they have algebraic significance.
>

Do they really?

For the in-core system, if I add two matrices with different row labels,
the row labels are ignored.  If I multiply two matrices where the column
labels of the first matrix are in a different order than the row labels of
the second, the labels are again ignored.  If I do the transpose
multiplication where the row labels aren't in the same order, again, no
effect.

Does the DSL actually permute the rows to make operations work correctly?
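
The distinction behind this question, ordinal addition that ignores labels versus addition that first permutes rows so the labels line up, can be made concrete with a toy sketch. The `Labelled` type and both functions are hypothetical and say nothing about what Mahout's in-core matrices or the DSL actually do.

```scala
object LabelledAdd {
  // Hypothetical labelled matrix: one label per row.
  final case class Labelled(labels: Vector[String], rows: Vector[Vector[Double]])

  // Ordinal addition: row i is added to row i; b's labels are ignored.
  def addOrdinal(a: Labelled, b: Labelled): Labelled =
    Labelled(
      a.labels,
      a.rows.zip(b.rows).map { case (x, y) => x.zip(y).map { case (u, v) => u + v } }
    )

  // Label-aligned addition: permute b's rows into a's label order first.
  // Throws NoSuchElementException if b lacks one of a's labels.
  def addAligned(a: Labelled, b: Labelled): Labelled = {
    val byLabel = b.labels.zip(b.rows).toMap
    addOrdinal(a, Labelled(a.labels, a.labels.map(byLabel)))
  }
}
```

With the same two inputs, the two functions give different sums whenever the label orders differ, which is the behaviour being asked about.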


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
> Are you saying that the values in the matrix are non-numbers?
>

No, our matrices are real-valued. But Anand was referring to row key support,
which can be any type with a Writable view bound (in Scala terms; this also
holds for their persistence in the Mahout sequence file DRM format).



>
> Or simply that rows and columns are labeled?
>
Rows are labeled, but they have algebraic significance.


>
> I was trying to say the latter and add that the core of the matrix is
> entirely numerical.  This is certainly true of the in-core math.
>

True. But again, we were not discussing matrix elements. Just the labels.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 5:39 PM, Dmitriy Lyubimov  wrote:

> > Also, note that the row keys in Mahout are not actually stored in the
> > matrices that we manipulate.
>
>
> They are. I am not sure about DistributedRowMatrix class for mapreduce, but
> in sparkbindings they are. they are intimately relevant to all algebra and
> especially transposition rewrites.
>
> Even in-core matrices support column/row labels, although nobody seems to
> be using it.
>

Are you saying that the values in the matrix are non-numbers?

Or simply that rows and columns are labeled?

I was trying to say the latter and add that the core of the matrix is
entirely numerical.  This is certainly true of the in-core math.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 5:35 PM, Ted Dunning  wrote:

> Also, note that the row keys in Mahout are not actually stored in the
> matrices that we manipulate.


They are. I am not sure about the DistributedRowMatrix class for mapreduce, but
in sparkbindings they are. They are intimately relevant to all algebra and
especially transposition rewrites.

Even in-core matrices support column/row labels, although nobody seems to
be using them.


> If the keys can be handled separately,
> outside of the flow for the data in a drm, then you should be pretty much
> good to go.
>
>
>
>
> On Wed, Jun 18, 2014 at 5:34 PM, Ted Dunning 
> wrote:
>
> >
> > On Wed, Jun 18, 2014 at 12:03 PM, Dmitriy Lyubimov 
> > wrote:
> >
> >> > How important are the String row keys for the algorithms themselves? Would
> >> it
> >> > grossly mess up a workflow if Strings are silently discarded by the
> >> > backend?
> >> >
> >>
> >> like i said, seq2sparse produces them, and postprocessing for stuff like
> >> LSA pipelines would not work.
> >
> >
> > Something as coarse as translating to a dictionary index would probably
> > work.  Creating the dictionary in parallel while reading the data should
> be
> > quite doable.
> >
> >
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
Also, note that the row keys in Mahout are not actually stored in the
matrices that we manipulate.  If the keys can be handled separately,
outside of the flow for the data in a drm, then you should be pretty much
good to go.




On Wed, Jun 18, 2014 at 5:34 PM, Ted Dunning  wrote:

>
> On Wed, Jun 18, 2014 at 12:03 PM, Dmitriy Lyubimov 
> wrote:
>
>> > How important are the String row keys for the algorithms themselves? Would
>> it
>> > grossly mess up a workflow if Strings are silently discarded by the
>> > backend?
>> >
>>
>> like i said, seq2sparse produces them, and postprocessing for stuff like
>> LSA pipelines would not work.
>
>
> Something as coarse as translating to a dictionary index would probably
> work.  Creating the dictionary in parallel while reading the data should be
> quite doable.
>
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 12:03 PM, Dmitriy Lyubimov 
wrote:

> > How important are the String row keys for the algorithms themselves? Would it
> > grossly mess up a workflow if Strings are silently discarded by the
> > backend?
> >
>
> like i said, seq2sparse produces them, and postprocessing for stuff like
> LSA pipelines would not work.


Something as coarse as translating to a dictionary index would probably
work.  Creating the dictionary in parallel while reading the data should be
quite doable.
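
A minimal sketch of that dictionary idea follows, assuming the input arrives as partitions of string keys. The names (`localKeys`, `build`, `invert`) are hypothetical, and the partitions are processed sequentially here; the per-partition step is the part that could run in parallel.

```scala
object KeyDictionary {
  // Per-partition step: collect the distinct keys seen in one partition.
  // This is independent per partition, so it could run in parallel.
  def localKeys(partition: Seq[String]): Set[String] = partition.toSet

  // Merge the per-partition sets and assign dense Int indices. Sorting makes
  // the assignment deterministic regardless of partition order.
  def build(partitions: Seq[Seq[String]]): Map[String, Int] =
    partitions
      .map(localKeys)
      .foldLeft(Set.empty[String])(_ ++ _)
      .toVector
      .sorted
      .zipWithIndex
      .toMap

  // Reverse lookup, for turning Int row indices back into the original keys
  // during post-processing.
  def invert(dict: Map[String, Int]): Vector[String] =
    dict.toVector.sortBy(_._2).map(_._1)
}
```

The backend would then only ever see the Int indices, and `invert` restores the strings when results are written out.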


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 11:47 AM, Anand Avati  wrote:

> Supporting Int and Long keys is easy; both should be working shortly.
> String is tricky, as H2O stores only numbers. One suggestion has been to
> break up the string into bytes and store them as separate columns (and
> re-assemble them on demand). I'll look into String support after finishing
> the operators.
>
> How important are the String row keys for the algorithms themselves? Would it
> grossly mess up a workflow if Strings are silently discarded by the
> backend?
>

like i said, seq2sparse produces them, and postprocessing for stuff like
LSA pipelines would not work.


>
>
> On Wed, Jun 18, 2014 at 10:58 AM, Dmitriy Lyubimov 
> wrote:
>
> > Supporting Int and String keys are perhaps minimum set (Long is welcome,
> > but a second-class citizen)
> >
> > supporting of DrmLike[Int] is required for a lot of things (e.g.
> > Transpose). DrmLike[String] is used in outputs of popular vectorizations
> in
> > Mahout such as seq2sparse.
> >
> >
> > On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:
> >
> > > Still incomplete, everything does NOT work. But lots of progress and
> end
> > is
> > > in sight.
> > >
> > > - Development happening at
> > > https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm
> still
> > > doing lots of commit --amend and git push --force as this is my private
> > > tree.
> > >
> > > - Ground level build issues and classloader incompatibilities fixed.
> > >
> > > - Can load a matrix into H2O either from in core (through
> > drmParallelize())
> > > or HDFS (parser does not support seqfile yet)
> > >
> > > - Only Long type support for Row Keys so far.
> > >
> > > - mapBlock() works. This was the trickiest, other ops seem trivial in
> > > comparison.
> > >
> > > Everything else yet to be done. However I will be putting in more time
> > into
> > > this over the coming days (was working less than part time on this so
> > far.)
> > >
> > > Questions/comments welcome.
> > >
> >
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
Supporting Int and Long keys is easy; both should be working shortly.
String is tricky, as H2O stores only numbers. One suggestion has been to
break up the string into bytes and store them as separate columns (and
re-assemble them on demand). I'll look into String support after finishing
the operators.
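
The byte-splitting suggestion could look roughly like the sketch below. It is an assumption-laden illustration, not H2O code: the layout (one length column followed by a fixed number of byte columns, zero-padded) and the names `encode` and `decode` are made up for this example.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object StringColumns {
  // Encode a key as numeric columns: column 0 holds the byte length, columns
  // 1..width hold the UTF-8 bytes as values in 0..255, zero-padded.
  def encode(key: String, width: Int): Array[Double] = {
    val bytes = key.getBytes(UTF_8)
    require(bytes.length <= width,
      s"key needs ${bytes.length} bytes, only $width columns available")
    val out = new Array[Double](width + 1)
    out(0) = bytes.length.toDouble
    for (i <- bytes.indices) out(i + 1) = (bytes(i) & 0xff).toDouble
    out
  }

  // Re-assemble the key on demand from the numeric columns.
  def decode(cols: Array[Double]): String = {
    val n = cols(0).toInt
    val bytes = Array.tabulate(n)(i => cols(i + 1).toInt.toByte)
    new String(bytes, UTF_8)
  }
}
```

The fixed width is the obvious cost of this layout: it has to be chosen for the longest key, and every shorter key pays for the padding.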

How important are the String row keys for the algorithms themselves? Would it
grossly mess up a workflow if Strings are silently discarded by the backend?



On Wed, Jun 18, 2014 at 10:58 AM, Dmitriy Lyubimov 
wrote:

> Supporting Int and String keys are perhaps minimum set (Long is welcome,
> but a second-class citizen)
>
> supporting of DrmLike[Int] is required for a lot of things (e.g.
> Transpose). DrmLike[String] is used in outputs of popular vectorizations in
> Mahout such as seq2sparse.
>
>
> On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:
>
> > Still incomplete, everything does NOT work. But lots of progress and end
> is
> > in sight.
> >
> > - Development happening at
> > https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> > doing lots of commit --amend and git push --force as this is my private
> > tree.
> >
> > - Ground level build issues and classloader incompatibilities fixed.
> >
> > - Can load a matrix into H2O either from in core (through
> drmParallelize())
> > or HDFS (parser does not support seqfile yet)
> >
> > - Only Long type support for Row Keys so far.
> >
> > - mapBlock() works. This was the trickiest, other ops seem trivial in
> > comparison.
> >
> > Everything else yet to be done. However I will be putting in more time
> into
> > this over the coming days (was working less than part time on this so
> far.)
> >
> > Questions/comments welcome.
> >
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
Supporting Int and String keys is perhaps the minimum set (Long is welcome,
but a second-class citizen).

Support for DrmLike[Int] is required for a lot of things (e.g.
Transpose). DrmLike[String] is used in outputs of popular vectorizations in
Mahout such as seq2sparse.


On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:

> Still incomplete, everything does NOT work. But lots of progress and end is
> in sight.
>
> - Development happening at
> https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> doing lots of commit --amend and git push --force as this is my private
> tree.
>
> - Ground level build issues and classloader incompatibilities fixed.
>
> - Can load a matrix into H2O either from in core (through drmParallelize())
> or HDFS (parser does not support seqfile yet)
>
> - Only Long type support for Row Keys so far.
>
> - mapBlock() works. This was the trickiest, other ops seem trivial in
> comparison.
>
> Everything else yet to be done. However I will be putting in more time into
> this over the coming days (was working less than part time on this so far.)
>
> Questions/comments welcome.
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
This, by the first looks of it, is seriously cool.

I took the liberty of opening a preview PR just to be able to track your work
in a more visible way. All commits you make will be visible there, and all
comments anybody makes will be reflected to jira and the mailing list.

-d


On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:

> Still incomplete, everything does NOT work. But lots of progress and end is
> in sight.
>
> - Development happening at
> https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> doing lots of commit --amend and git push --force as this is my private
> tree.
>
> - Ground level build issues and classloader incompatibilities fixed.
>
> - Can load a matrix into H2O either from in core (through drmParallelize())
> or HDFS (parser does not support seqfile yet)
>
> - Only Long type support for Row Keys so far.
>
> - mapBlock() works. This was the trickiest, other ops seem trivial in
> comparison.
>
> Everything else yet to be done. However I will be putting in more time into
> this over the coming days (was working less than part time on this so far.)
>
> Questions/comments welcome.
>


Re: H2O integration - intermediate progress update

2014-06-18 Thread Sebastian Schelter
Very cool to hear that!
On 18.06.2014 02:38, "Ted Dunning" wrote:

> Very cool, Anand.
>
> Very exciting as it makes the multi-engine story make much more sense.
>
>
> On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:
>
> > Still incomplete, everything does NOT work. But lots of progress and end
> is
> > in sight.
> >
> > - Development happening at
> > https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> > doing lots of commit --amend and git push --force as this is my private
> > tree.
> >
> > - Ground level build issues and classloader incompatibilities fixed.
> >
> > - Can load a matrix into H2O either from in core (through
> drmParallelize())
> > or HDFS (parser does not support seqfile yet)
> >
> > - Only Long type support for Row Keys so far.
> >
> > - mapBlock() works. This was the trickiest, other ops seem trivial in
> > comparison.
> >
> > Everything else yet to be done. However I will be putting in more time
> into
> > this over the coming days (was working less than part time on this so
> far.)
> >
> > Questions/comments welcome.
> >
>


Re: H2O integration - intermediate progress update

2014-06-17 Thread Ted Dunning
Very cool, Anand.

Very exciting as it makes the multi-engine story make much more sense.


On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:

> Still incomplete, everything does NOT work. But lots of progress and end is
> in sight.
>
> - Development happening at
> https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> doing lots of commit --amend and git push --force as this is my private
> tree.
>
> - Ground level build issues and classloader incompatibilities fixed.
>
> - Can load a matrix into H2O either from in core (through drmParallelize())
> or HDFS (parser does not support seqfile yet)
>
> - Only Long type support for Row Keys so far.
>
> - mapBlock() works. This was the trickiest, other ops seem trivial in
> comparison.
>
> Everything else yet to be done. However I will be putting in more time into
> this over the coming days (was working less than part time on this so far.)
>
> Questions/comments welcome.
>