Re: H2O integration - intermediate progress update

2014-06-19 Thread Sebastian Schelter
I share the impression that the tone of conversation has not been very 
welcoming lately, whether intentionally or not. I think we should remind 
ourselves why we are working on open source and try to improve the way 
we communicate.


I think we should try to get as many people as possible together to sit 
at a table and have some face-to-face discussion over a beer or coffee.


--sebastian

On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote:

On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:


On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:


I did not mean to discourage
sincere search for answers.



The tone of answers has lately been very discouraging for those sincerely
searching for answers.  I think we as a community have a responsibility to
do better about this.  There is no need to be insulting to people asking
honest questions in a civil tone.



Ted, we've been over this already. There have been more arguments than
questions, and I am just providing my counterarguments. Do you insist on the
term insulting? Because that, you know, is insulting. You are heading in an
ad hominem direction again.





Re: H2O integration - intermediate progress update

2014-06-19 Thread Dmitriy Lyubimov
Well, let me tell you my impression.

Remember, we started talking impressions here all over the place, not
facts. So *don't ask me to prove anything.*

I have the impression that Mr. Dunning has masterfully concealed a very
targeted insult in his carefully worded statement, with the sole purpose of
forcing certain participants onto the defensive and turning a technical
discussion into trading insults, in which he has obviously partially
succeeded.

I have the impression that this has not been an isolated incident on Mr.
Dunning's part, and I have a strong suspicion that it was the wrong balance
of technical merit and posturing in the project that drove more than one
accomplished committer or candidate out in the past.

I have also been getting the impression that I am the next such target on
Mr. Dunning's part, just because my arguments are not technically favorable
where he needs them to be favorable for whatever other-than-technical
reason. I love the code in this project; that's in part why I am candid in
its discussions. But it is the repeated argumentum ad hominem from Mr.
Dunning that is very close to driving me out. And I don't think beers can
smooth that over.

As for being welcoming, well, H2O is not exactly a new topic here. I also
think we need some bar for proposals to meet, regardless of how welcoming
we are.

Finally, I have the impression that everybody has areas where they possess
less than brilliant expertise; I actually like to say about myself that it
pains me how little I know. I have no problem identifying areas of weakness
in myself publicly, and I don't consider this offensive, since I know that
the only way to improve knowledge is to first know where it is lacking. I
am very receptive to a strong logical argument regardless of whether it
fits my current worldview. But I am particularly not fond of rhetorical
fallacies, informal ones especially. I am not fond of marketing bluff or
empty PR. It is a personal choice whether you accept that mindset, but
grading areas of weakness is not an insult. That's what they do in
universities all the time, after all.


Re: H2O integration - intermediate progress update

2014-06-19 Thread Yash Sharma
Hi All,
Sorry to hijack the thread.
I am a newbie in the Mahout community - please pardon my words if anyone
finds them unsuitable.

It's really strange to see such heated discussions between the big shots on
the mailing lists.
I am an absolute beginner in this space, and it does not leave a very good
impression of the open source community itself.

What I believe is: open source is about the love of code and all awesome
coders coming together, designing some of the coolest code projects on the
planet as one *strong team*. Let's not break this belief of newbies. We are
learning from you guys.

The avengers should not fight.

Best Regards,
Yash






Re: H2O integration - intermediate progress update

2014-06-19 Thread Ted Dunning
On Thu, Jun 19, 2014 at 10:09 AM, Yash Sharma yash...@gmail.com wrote:

 What I believe is - Open Source is about love of code and all awesome
 coders coming together - designing some of the coolest code projects on the
 planet as one *strong team*. Lets not break this belief of newbies. We are
 learning from you guys.

 The avengers should not fight.


Thanks Yash.

Sounds like you have a very good start here.  Community building is very
important and your calming words are a good way to encourage it.


RE: H2O integration - intermediate progress update

2014-06-19 Thread Saikat Kanjilal
I would agree, and would second not wanting to hijack the discussion. I'm not a 
newbie to Mahout, but to be frank, I've seen this tone from committers when 
evaluating or describing ideas, or when judging new work that someone wants to 
contribute. I would also add that code commits can and should come in from 
anyone and be judged fairly, without immediate and early dismissal of ideas. 
Frankly, I'm interested in committing just for the purposes of learning, and the 
general tone of this discussion is not encouraging to folks interested in 
shaping/using/adding to Mahout in the future.

My 2 cents. 


Re: H2O integration - intermediate progress update

2014-06-19 Thread Ted Dunning
On Thu, Jun 19, 2014 at 9:36 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 i have an impression that mr. Dunning has masterfully concealed a very
 targeted insult in his carefully worded statement with the sole purpose of
 forcing certain participants to go into defensive and to turn a technical
 discussion into trading insults, in which he has obviously partially
 succeeded.

 I have an impression this has not been an isolated incident on mr.
 Dunning's part in the past, and i have strong suspicion that it was the
 wrong balance of technical merit and posturing in the project that drove
 more than one accomplished committer or candidate out in the past.

 I also have been receiving an impression that  I am  next such target on
 mr. Dunnings part just because my arguments are not technically favorable
 where he needs them to be favorable for whatever other-than-technical
 reason. I love the code in the project, that's in part why i am candid in
 its discussions, but it is repeated agrumentum ad hominem  from mr. Dunning
 that is very close to driving me out. And I don't think beers can smooth
 that.


I can say that my only intent was to try to help get the tone on the
mailing list back to a more gentle and encouraging path.  I did not intend
any insult and purposely tried to refer to all of us together as having the
problem.

In general, the only target I have is building up the Mahout community.

I don't want to encourage a negative thread to continue very far, but I do
feel that there is a difference between technical discussions about
technical merit, technical discussions that descend into personal attacks,
and discussions about the form, tone, and manner of discussions.

Only the second is, in my opinion, ad hominem. I think we all agree that
it is a bad thing.

The first form of discussion is what we should mostly have, but
occasionally there needs to be a bit of discussion of the third kind. In
particular, occasional feedback such as Yash's impassioned comment just now
that things are not working right can be very, very helpful. Whenever
anybody gives this feedback, it is important to step back a bit and think
about what it means. For instance, even though I disagree with Dmitriy's
assessment of my motives, I am going to think carefully about how to
improve his impressions.

This third kind of discussion can be delicate and difficult.  It can be
distasteful to have in public, but I think that we all owe it to the
project to try to make things work better if we possibly can.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
This, by the first look of it, is seriously cool.

I took the liberty of opening a preview PR just to be able to track your
work in a more visible way. All commits you make will be visible there, and
all comments anybody makes will be reflected to JIRA and the mailing list.

-d


On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati av...@gluster.org wrote:

 Still incomplete, everything does NOT work. But lots of progress and end is
 in sight.

 - Development happening at
 https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
 doing lots of commit --amend and git push --force as this is my private
 tree.

 - Ground level build issues and classloader incompatibilities fixed.

 - Can load a matrix into H2O either from in core (through drmParallelize())
 or HDFS (parser does not support seqfile yet)

 - Only Long type support for Row Keys so far.

 - mapBlock() works. This was the trickiest, other ops seem trivial in
 comparison.

 Everything else yet to be done. However I will be putting in more time into
 this over the coming days (was working less than part time on this so far.)

 Questions/comments welcome.
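The mapBlock() operator mentioned above applies a user function to each vertically stacked block of a distributed matrix, one partition at a time. As a rough, self-contained sketch of those semantics (toy types only; this is not the actual Mahout DRM or H2O API):

```scala
// Toy sketch of mapBlock() semantics: a "distributed" matrix is modeled as a
// sequence of (rowKeys, block) pairs, and mapBlock applies a function per
// block. Types here are illustrative only, not the real Mahout DRM API.
object MapBlockSketch {
  type Block = Array[Array[Double]]        // dense rows of one partition
  type Drm[K] = Seq[(Array[K], Block)]     // (row keys, block) pairs

  def mapBlock[K](drm: Drm[K])(f: (Array[K], Block) => (Array[K], Block)): Drm[K] =
    drm.map { case (keys, block) => f(keys, block) }

  def main(args: Array[String]): Unit = {
    val drm: Drm[Long] = Seq(
      (Array(0L, 1L), Array(Array(1.0, 2.0), Array(3.0, 4.0))),
      (Array(2L, 3L), Array(Array(5.0, 6.0), Array(7.0, 8.0))))

    // Scale every element by 2, keeping the keys intact.
    val doubled = mapBlock(drm) { case (keys, block) =>
      (keys, block.map(_.map(_ * 2.0)))
    }
    println(doubled.head._2.head.toSeq)   // first row of the first block
  }
}
```

The point Anand makes holds in the sketch too: once per-block application with key passthrough works, elementwise operators are trivial specializations of it.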



Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
Supporting Int and String keys is perhaps the minimum set (Long is welcome,
but a second-class citizen).

Supporting DrmLike[Int] is required for a lot of things (e.g. transpose).
DrmLike[String] is used in the outputs of popular vectorizations in Mahout
such as seq2sparse.





Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
Supporting Int and Long keys is easy; both should be working shortly.
String is tricky, as H2O stores only numbers. One suggestion has been to
break the string up into bytes and store them as separate columns (and
re-assemble them on demand). I'll look into String support after finishing
the operators.

How important are String row keys to the algorithms themselves? Would it
grossly mess up a workflow if Strings were silently discarded by the
backend?
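For intuition, the byte-column suggestion can be sketched as follows: pad each string key to a fixed width, spread its bytes over numeric cells, and reassemble on demand. This is a toy illustration only (fixed width and zero-padding are assumptions, not H2O's storage layer):

```scala
// Toy sketch of the "store string keys as byte columns" idea: each string is
// padded to a fixed width and spread over numeric cells, then reassembled on
// demand. Illustrative only; not the actual H2O storage layer.
object StringKeyColumns {
  val Width = 16  // hypothetical maximum key length

  // Encode a string key into Width numeric cells (0.0 used as padding).
  def encode(key: String): Array[Double] = {
    val bytes = key.getBytes("UTF-8")
    require(bytes.length <= Width, s"key too long: $key")
    Array.tabulate(Width)(i => if (i < bytes.length) bytes(i).toDouble else 0.0)
  }

  // Reassemble the string from its numeric cells, dropping the padding.
  def decode(cells: Array[Double]): String =
    new String(cells.takeWhile(_ != 0.0).map(_.toByte), "UTF-8")

  def main(args: Array[String]): Unit = {
    println(decode(encode("doc-42")))   // round-trips back to "doc-42"
  }
}
```

The obvious cost is Width extra columns per row plus the reassembly pass, which is why the dictionary-index alternative discussed below in the thread is attractive.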






Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 11:47 AM, Anand Avati av...@gluster.org wrote:

 Supporting Int and Long keys are easy, both should be working shortly.
 String is tricky, as H2O stores only numbers. One suggestion has been to
 break up the string into bytes and store them as separate columns (and
 re-assemble them on demand). I'll look into String support after finishing
 the operators.

 How important are the String row keys for the algorithms itself? Would it
 grossly mess up a workflow if Strings are silently discarded by the
 backend?


Like I said, seq2sparse produces them, and postprocessing for stuff like
LSA pipelines would not work.







Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 12:03 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

  How important are the String row keys for the algorithms itself? Would it
  grossly mess up a workflow if Strings are silently discarded by the
  backend?
 

 like i said, seq2sparse produces them, and postprocessing for stuff like
 LSA pipelines would not work.


Something as coarse as translating to a dictionary index would probably
work.  Creating the dictionary in parallel while reading the data should be
quite doable.
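The dictionary-index translation can be sketched in a few lines: assign each distinct string key an integer ordinal as the data is read, and keep the reverse mapping for postprocessing. A minimal single-process sketch (the real thing would be built in parallel while reading, as suggested; names here are illustrative):

```scala
import scala.collection.mutable

// Minimal sketch of translating string row keys to dictionary indices.
// A parallel implementation would build this while reading the data;
// this is the simple single-pass version of the same idea.
object KeyDictionary {
  def build(keys: Seq[String]): (Map[String, Int], Array[String]) = {
    val index = mutable.LinkedHashMap.empty[String, Int]
    keys.foreach(k => index.getOrElseUpdate(k, index.size))  // first-seen ordinal
    val reverse = index.keys.toArray                          // ordinal -> key
    (index.toMap, reverse)
  }

  def main(args: Array[String]): Unit = {
    val (dict, rev) = build(Seq("doc-a", "doc-b", "doc-a", "doc-c"))
    println(dict("doc-c"))  // 2
    println(rev(0))         // doc-a
  }
}
```

With such a dictionary in hand, the DRM itself stays int-keyed, and string labels are rejoined only at the edges of the pipeline.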


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
Also, note that the row keys in Mahout are not actually stored in the
matrices that we manipulate.  If the keys can be handled separately,
outside of the flow for the data in a drm, then you should be pretty much
good to go.








Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 5:35 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Also, note that the row keys in Mahout are not actually stored in the
 matrices that we manipulate.


They are. I am not sure about the DistributedRowMatrix class for MapReduce,
but in the sparkbindings they are. They are intimately relevant to all the
algebra, and especially to transposition rewrites.

Even in-core matrices support column/row labels, although nobody seems to
be using that.


 If the keys can be handled separately,
 outside of the flow for the data in a drm, then you should be pretty much
 good to go.







Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 5:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

  Also, note that the row keys in Mahout are not actually stored in the
  matrices that we manipulate.


 They are. I am not sure about DistributedRowMatrix class for mapreduce, but
 in sparkbindings they are. they are intimately relevant to all algebra and
 especially transposition rewrites.

 Even in-core matrices support column/row labels, although nobody seems to
 be using it.


Are you saying that the values in the matrix are non-numbers?

Or simply that rows and columns are labeled?

I was trying to say the latter and add that the core of the matrix is
entirely numerical.  This is certainly true of the in-core math.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
 Are you saying that the values in the matrix are non-numbers?


No, our matrices are real-valued. But Anand was referring to row key
support, which can be any type with a Writable view bound (in Scala terms;
this is also true of their persistence in Mahout's sequence-file DRM
format).




 Or simply that rows and columns are labeled?

Rows are labeled, but the labels have algebraic significance.



 I was trying to say the latter and add that the core of the matrix is
 entirely numerical.  This is certainly true of the in-core math.


True. But again, we were not discussing matrix elements. Just the labels.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 
  Or simply that rows and columns are labeled?
 
 rows are labeled. but they have algebraic signficance.


Do they really?

For the in-core system, if I add two matrices with different row labels,
the row labels are ignored. If I multiply two matrices where the column
labels of the first matrix are in a different order than the row labels of
the second, the labels are again ignored. If I do the transpose
multiplication where the row labels aren't in the same order, again, no
effect.

Does the DSL actually permute the rows to make operations work correctly?


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 5:58 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  
   Or simply that rows and columns are labeled?
  
  rows are labeled. but they have algebraic signficance.
 

 Do they really?

 For the in-core system, if I add two matrices with different row labels,
 the row labels are ignored.


The in-core system always has hard ordinal indexing. The out-of-core system
has hard ordinal indexing only for columns, or for rows when they are
int-keyed.

 If I multiply two matrices where the column
 labels of the first matrix are in a different order than the row labels of
 the second, the labels are again ignore.  If I do the transpose
 multiplication where the row labels aren't in the same order, again, no
 effect.

 Does the DSL actually permute the rows to make operations work correctly?


You'd be surprised  :)


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 6:02 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

  Does the DSL actually permute the rows to make operations work correctly?
 

 You'd be surprised  :)


I might be or not, but I am not surprised by this answer.

What does the DSL actually do?


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
In simple terms, if non-integer row keying is used anywhere, the optimizer
tries to rewrite pipelines so that product orientations never require
non-int keys to denote columns. When a pipeline makes that impossible, the
optimizer will refuse to produce a plan.

e.g. suppose A is distributed string-keyed.

(A.t %.% A) collect  // ok

A.t collect // optimizer error

val (U, V, s) = dssvd(A) // OK, U keyed same way as A

val (U,V) = dals (A) // OK too

etc. etc.
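The rule behind these examples can be illustrated with a toy plan checker: transposing a matrix would turn its row keys into column indices, which must be ints, so any plan that surfaces non-int keys as columns is rejected. A deliberately simplified sketch of that check (the real rules live in the sparkbindings source; these case classes are hypothetical):

```scala
// Toy sketch of the validation rule above: collecting A.t is rejected when
// A is string-keyed, because transposition would make the string keys into
// column indices. Simplified illustration; not the real Mahout planner.
object KeyPlanSketch {
  sealed trait Plan { def intKeyed: Boolean }
  case class Input(intKeyed: Boolean) extends Plan
  case class Transpose(a: Plan) extends Plan { val intKeyed = true }
  case class AtA(a: Plan) extends Plan { val intKeyed = true } // A.t %.% A

  def validate(p: Plan): Either[String, Plan] = p match {
    case Transpose(a) if !a.intKeyed =>
      Left("cannot transpose: non-int row keys would become columns")
    case _ => Right(p) // A.t %.% A is fine: keys only label rows of A,
                       // and its result is square and int-indexed
  }

  def main(args: Array[String]): Unit = {
    val a = Input(intKeyed = false)          // e.g. string-keyed seq2sparse output
    println(validate(AtA(a)).isRight)        // (A.t %.% A) collect is accepted
    println(validate(Transpose(a)).isLeft)   // A.t collect is rejected
  }
}
```

This mirrors why dssvd(A) is fine for a string-keyed A: U inherits A's row keys, and nothing in the plan ever needs those keys as column indices.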







Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 in simple terms, if non-integer row keying is used anywhere, it tries to
 rewrite pipelines so that product orientations never require non-int keys
 to denote columns. In case pipeline makes it impossible, optimizer will
 refuse to produce a plan.

 e.g. suppose A is distributed string-keyed.

 (A.t %.% A) collect  // ok


What happens in the important case of B.t %.% A, where both A and B are
string-keyed?


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
I am not sure. There are more rewriting rules than I can remember, and I
did not write an algorithm (I think) that would involve this combination.
The best thing is to try it in a shell or a unit test. If it falls through,
perhaps a new plan element needs to be added (although I am not at all sure
there isn't one already). I know that there are join-based multiplicative
operators in there.





Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
E.g., I know for sure that A %.% B is legal where A is string-keyed and B
is int-keyed.

That is kind of beside the point, though. The point is that you can easily
modify the rewriting rules and operators to cover misses (there shouldn't
be many, since we've already written quite a few expressions out there).






Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
A little additional information: for the rewriting-rules stage, the
optimizer makes three passes over the semantic tree, each pass matching a
tree fragment using Scala case-class matching and rewriting it. This allows
matching and rewriting pretty elaborate tree-structure fragments, although
at the moment I don't think we dig farther than the immediate children, and
perhaps some of their known attributes, in most cases.

A more detailed description than that, I think, exists only in the source.
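That pass structure can be sketched with a toy semantic tree: each pass recursively pattern-matches a small fragment (a node and its immediate children) and rewrites it. A deliberately tiny illustration of the technique with one made-up rule, not Mahout's actual rule set:

```scala
// Toy sketch of one rewriting pass over a semantic tree using Scala
// case-class pattern matching. The single illustrative rule here is
// Transpose(Transpose(x)) => x (i.e. A.t.t == A). Not Mahout's rule set.
object RewriteSketch {
  sealed trait Expr
  case class Leaf(name: String) extends Expr
  case class Transpose(a: Expr) extends Expr
  case class Product(a: Expr, b: Expr) extends Expr   // stands in for %.%

  // One recursive pass: match a fragment at this node (node + immediate
  // child), rewrite it, and recurse into the children.
  def pass(e: Expr): Expr = e match {
    case Transpose(Transpose(x)) => pass(x)            // A.t.t => A
    case Transpose(a)            => Transpose(pass(a))
    case Product(a, b)           => Product(pass(a), pass(b))
    case leaf: Leaf              => leaf
  }

  def main(args: Array[String]): Unit = {
    val plan = Product(Transpose(Transpose(Leaf("A"))), Leaf("B"))
    println(pass(plan))   // Product(Leaf(A),Leaf(B))
  }
}
```

Matching on nested constructors like `Transpose(Transpose(x))` is exactly what makes the "node plus immediate children" fragment matching cheap to express; multiple passes simply apply such functions in sequence.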






Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
also, if something is not supported, such as your example (if it indeed is
not supported), the optimizer would simply say so with a rejection. But if it
takes it in, then I am pretty sure it will do the right job (or at least
there's a unit test for that case, asserted on a trivial example).

Here, by trivial i mean local pipelines over 2-split inputs; that's the
general rule i used.


On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 a little bit of additional information is that for rewriting rules stage
 optimizer does 3 passes over semantic tree, each pass matching a tree
 fragment using Scala case class matching and rewriting. This allows to
 match and rewrite pretty elaborate tree structure fragments, although at
 the moment i don't think we dig farther than immediate children, and
 perhaps some their known attributes, in most cases.

 More detailed description that that i think is only in reading the source.


 On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 E.g. i know for sure A %.% B is legal where A is string-keyed and b is
 int-keyed.

 This is kind of not the point. the point is that you can easily modify
 rewriting rules and operators to cover misses. (there shouldn't be many,
 since we've already written quite a bit of expressions out there).


 On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 I am not sure. There are more rewriting rules than i can remember, and i
 did not write an algorithm ( i think) that would involve this combination.
 I guess the best thing is to try in a shell or a unit test. if it falls
 thru, perhaps a new plan element needs to be added (although I am not very
 sure there isn't already). I know that there are join-based multiplicative
 operators there.


 On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  in simple terms, if non-integer row keying is used anywhere, it tries
 to
  rewrite pipelines so that product orientations never require non-int
 keys
  to denote columns. In case pipeline makes it impossible, optimizer
 will
  refuse to produce a plan.
 
  e.g. suppose A is distributed string-keyed.
 
  (A.t %.% A) collect  // ok
 

 What happens with the important case of  B.t %.% A where both A and B
 are
 string keyed?







Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
Looking at the code, i am still not sure without trying.

but i am more inclined now to think that this specific combination, A'B
with A and B having non-int row keys, is not supported.

As a general principle, we followed where our guinea pigs got us, and were
not trying to fill all possible gaps and holes, in the belief that this will
get us 80/20 coverage in the shortest time.

As for the rest, we wait for somebody to ask for it because they need it.

But that example is legal, and a patch to handle this case should be
fundamentally possible, and easy enough, within this architecture.
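A toy sketch of that planner behavior (the rule set and names here are
guesses for illustration, not Mahout's actual checks): the fused self-product
works regardless of row-key type, while A' B without int row keys is
rejected outright rather than planned incorrectly:

```python
# Hypothetical mini-planner: "AtA" is the fused self-product (row keys
# drop out of the result), "AtB" needs a join on the row keys, which this
# toy only supports for int keys. Rules are illustrative only.
def plan(expr):
    op, *args = expr
    if op == "AtA":                  # (A.t %.% A): keys drop out, always ok
        return "op-at-a"
    if op == "AtB":                  # (B.t %.% A): join on row keys
        a_keys, b_keys = args
        if a_keys is int and b_keys is int:
            return "op-at-b"
        raise ValueError("no plan: A' B with non-int row keys is unsupported")
    raise ValueError(f"unknown op {op}")

print(plan(("AtA",)))                # op-at-a
try:
    plan(("AtB", str, str))          # both operands string-keyed: rejected
except ValueError as e:
    print(e)
```

The point of rejecting instead of guessing is that a missing case surfaces
immediately as a planner error, and covering it later is a matter of adding
one more rule and physical operator.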




On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 also, if something is not supported, such as your example, (if it is not
 supported), optimizer would simply state so with rejection. But if it takes
 it in, then I am pretty sure it will do the right job (or at least there's
 a unit test for that case that is asserted on a trivial example).

 Here, by trivial i mean local pipelines for 2-split inputs, that's the
 general rule i used.


 On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 a little bit of additional information is that for rewriting rules stage
 optimizer does 3 passes over semantic tree, each pass matching a tree
 fragment using Scala case class matching and rewriting. This allows to
 match and rewrite pretty elaborate tree structure fragments, although at
 the moment i don't think we dig farther than immediate children, and
 perhaps some their known attributes, in most cases.

 More detailed description that that i think is only in reading the source.


 On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 E.g. i know for sure A %.% B is legal where A is string-keyed and b is
 int-keyed.

 This is kind of not the point. the point is that you can easily modify
 rewriting rules and operators to cover misses. (there shouldn't be many,
 since we've already written quite a bit of expressions out there).


 On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 I am not sure. There are more rewriting rules than i can remember, and
 i did not write an algorithm ( i think) that would involve this
 combination. I guess the best thing is to try in a shell or a unit test. if
 it falls thru, perhaps a new plan element needs to be added (although I am
 not very sure there isn't already). I know that there are join-based
 multiplicative operators there.


 On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  in simple terms, if non-integer row keying is used anywhere, it
 tries to
  rewrite pipelines so that product orientations never require non-int
 keys
  to denote columns. In case pipeline makes it impossible, optimizer
 will
  refuse to produce a plan.
 
  e.g. suppose A is distributed string-keyed.
 
  (A.t %.% A) collect  // ok
 

 What happens with the important case of  B.t %.% A where both A and B
 are
 string keyed?








Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
Would it not be possible (or even a good idea) to keep row keys completely
separate from the DRM, and let DRMs be pure nRow x nCol numbers? None of the
operators (so far) care about the keys. At least none of the existing
mapBlock() users do anything with the key. I'm not sure we can do
anything meaningful with the key in a mapBlock. It feels like they are
tightly coupled when they need not have been. I must admit I'm new to this,
but it feels like keys could be stored in one file and matrix numbers in
another. Mahout (should) only care about and operate on the matrix numbers:
it reads from the number file and writes output to a new number file, and
the user can use the new number file with the old/original key file -
effectively the same result as loading keys, carrying them through all the
operations, and writing them back. Am I missing something fundamental?

Thanks
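A minimal in-memory sketch of that proposal (hypothetical keys and values;
file I/O elided, with plain lists standing in for the key file and the
number file):

```python
# Keys and numbers kept separate: numeric ops run on the numbers alone,
# and the untouched key list is re-associated with the result afterwards.
keys = ["row-a", "row-b", "row-c"]              # hypothetical string row keys
numbers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # pure nRow x nCol data

# purely numeric operation: scale every entry; row order is preserved
scaled = [[2.0 * x for x in row] for row in numbers]

# user re-attaches the original key file to the new number file
result = dict(zip(keys, scaled))
print(result["row-b"])   # [6.0, 8.0]
```

Note this scheme only stays consistent as long as every operation preserves
row order and count; operations that reorder, drop, or re-key rows would
need the keys carried along, which is exactly the coupling under discussion
in this thread.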


On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Looking at the code, i am still not sure without trying.

 but i am more inclined to think now that this specific combination, A'B
 with A and B non-int row keys, is not supported.

 As a general principle, we followed where our guinea pigs get us, and were
 not trying to fill all possible gaps and holes, with the belief that will
 get us 80/20 caps in shortest time.

 As for the rest, we wait for somebody to ask for it because they need it.

 But that example is legal and patch should be fundamentally possible and
 easy enough to handle this case within this architecture.




 On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  also, if something is not supported, such as your example, (if it is not
  supported), optimizer would simply state so with rejection. But if it
 takes
  it in, then I am pretty sure it will do the right job (or at least
 there's
  a unit test for that case that is asserted on a trivial example).
 
  Here, by trivial i mean local pipelines for 2-split inputs, that's the
  general rule i used.
 
 
  On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  a little bit of additional information is that for rewriting rules stage
  optimizer does 3 passes over semantic tree, each pass matching a tree
  fragment using Scala case class matching and rewriting. This allows to
  match and rewrite pretty elaborate tree structure fragments, although at
  the moment i don't think we dig farther than immediate children, and
  perhaps some their known attributes, in most cases.
 
  More detailed description that that i think is only in reading the
 source.
 
 
  On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  E.g. i know for sure A %.% B is legal where A is string-keyed and b is
  int-keyed.
 
  This is kind of not the point. the point is that you can easily modify
  rewriting rules and operators to cover misses. (there shouldn't be
 many,
  since we've already written quite a bit of expressions out there).
 
 
  On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  I am not sure. There are more rewriting rules than i can remember, and
  i did not write an algorithm ( i think) that would involve this
  combination. I guess the best thing is to try in a shell or a unit
 test. if
  it falls thru, perhaps a new plan element needs to be added (although
 I am
  not very sure there isn't already). I know that there are join-based
  multiplicative operators there.
 
 
  On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov dlie...@gmail.com
 
  wrote:
 
   in simple terms, if non-integer row keying is used anywhere, it
  tries to
   rewrite pipelines so that product orientations never require
 non-int
  keys
   to denote columns. In case pipeline makes it impossible, optimizer
  will
   refuse to produce a plan.
  
   e.g. suppose A is distributed string-keyed.
  
   (A.t %.% A) collect  // ok
  
 
  What happens with the important case of  B.t %.% A where both A and B
  are
  string keyed?
 
 
 
 
 
 



Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
I see that this keying is an artifact of the sequencefile format (I'm
reading more about it just now). As I read, it also feels like sequencefile
is really designed with the map/reduce framework in mind, well suited to the
mapper API. It also feels like, in the real world, data is
generated/available in different and more natural formats, and an
ingestion phase converts the more natural file into a sequencefile just
for mapreduce processing. Naive question: is it still relevant to support
this format, given the move away from MR within Mahout? Why design the core
data structure around a format from the framework we moved away from? Why
not work off just CSV files etc.? Also, if we did not have keys in the DRM,
most of the code in the DSL would not need a type parameter, making it so
much simpler for a first-timer to read.

thanks!

On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati av...@gluster.org wrote:

 Would it not be possible (or even a good idea) to keep row keys completely
 separate from DRM, and let DRMs be pure nRow x nCol numbers? None of the
 operators (so far) care about the keys. At least none of the existing
 mapBlock() users do anything with the key. I'm not sure if we can do
 anything meaningful with the key in a mapBlock. It feels they are tightly
 coupled while they need not have been. I must admit I'm new to this, but it
 feels like - keys could be stored in a separate file, and matrix numbers in
 another. Mahout (should) only care about and operate on Matrix numbers,
 reads from the number file, writes output to a new number file, and the
 user can use the new number file with the old/original key file -
 effectively the same result as loading keys and moving them around through
 all the operations and writing back. Am I missing something fundamental?

 Thanks


 On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 Looking at the code, i am still not sure without trying.

 but i am more inclined to think now that this specific combination, A'B
 with A and B non-int row keys, is not supported.

 As a general principle, we followed where our guinea pigs get us, and were
 not trying to fill all possible gaps and holes, with the belief that will
 get us 80/20 caps in shortest time.

 As for the rest, we wait for somebody to ask for it because they need it.

 But that example is legal and patch should be fundamentally possible and
 easy enough to handle this case within this architecture.




 On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  also, if something is not supported, such as your example, (if it is not
  supported), optimizer would simply state so with rejection. But if it
 takes
  it in, then I am pretty sure it will do the right job (or at least
 there's
  a unit test for that case that is asserted on a trivial example).
 
  Here, by trivial i mean local pipelines for 2-split inputs, that's the
  general rule i used.
 
 
  On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  a little bit of additional information is that for rewriting rules
 stage
  optimizer does 3 passes over semantic tree, each pass matching a tree
  fragment using Scala case class matching and rewriting. This allows to
  match and rewrite pretty elaborate tree structure fragments, although
 at
  the moment i don't think we dig farther than immediate children, and
  perhaps some their known attributes, in most cases.
 
  More detailed description that that i think is only in reading the
 source.
 
 
  On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  E.g. i know for sure A %.% B is legal where A is string-keyed and b is
  int-keyed.
 
  This is kind of not the point. the point is that you can easily modify
  rewriting rules and operators to cover misses. (there shouldn't be
 many,
  since we've already written quite a bit of expressions out there).
 
 
  On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  I am not sure. There are more rewriting rules than i can remember,
 and
  i did not write an algorithm ( i think) that would involve this
  combination. I guess the best thing is to try in a shell or a unit
 test. if
  it falls thru, perhaps a new plan element needs to be added
 (although I am
  not very sure there isn't already). I know that there are join-based
  multiplicative operators there.
 
 
  On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov 
 dlie...@gmail.com
  wrote:
 
   in simple terms, if non-integer row keying is used anywhere, it
  tries to
   rewrite pipelines so that product orientations never require
 non-int
  keys
   to denote columns. In case pipeline makes it impossible, optimizer
  will
   refuse to produce a plan.
  
   e.g. suppose A is distributed string-keyed.
  
   (A.t %.% A) collect  // ok
  
 
  What happens with the important case of  B.t %.% A where both A and
 

Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati av...@gluster.org wrote:

 Would it not be possible (or even a good idea) to keep row keys completely
 separate from DRM, and let DRMs be pure nRow x nCol numbers?


Consider that this comes at the cost of breaking compatibility with all the
MR stuff that's been done in Mahout since 2008. Not an option.
But even supposing legacy were not a problem, I see significant benefits in
allowing non-ordinal keys.

For one thing, data almost never comes out of ETL pipelines with
ordinal-enforced keys. Normalizing ordinality would be a pain. There's a
normalization issue for dense data, and there's a uniqueness requirement for
sparse data (in which case it really is no different from any key whose only
requirements are the hash/equals contracts).

Second, having to map to integral keys creates problems in relating and
maintaining relations of the stuff back to its origins.

Given it's already there, being in the position of an architect, I'd never
give it back.




 None of the
 operators (so far) care about the keys.


Simply not true. LSA does, clustering does, and about a dozen other cases
in and outside Mahout do - assuming we are still to support the algorithms
we have not deprecated to date.


 At least none of the existing
 mapBlock() users do anything with the key.


Not true. Not all the examples are in Mahout, but it is still not true.


 I'm not sure if we can do
 anything meaningful with the key in a mapBlock.


You not being sure is not a sufficient condition. The sufficient condition
is that everyone has to be sure of the contrary. It is always hard to argue
the non-existence of a counterexample from positions of probability or
intuition.


 It feels they are tightly
 coupled while they need not have been. I must admit I'm new to this, but it
 feels like - keys could be stored in a separate file, and matrix numbers in
 another. Mahout (should) only care about and operate on Matrix numbers,
 reads from the number file, writes output to a new number file, and the
 user can use the new number file with the old/original key file -
 effectively the same result as loading keys and moving them around through
 all the operations and writing back. Am I missing something fundamental?


All as i said: legacy, ordinality enforcement, etc.



 Thanks


 On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Looking at the code, i am still not sure without trying.
 
  but i am more inclined to think now that this specific combination, A'B
  with A and B non-int row keys, is not supported.
 
  As a general principle, we followed where our guinea pigs get us, and
 were
  not trying to fill all possible gaps and holes, with the belief that will
  get us 80/20 caps in shortest time.
 
  As for the rest, we wait for somebody to ask for it because they need it.
 
  But that example is legal and patch should be fundamentally possible and
  easy enough to handle this case within this architecture.
 
 
 
 
  On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   also, if something is not supported, such as your example, (if it is
 not
   supported), optimizer would simply state so with rejection. But if it
  takes
   it in, then I am pretty sure it will do the right job (or at least
  there's
   a unit test for that case that is asserted on a trivial example).
  
   Here, by trivial i mean local pipelines for 2-split inputs, that's the
   general rule i used.
  
  
   On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
   a little bit of additional information is that for rewriting rules
 stage
   optimizer does 3 passes over semantic tree, each pass matching a tree
   fragment using Scala case class matching and rewriting. This allows to
   match and rewrite pretty elaborate tree structure fragments, although
 at
   the moment i don't think we dig farther than immediate children, and
   perhaps some their known attributes, in most cases.
  
   More detailed description that that i think is only in reading the
  source.
  
  
   On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
   E.g. i know for sure A %.% B is legal where A is string-keyed and b
 is
   int-keyed.
  
   This is kind of not the point. the point is that you can easily
 modify
   rewriting rules and operators to cover misses. (there shouldn't be
  many,
   since we've already written quite a bit of expressions out there).
  
  
   On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov dlie...@gmail.com
 
   wrote:
  
   I am not sure. There are more rewriting rules than i can remember,
 and
   i did not write an algorithm ( i think) that would involve this
   combination. I guess the best thing is to try in a shell or a unit
  test. if
   it falls thru, perhaps a new plan element needs to be added
 (although
  I am
   not very sure there isn't already). I know that there are join-based
   multiplicative operators there.
  
  
   On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning 

Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati av...@gluster.org wrote:

 I see that this key'ing is an artifact of the sequencefile format (reading
 more about it just now).


I view it differently. Having to have ordinal keys on columns is an
artifact of sequence file format. Or Mahout legacy, whatever. row keys are
not constrained to anything. One could require int keys (and a lot of
operations do).

Sequence file indeed has two payload spots in a record, but it doesn't
constrain you to not having keys, or having 333 keys.  The only essential
function of sequence file is sync-able splittability and payload
compression abstraction. People use plain text files with mapreduce for the
same reason, but they don't have clear key-value structure.




 As I'm reading it also feels like sequencefile is
 really designed with the map/reduce framework in mind,


again, not true - it is designed with data affinity in mind. Spark requires
(or, rather, benefits from) data affinity just as much as map reduce, and
so does Stratosphere, and, to a much smaller degree, HBase. Any parallel
system that sends code to the data, and not the other way around, requires
some notion of data partitioning, both in persistent state and in memory.

It would seem to me you hold a lot of misconceptions about why and what
exists in Hadoop (not that everything that exists there exists for a good
reason; and what exists for a good reason usually could be tons better).


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati av...@gluster.org wrote:

  Also, if we did not have Keys in DRM, most of
 the code in the DSL need not have a type parameter, making it so much
 simpler for a first timer to read..


This is also something i am absolutely not sure where it is coming from.

Let's see:

R expression   |   Mahout expression

A %*% B  | A %*% B
A[, 5] | A(::,5)
cbind(A,B) | A cbind B
A * B | A * B
1 / x | 1 /: x
t(A) | A.t
norm(A) | A.norm
colSums(A) | A.colSums

Where is the struggle here ?

I suspect the real reason for all these questions is not architectural, but
rather simplification of the H2O bindings.

That is, probably, a really worthy question: are we ready to screw legacy
algorithm compatibility and the existing bindings' merits just to make the
H2O integration easier? This is a good question, but i am far from sure i
would vote yes here.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
On Wed, Jun 18, 2014 at 9:10 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati av...@gluster.org wrote:

  I see that this key'ing is an artifact of the sequencefile format
 (reading
  more about it just now).


 I view it differently. Having to have ordinal keys on columns is an
 artifact of sequence file format. Or Mahout legacy, whatever. row keys are
 not constrained to anything. One could require int keys (and a lot of
 operations do).

 Sequence file indeed has two payload spots in a record, but it doesn't
 constrain you to not having keys, or having 333 keys.  The only essential
 function of sequence file is sync-able splittability and payload
 compression abstraction. People use plain text files with mapreduce for the
 same reason, but they don't have clear key-value structure.




  As I'm reading it also feels like sequencefile is
  really designed with the map/reduce framework in mind,


 again, not true, it is designed with data affinity in mind. Spark requires
 (or, rather, benefits from) data affinity just as much as map reduce, and
 so does Stratoshpere, and, to much smaller degree, HBase. Any parallel
 system that sends code to the data, and not the other way around, would
 require some notion of data partitioning, both in persistent state and
 in-memory.

 It would seem to me you hold a lot of misconceptions about why and what
 exists in Hadoop (not that all that exists there, exists for a good reason
 though; and what exists for a good reason, usually could be tons times
 better).


I'm only learning about Hadoop now; I'm very new to it. I wouldn't be
surprised if I have misconceptions about a few things!


Re: H2O integration - intermediate progress update

2014-06-18 Thread Anand Avati
On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati av...@gluster.org wrote:

   Also, if we did not have Keys in DRM, most of
  the code in the DSL need not have a type parameter, making it so much
  simpler for a first timer to read..
 

 This is also something i absolutely not sure where it is coming from.

 Let's see:

 R expression   |   Mahout expression

 A %*% B  | A %*% B
 A[, 5] | A(::,5)
 cbind(A,B) | A cbind B
 A * B | A * B
 1 / x | 1 /: x
 t(A) | A.t
 norm(A) | A.norm
 colSums(A) | A.colSums

 Where is the struggle here ?


Not in this at all, but all over the place in sparkbindings (the backend of
the DSL).



 I suspect the real reason for all these questions is not architectural, but
 rather simplification of H20 bindings.

 That is, probably, a really worthy question: are we ready to screw legacy
 algorithm compatibility and existing bindings' merits just to make h2o
 integration easier? This is a good question, but i am far from sure i would
 vote yes here.


Well, sure. I would like to simplify the H2O bindings to the extent I can
(or simplify any task I do in any project). I expect not all my questions
will make sense to those who have the bigger context, but I still ask
without hesitation.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 9:32 PM, Anand Avati av...@gluster.org wrote:

 On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati av...@gluster.org wrote:
 
Also, if we did not have Keys in DRM, most of
   the code in the DSL need not have a type parameter, making it so much
   simpler for a first timer to read..
  
 
  This is also something i absolutely not sure where it is coming from.
 
  Let's see:
 
  R expression   |   Mahout expression
 
  A %*% B  | A %*% B
  A[, 5] | A(::,5)
  cbind(A,B) | A cbind B
  A * B | A * B
  1 / x | 1 /: x
  t(A) | A.t
  norm(A) | A.norm
  colSums(A) | A.colSums
 
  Where is the struggle here ?
 

 Not in this at all, but all over the place in sparkbindings (the backend of
 the DSL).


Users don't write the Spark bindings. Users write scripts - i.e., exactly
what i've shown.

And we (I am confident) are ok with some generics being passed around in
Mahout's guts. We probably should expect to be ok with much bigger
complexity than this, in fact.

BTW the Spark RDD type is RDD[K: ClassTag] as well. Nobody has complained
yet - not a single time, for all their list activity.




  I suspect the real reason for all these questions is not architectural,
 but
  rather simplification of H20 bindings.
 
  That is, probably, a really worthy question: are we ready to screw legacy
  algorithm compatibility and existing bindings' merits just to make h2o
  integration easier? This is a good question, but i am far from sure i
 would
  vote yes here.
 

 Well, sure. I would like to simplify H2O bindings to the extent I can (or
 simplify any task I do in any project). I expect not all questions might
 make sense for those who have a bigger context, but I still ask without
 hesitation.


Ok, I apologize. I just assumed you were at a different level of familiarity
with both Mahout and distributed stacks. I did not mean to discourage a
sincere search for answers.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:




  BTW Spark rdd type is RDD[K:ClassTag] as well. nobody yet complained. not
 a single time. For all their list activity.



Actually it is even scarier in Spark. Consider this type system:

To enable groupBy, for example, an RDD needs to match
RDD[(K: ClassTag, V: ClassTag)].

To enable sort, an RDD needs to match
RDD[(K <% Comparable: ClassTag, V: ClassTag)].

And to enable persisting something to a sequence file, it has to match
RDD[(K <% WritableComparable: ClassTag, V <% Writable: ClassTag)].

And there is probably even something else i don't immediately remember.

Compared to these, we are just simplicity itself.

Still, nobody has yet thought of complaining about those.
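A rough Python analogue of those progressively stronger key bounds (purely
illustrative - Spark enforces them statically via implicits and bounds,
whereas here they simply surface at runtime): grouping a pair collection
needs only hashable keys, while sorting additionally needs keys that are
orderable among themselves.

```python
pairs = [("b", 2), ("a", 1), ("b", 3)]

# "groupBy": hashable keys suffice
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)
print(grouped)        # {'b': [2, 3], 'a': [1]}

# "sortByKey": keys must also compare with one another
print(sorted(pairs))  # [('a', 1), ('b', 2), ('b', 3)]

# mixed, non-comparable keys fail the stronger ordering bound
try:
    sorted([("a", 1), (7, 2)])
except TypeError:
    print("no ordering between str and int keys")
```

The sequence-file case adds yet another bound (serializability of both key
and value), which is the same layering of requirements, one capability per
operation.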


Re: H2O integration - intermediate progress update

2014-06-18 Thread Ted Dunning
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 I did not mean to discourage
 sincere search for answers.


The tone of answers has lately been very discouraging for those sincerely
searching for answers.  I think we as a community have a responsibility to
do better about this.  There is no need to be insulting to people asking
honest questions in a civil tone.  They may or may not be well informed.
 They may or may not have some reason for asking.  And they may well be
pointing out something in our blind spot that we have retained just because
it was that way before.

I congratulate Anand for sticking with it and I strongly appreciate his
questions.


Re: H2O integration - intermediate progress update

2014-06-18 Thread Dmitriy Lyubimov
On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  I did not mean to discourage
  sincere search for answers.
 

 The tone of answers has lately been very discouraging for those sincerely
 searching for answers.  I think we as a community have a responsibility to
 do better about this.  There is no need to be insulting to people asking
 honest questions in a civil tone.


Ted, we've been through this already. There have been more arguments than
questions, and I am just providing my counterarguments. Do you insist on the
term insulting? Cause this, you know, is insulting. You are heading in the
ad hominem direction again.


H2O integration - intermediate progress update

2014-06-17 Thread Anand Avati
Still incomplete - NOT everything works yet. But there's lots of progress,
and the end is in sight.

- Development is happening at
https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
doing lots of commit --amend and git push --force, as this is my private
tree.

- Ground-level build issues and classloader incompatibilities are fixed.

- Can load a matrix into H2O either from in-core (through drmParallelize())
or from HDFS (the parser does not support seqfile yet).

- Only the Long type is supported for row keys so far.

- mapBlock() works. This was the trickiest part; the other ops seem trivial
in comparison.

Everything else is yet to be done. However, I will be putting more time into
this over the coming days (I was working less than part time on this so
far).

Questions/comments welcome.


Re: H2O integration - intermediate progress update

2014-06-17 Thread Ted Dunning
Very cool, Anand.

Very exciting as it makes the multi-engine story make much more sense.


On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati av...@gluster.org wrote:

 Still incomplete, everything does NOT work. But lots of progress and end is
 in sight.

 - Development happening at
 https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
 doing lots of commit --amend and git push --force as this is my private
 tree.

 - Ground level build issues and classloader incompatibilities fixed.

 - Can load a matrix into H2O either from in core (through drmParallelize())
 or HDFS (parser does not support seqfile yet)

 - Only Long type support for Row Keys so far.

 - mapBlock() works. This was the trickiest, other ops seem trivial in
 comparison.

 Everything else yet to be done. However I will be putting in more time into
 this over the coming days (was working less than part time on this so far.)

 Questions/comments welcome.