Re: H2O integration - intermediate progress update
On Thu, Jun 19, 2014 at 9:36 AM, Dmitriy Lyubimov wrote: > i have an impression that mr. Dunning has masterfully concealed a very > targeted insult in his carefully worded statement with the sole purpose of > forcing certain participants to go into defensive and to turn a technical > discussion into trading insults, in which he has obviously partially > succeeded. > > I have an impression this has not been an isolated incident on mr. > Dunning's part in the past, and i have strong suspicion that it was the > wrong balance of technical merit and posturing in the project that drove > more than one accomplished committer or candidate out in the past. > > I also have been receiving an impression that I am next such target on > mr. Dunnings part just because my arguments are not technically favorable > where he needs them to be favorable for whatever other-than-technical > reason. I love the code in the project, that's in part why i am candid in > its discussions, but it is repeated agrumentum ad hominem from mr. Dunning > that is very close to driving me out. And I don't think beers can smooth > that. > I can say that my only intent was to try to help get the tone on the mailing list back to a more gentle and encouraging path. I did not intend any insult and purposely tried to refer to all of us together as having the problem. In general, the only target I have is building up the Mahout community. I don't want to encourage a negative thread to continue very far, but I do feel that there is a difference between technical discussions about technical merit, technical discussions that descend into personal attacks and discussions about the form, tone and manner of discussions. Only the second is, in my opinion, ad hominem. I think we all agree that it is a bad thing. The first form of discussion is what we should mostly have, but occasionally there needs to be a bit of discussion of the third kind. 
In particular, occasional feedback such as the impassioned comment by Yash just now that things are not working right can be very, very helpful. Whenever anybody gives this feedback it is important to step back a bit and think about what it means. For instance, even though I disagree with Dmitriy's assessment of my motives, I am going to think carefully about how to improve his impressions. This third kind of discussion can be delicate and difficult. It can be distasteful to have in public, but I think that we all owe it to the project to try to make things work better if we possibly can.
RE: H2O integration - intermediate progress update
I agree, and I second not wanting to hijack the discussion. I'm not a newbie to Mahout, but to be frank, I've seen this tone from committers when they evaluate or describe ideas or judge new work that someone wants to contribute. I would also add that code commits can and should come in from anyone and be judged fairly, without immediate and early dismissal of ideas. Frankly, I'm interested in committing just for the purposes of learning, and the general tone of this discussion is not encouraging to folks interested in shaping, using, or adding to Mahout in the future. My 2 cents.

> From: yash...@gmail.com > Date: Thu, 19 Jun 2014 22:39:49 +0530 > Subject: Re: H2O integration - intermediate progress update > To: dev@mahout.apache.org > > Hi All, > Sorry to hijack the thread. > I am a newbie in mahout community - please pardon my words if anyone finds > them unsuitable. > > Its really strange to see such heated discussions between the Big Shots on > the mailing lists. > I am absolute beginner in this space and it does not leave a very good > impression about the open source community itself. > > What I believe is - Open Source is about love of code and all awesome > coders coming together - designing some of the coolest code projects on the > planet as one *strong team*. Lets not break this belief of newbies. We are > learning from you guys. > > The avengers should not fight. > > Best Regards, > Yash > > > > > On Thu, Jun 19, 2014 at 10:06 PM, Dmitriy Lyubimov > wrote: > > > Well, let me tell my impression. > > > > Remember, we started talking impressions here all over the place, not > > facts. So *don't ask me to prove anything.* > > > > i have an impression that mr. Dunning has masterfully concealed a very > > targeted insult in his carefully worded statement with the sole purpose of > > forcing certain participants to go into defensive and to turn a technical > > discussion into trading insults, in which he has obviously partially > > succeeded. 
> > > > I have an impression this has not been an isolated incident on mr. > > Dunning's part in the past, and i have strong suspicion that it was the > > wrong balance of technical merit and posturing in the project that drove > > more than one accomplished committer or candidate out in the past. > > > > I also have been receiving an impression that I am next such target on > > mr. Dunnings part just because my arguments are not technically favorable > > where he needs them to be favorable for whatever other-than-technical > > reason. I love the code in the project, that's in part why i am candid in > > its discussions, but it is repeated agrumentum ad hominem from mr. Dunning > > that is very close to driving me out. And I don't think beers can smooth > > that. > > > > As for welcoming, well, h2o is not exactly new topic here. I also think we > > need to have some bar for proposals to meet regardless of being welcoming. > > > > Finally, I have an impression everybody has areas where they possess less > > than brilliant expertise; i actually like to say about myself that "it > > pains me how little i know". I have no problem identifying areas of > > weaknesses in myself publicly and don't consider this to be offensive, > > since i know that the only way to improve knowledge is to first know where > > it is lacking. I am very perceptive to strong logical argument regardless > > if it fits my current world view or not. But I am particularly not fond of > > rhetorical fallacies, informal ones in particular. I am not very fond of > > marketing bluff or empty PR. It is a personal choice whether you accept > > that mindset or not, but grading areas of weakness is not an insult. That's > > what they do in universities all the time, after all. > >
Re: H2O integration - intermediate progress update
On Thu, Jun 19, 2014 at 10:09 AM, Yash Sharma wrote: > What I believe is - Open Source is about love of code and all awesome > coders coming together - designing some of the coolest code projects on the > planet as one *strong team*. Lets not break this belief of newbies. We are > learning from you guys. > > The avengers should not fight. > Thanks Yash. Sounds like you have a very good start here. Community building is very important and your calming words are a good way to encourage it.
Re: H2O integration - intermediate progress update
Hi All,
Sorry to hijack the thread.
I am a newbie in the Mahout community - please pardon my words if anyone finds them unsuitable.

It's really strange to see such heated discussions between the Big Shots on the mailing lists.
I am an absolute beginner in this space and it does not leave a very good impression about the open source community itself.

What I believe is - Open Source is about love of code and all awesome coders coming together - designing some of the coolest code projects on the planet as one *strong team*. Let's not break this belief of newbies. We are learning from you guys.

The avengers should not fight.

Best Regards,
Yash

On Thu, Jun 19, 2014 at 10:06 PM, Dmitriy Lyubimov wrote:
> Well, let me tell my impression.
>
> Remember, we started talking impressions here all over the place, not facts. So *don't ask me to prove anything.*
>
> i have an impression that mr. Dunning has masterfully concealed a very targeted insult in his carefully worded statement with the sole purpose of forcing certain participants to go into defensive and to turn a technical discussion into trading insults, in which he has obviously partially succeeded.
>
> I have an impression this has not been an isolated incident on mr. Dunning's part in the past, and i have strong suspicion that it was the wrong balance of technical merit and posturing in the project that drove more than one accomplished committer or candidate out in the past.
>
> I also have been receiving an impression that I am next such target on mr. Dunning's part just because my arguments are not technically favorable where he needs them to be favorable for whatever other-than-technical reason. I love the code in the project, that's in part why i am candid in its discussions, but it is repeated argumentum ad hominem from mr. Dunning that is very close to driving me out. And I don't think beers can smooth that.
>
> As for welcoming, well, h2o is not exactly new topic here. I also think we need to have some bar for proposals to meet regardless of being welcoming.
>
> Finally, I have an impression everybody has areas where they possess less than brilliant expertise; i actually like to say about myself that "it pains me how little i know". I have no problem identifying areas of weaknesses in myself publicly and don't consider this to be offensive, since i know that the only way to improve knowledge is to first know where it is lacking. I am very perceptive to strong logical argument regardless if it fits my current world view or not. But I am particularly not fond of rhetorical fallacies, informal ones in particular. I am not very fond of marketing bluff or empty PR. It is a personal choice whether you accept that mindset or not, but grading areas of weakness is not an insult. That's what they do in universities all the time, after all.
Re: H2O integration - intermediate progress update
Well, let me tell my impression.

Remember, we started talking impressions here all over the place, not facts. So *don't ask me to prove anything.*

I have an impression that Mr. Dunning has masterfully concealed a very targeted insult in his carefully worded statement, with the sole purpose of forcing certain participants to go on the defensive and to turn a technical discussion into trading insults, in which he has obviously partially succeeded.

I have an impression this has not been an isolated incident on Mr. Dunning's part in the past, and I have a strong suspicion that it was the wrong balance of technical merit and posturing in the project that drove more than one accomplished committer or candidate out in the past.

I also have been receiving an impression that I am the next such target on Mr. Dunning's part just because my arguments are not technically favorable where he needs them to be favorable for whatever other-than-technical reason. I love the code in the project, that's in part why I am candid in its discussions, but it is the repeated argumentum ad hominem from Mr. Dunning that is very close to driving me out. And I don't think beers can smooth that.

As for welcoming, well, h2o is not exactly a new topic here. I also think we need to have some bar for proposals to meet regardless of being welcoming.

Finally, I have an impression everybody has areas where they possess less than brilliant expertise; I actually like to say about myself that "it pains me how little I know". I have no problem identifying areas of weakness in myself publicly and don't consider this to be offensive, since I know that the only way to improve knowledge is to first know where it is lacking. I am very receptive to a strong logical argument regardless of whether it fits my current world view or not. But I am particularly not fond of rhetorical fallacies, informal ones in particular. I am not very fond of marketing bluff or empty PR. It is a personal choice whether you accept that mindset or not, but grading areas of weakness is not an insult. That's what they do in universities all the time, after all.
Re: H2O integration - intermediate progress update
I share the impression that the tone of conversation has not been very welcoming lately, be it intentional or not. I think that we should remind ourselves why we are working on open source and try to improve our ways of communication.

I think we should try to get as many people as possible together to sit at a table and have some face-to-face discussion over a beer or coffee.

--sebastian

On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote:
> On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning wrote:
> > On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov wrote:
> > > I did not mean to discourage sincere search for answers.
> >
> > The tone of answers has lately been very discouraging for those sincerely searching for answers. I think we as a community have a responsibility to do better about this. There is no need to be insulting to people asking honest questions in a civil tone.
>
> Ted, we've been at this already. There have been more arguments than questions. I am just providing my counter arguments.
>
> Do you insist on terms "insulting"? Cause this, you know, insulting. You are heading ad hominem direction again.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning wrote:
> On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov wrote:
> > I did not mean to discourage sincere search for answers.
>
> The tone of answers has lately been very discouraging for those sincerely searching for answers. I think we as a community have a responsibility to do better about this. There is no need to be insulting to people asking honest questions in a civil tone.

Ted, we've been at this already. There have been more arguments than questions. I am just providing my counterarguments.

Do you insist on the term "insulting"? Because that is, you know, insulting. You are heading in an ad hominem direction again.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov wrote: > I did not mean to discourage > sincere search for answers. > The tone of answers has lately been very discouraging for those sincerely searching for answers. I think we as a community have a responsibility to do better about this. There is no need to be insulting to people asking honest questions in a civil tone. They may or may not be well informed. They may or may not have some reason for asking. And they may well be pointing out something in our blind spot that we have retained just because it was that way before. I congratulate Anand for sticking with it and I strongly appreciate his questions.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov wrote:
> BTW Spark rdd type is RDD[K:ClassTag] as well. nobody yet complained. not a single time. For all their list activity.

Actually, it is even "scarier" in Spark. Consider this type system:

To enable groupBy, for example, an RDD needs to match RDD[(K:ClassTag, V:ClassTag)].

To enable sort, an RDD needs to match RDD[(K<%Comparable:ClassTag, V:ClassTag)].

And to enable persisting something to a sequence file, it has to match RDD[(K<%WritableComparable:ClassTag, V<%Writable:ClassTag)].

And probably even something else I don't immediately remember. Compared to these, we are simplicity itself. Still, nobody has yet thought of complaining about those.
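The pattern described in this message, where extra operations become available only when the key and value types satisfy extra constraints, can be sketched in plain Scala without Spark. This is only an illustration under assumptions: `Dataset`, `PairOps`, and `OrderedPairOps` are hypothetical names invented for the sketch, not Spark classes, and `Ordering` stands in for the `Comparable` constraint mentioned above.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for a distributed collection; this is NOT Spark's RDD.
final case class Dataset[T: ClassTag](items: Seq[T]) {
  // The ClassTag context bound keeps the erased element class available at runtime.
  def elementClass: Class[_] = implicitly[ClassTag[T]].runtimeClass
}

object Dataset {
  // Available only when the element type is a (K, V) pair.
  implicit class PairOps[K: ClassTag, V: ClassTag](ds: Dataset[(K, V)]) {
    def groupByKey: Map[K, Seq[V]] =
      ds.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  }

  // Available only when the key type additionally has an Ordering.
  implicit class OrderedPairOps[K: Ordering: ClassTag, V: ClassTag](ds: Dataset[(K, V)]) {
    def sortByKey: Seq[(K, V)] = ds.items.sortBy(_._1)
  }
}

object ContextBoundDemo {
  def main(args: Array[String]): Unit = {
    val ds = Dataset(Seq(("b", 2), ("a", 1), ("b", 3)))
    println(ds.groupByKey) // grouping needs only the pair shape
    println(ds.sortByKey)  // sorting also needs Ordering[String]
  }
}
```

A `Dataset[Int]` has neither method, and a pair dataset whose key type has no `Ordering` gets `groupByKey` but not `sortByKey`. The constraints live in the library's type signatures, while calling code stays free of explicit type parameters.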
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 9:32 PM, Anand Avati wrote: > On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov > wrote: > > > On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati wrote: > > > > > Also, if we did not have Keys in DRM, most of > > > the code in the DSL need not have a type parameter, making it so much > > > simpler for a first timer to read.. > > > > > > > This is also something i absolutely not sure where it is coming from. > > > > Let's see: > > > > Mahout expression | R expression > > > > A %*% B | A %*% B > > A[, 5] | A(::,5) > > cbind(A,B) | A cbind B > > A * B | A * B > > 1 / x | 1 /: x > > t(A) | A.t > > norm(A) | A.norm > > colSums(A) | A.colSums > > > > Where is the "struggle" here ? > > > > Not in this at all, but all over the place in sparkbindings (the backend of > the DSL). > User doesn't write spark bindings. Users write scripts. I.e. exactly what i've shown. And we (I am confident) are ok with some generics passed around in Mahout's guts.We probably should expect to be ok with much bigger complexity in fact than this. BTW Spark rdd type is RDD[K:ClassTag] as well. nobody yet complained. not a single time. For all their list activity. > > > > I suspect the real reason for all these questions is not architectural, > but > > rather simplification of H20 bindings. > > > > That is, probably, a really worthy question: are we ready to screw legacy > > algorithm compatibility and existing bindings' merits just to make h2o > > integration easier? This is a good question, but i am far from sure i > would > > vote "yes" here. > > > > Well, sure. I would like to simplify H2O bindings to the extent I can (or > simplify any task I do in any project). I expect not all questions might > make sense for those who have a bigger context, but I still ask without > hesitation. > Ok. I apologize. i just assumed you were at different level of familiarity with both Mahout and distributed stacks. I did not mean to discourage sincere search for answers.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 9:24 PM, Dmitriy Lyubimov wrote: > On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati wrote: > > > Also, if we did not have Keys in DRM, most of > > the code in the DSL need not have a type parameter, making it so much > > simpler for a first timer to read.. > > > > This is also something i absolutely not sure where it is coming from. > > Let's see: > > Mahout expression | R expression > > A %*% B | A %*% B > A[, 5] | A(::,5) > cbind(A,B) | A cbind B > A * B | A * B > 1 / x | 1 /: x > t(A) | A.t > norm(A) | A.norm > colSums(A) | A.colSums > > Where is the "struggle" here ? > Not in this at all, but all over the place in sparkbindings (the backend of the DSL). > I suspect the real reason for all these questions is not architectural, but > rather simplification of H20 bindings. > > That is, probably, a really worthy question: are we ready to screw legacy > algorithm compatibility and existing bindings' merits just to make h2o > integration easier? This is a good question, but i am far from sure i would > vote "yes" here. > Well, sure. I would like to simplify H2O bindings to the extent I can (or simplify any task I do in any project). I expect not all questions might make sense for those who have a bigger context, but I still ask without hesitation.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 9:10 PM, Dmitriy Lyubimov wrote: > On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati wrote: > > > I see that this key'ing is an artifact of the sequencefile format > (reading > > more about it just now). > > > I view it differently. Having to have ordinal keys on columns is an > artifact of sequence file format. Or Mahout legacy, whatever. row keys are > not constrained to anything. One could require int keys (and a lot of > operations do). > > Sequence file indeed has two payload spots in a record, but it doesn't > constrain you to not having keys, or having 333 keys. The only essential > function of sequence file is sync-able splittability and payload > compression abstraction. People use plain text files with mapreduce for the > same reason, but they don't have clear key-value structure. > > > > > > As I'm reading it also feels like sequencefile is > > really designed with the map/reduce framework in mind, > > > again, not true, it is designed with data affinity in mind. Spark requires > (or, rather, benefits from) data affinity just as much as map reduce, and > so does Stratoshpere, and, to much smaller degree, HBase. Any parallel > system that sends code to the data, and not the other way around, would > require some notion of data partitioning, both in persistent state and > in-memory. > > It would seem to me you hold a lot of misconceptions about why and what > exists in Hadoop (not that all that exists there, exists for a good reason > though; and what exists for a good reason, usually could be tons times > better). > I'm only learning about Hadoop now. I'm very new to it. Wouldn't be surprised if I have misconceptions of a few things!
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati wrote:
> Also, if we did not have Keys in DRM, most of the code in the DSL need not have a type parameter, making it so much simpler for a first timer to read..

This is also something I am absolutely not sure where it is coming from.

Let's see:

R expression | Mahout expression
A %*% B      | A %*% B
A[, 5]       | A(::,5)
cbind(A,B)   | A cbind B
A * B        | A * B
1 / x        | 1 /: x
t(A)         | A.t
norm(A)      | A.norm
colSums(A)   | A.colSums

Where is the "struggle" here?

I suspect the real reason for all these questions is not architectural, but rather simplification of the H2O bindings.

That is, probably, a really worthy question: are we ready to screw legacy algorithm compatibility and existing bindings' merits just to make h2o integration easier? This is a good question, but I am far from sure I would vote "yes" here.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 8:13 PM, Anand Avati wrote:
> I see that this key'ing is an artifact of the sequencefile format (reading more about it just now).

I view it differently. Having to have ordinal keys on columns is an artifact of the sequence file format. Or of Mahout legacy, whatever. Row keys are not constrained to anything. One could require int keys (and a lot of operations do).

A sequence file indeed has two payload spots in a record, but it doesn't constrain you to not having keys, or to having 333 keys. The only essential functions of a sequence file are sync-able splittability and a payload compression abstraction. People use plain text files with MapReduce for the same reason, but they don't have a clear key-value structure.

> As I'm reading it also feels like sequencefile is really designed with the map/reduce framework in mind,

Again, not true; it is designed with data affinity in mind. Spark requires (or, rather, benefits from) data affinity just as much as MapReduce, and so does Stratosphere, and, to a much smaller degree, HBase. Any parallel system that sends code to the data, and not the other way around, would require some notion of data partitioning, both in persistent state and in memory.

It would seem to me you hold a lot of misconceptions about why and what exists in Hadoop (not that all that exists there exists for a good reason, though; and what exists for a good reason usually could be many times better).
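The point about data partitioning can be made concrete with a small sketch. It assumes a simple hash partitioner over keyed records and runs everything in one process; `PartitionDemo` and its methods are hypothetical names for illustration and do not reproduce the Hadoop or Spark APIs.

```scala
// Hypothetical sketch: hash-partition keyed records, then run a function per partition.
object PartitionDemo {
  // Non-negative partition index derived from the key's hashCode.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Group records by partition index; records with equal keys land together.
  def partition[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }

  // "Send code to the data": f sees each partition independently of the others.
  def mapPartitions[K, V, R](parts: Map[Int, Seq[(K, V)]])(f: Seq[(K, V)] => R): Map[Int, R] =
    parts.map { case (id, recs) => id -> f(recs) }

  def main(args: Array[String]): Unit = {
    val parts = partition(Seq((0, 1.0), (1, 2.0), (4, 3.0), (5, 4.0)), 4)
    println(mapPartitions(parts)(_.map(_._2).sum)) // per-partition sums
  }
}
```

Nothing here depends on the key being an ordinal integer; any key with a consistent `hashCode` can be placed, which is in line with the remark above that row keys are not constrained to anything.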
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati wrote:
> Would it not be possible (or even a good idea) to keep row keys completely separate from DRM, and let DRMs be pure nRow x nCol numbers?

Considering this is only at the cost of breaking compatibility with all the MR stuff that's been done in Mahout since 2008: not an option.

But suppose legacy was not a problem; I see significant benefits in allowing non-ordinal keys.

First, data almost never comes out of ETL pipelines with ordinality-enforced keys. Normalizing ordinality would be a pain. There's a normalization issue for dense data, and there's a uniqueness requirement for sparse data (in which case it really is no different from any key with only requirements for hash/equals contracts).

Second, having to map to integral keys creates problems relating the stuff back to its origins and maintaining those relations.

Given it's already there, being in the position of an architect, I'd never give it back.

> None of the operators (so far) care about the keys.

Simply not true. LSA does, clustering does, and about a dozen other cases in and outside Mahout. Assuming we are still to support algorithms we have not deprecated to date.

> At least none of the existing mapBlock() users do anything with the key.

Not true. Not all examples in Mahout, but not true.

> I'm not sure if we can do anything meaningful with the key in a mapBlock.

You not being sure is not a sufficient condition. The sufficient condition is that everyone has to be sure of the contrary. It is always hard to argue the non-existence of a counterexample from positions of probabilities or intuition.

> It feels they are tightly coupled while they need not have been. I must admit I'm new to this, but it feels like - keys could be stored in a separate file, and matrix numbers in another. 
Mahout (should) only care about and operate on Matrix numbers, > reads from the "number" file, writes output to a new "number" file, and the > user can use the new number file with the old/original "key file" - > effectively the same result as loading keys and moving them around through > all the operations and writing back. Am I missing something fundamental? > All i said. legacy, ordinality enforcement etc. etc. > > Thanks > > > On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov > wrote: > > > Looking at the code, i am still not sure without trying. > > > > but i am more inclined to think now that this specific combination, A'B > > with A and B non-int row keys, is not supported. > > > > As a general principle, we followed where our guinea pigs get us, and > were > > not trying to fill all possible gaps and holes, with the belief that will > > get us 80/20 caps in shortest time. > > > > As for the rest, we wait for somebody to ask for it because they need it. > > > > But that example is legal and patch should be fundamentally possible and > > easy enough to handle this case within this architecture. > > > > > > > > > > On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov > > wrote: > > > > > also, if something is not supported, such as your example, (if it is > not > > > supported), optimizer would simply state so with rejection. But if it > > takes > > > it in, then I am pretty sure it will do the right job (or at least > > there's > > > a unit test for that case that is asserted on a trivial example). > > > > > > Here, by trivial i mean local pipelines for 2-split inputs, that's the > > > general rule i used. > > > > > > > > > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov > > > wrote: > > > > > >> a little bit of additional information is that for rewriting rules > stage > > >> optimizer does 3 passes over semantic tree, each pass matching a tree > > >> fragment using Scala case class matching and rewriting. 
This allows to > > >> match and rewrite pretty elaborate tree structure fragments, although > at > > >> the moment i don't think we dig farther than immediate children, and > > >> perhaps some their known attributes, in most cases. > > >> > > >> More detailed description that that i think is only in reading the > > source. > > >> > > >> > > >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov > > >> wrote: > > >> > > >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b > is > > >>> int-keyed. > > >>> > > >>> This is kind of not the point. the point is that you can easily > modify > > >>> rewriting rules and operators to cover misses. (there shouldn't be > > many, > > >>> since we've already written quite a bit of expressions out there). > > >>> > > >>> > > >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov > > > >>> wrote: > > >>> > > I am not sure. There are more rewriting rules than i can remember, > and > > i did not write an algorithm ( i think) that would involve this > > combination. I guess the best thing is to try in a shell or a unit > > test. if > > it falls thru, perhaps a new plan element needs to be added > (although > > I am > > not v
Re: H2O integration - intermediate progress update
I see that this key'ing is an artifact of the sequencefile format (reading more about it just now). As I'm reading it also feels like sequencefile is really designed with the map/reduce framework in mind, suited well for the mapper API. It also feels like, in the real world, data is generated/available in a different and "more natural" formats, and an ingestion phase converts the more "natural" file into a sequencefile just for mapreduce processing. Naive question - Is it still relevant to support this format, given the move away from MR within Mahout? Why design the core data structure around a format from the framework we moved away? Why not work off just CSV files etc.? Also, if we did not have Keys in DRM, most of the code in the DSL need not have a type parameter, making it so much simpler for a first timer to read.. thanks! On Wed, Jun 18, 2014 at 7:20 PM, Anand Avati wrote: > Would it not be possible (or even a good idea) to keep row keys completely > separate from DRM, and let DRMs be pure nRow x nCol numbers? None of the > operators (so far) care about the keys. At least none of the existing > mapBlock() users do anything with the key. I'm not sure if we can do > anything meaningful with the key in a mapBlock. It feels they are tightly > coupled while they need not have been. I must admit I'm new to this, but it > feels like - keys could be stored in a separate file, and matrix numbers in > another. Mahout (should) only care about and operate on Matrix numbers, > reads from the "number" file, writes output to a new "number" file, and the > user can use the new number file with the old/original "key file" - > effectively the same result as loading keys and moving them around through > all the operations and writing back. Am I missing something fundamental? > > Thanks > > > On Wed, Jun 18, 2014 at 6:49 PM, Dmitriy Lyubimov > wrote: > >> Looking at the code, i am still not sure without trying. 
>> >> but i am more inclined to think now that this specific combination, A'B >> with A and B non-int row keys, is not supported. >> >> As a general principle, we followed where our guinea pigs get us, and were >> not trying to fill all possible gaps and holes, with the belief that will >> get us 80/20 caps in shortest time. >> >> As for the rest, we wait for somebody to ask for it because they need it. >> >> But that example is legal and patch should be fundamentally possible and >> easy enough to handle this case within this architecture. >> >> >> >> >> On Wed, Jun 18, 2014 at 6:29 PM, Dmitriy Lyubimov >> wrote: >> >> > also, if something is not supported, such as your example, (if it is not >> > supported), optimizer would simply state so with rejection. But if it >> takes >> > it in, then I am pretty sure it will do the right job (or at least >> there's >> > a unit test for that case that is asserted on a trivial example). >> > >> > Here, by trivial i mean local pipelines for 2-split inputs, that's the >> > general rule i used. >> > >> > >> > On Wed, Jun 18, 2014 at 6:26 PM, Dmitriy Lyubimov >> > wrote: >> > >> >> a little bit of additional information is that for rewriting rules >> stage >> >> optimizer does 3 passes over semantic tree, each pass matching a tree >> >> fragment using Scala case class matching and rewriting. This allows to >> >> match and rewrite pretty elaborate tree structure fragments, although >> at >> >> the moment i don't think we dig farther than immediate children, and >> >> perhaps some their known attributes, in most cases. >> >> >> >> More detailed description that that i think is only in reading the >> source. >> >> >> >> >> >> On Wed, Jun 18, 2014 at 6:19 PM, Dmitriy Lyubimov >> >> wrote: >> >> >> >>> E.g. i know for sure A %.% B is legal where A is string-keyed and b is >> >>> int-keyed. >> >>> >> >>> This is kind of not the point. the point is that you can easily modify >> >>> rewriting rules and operators to cover misses. 
(there shouldn't be >> many, >> >>> since we've already written quite a bit of expressions out there). >> >>> >> >>> >> >>> On Wed, Jun 18, 2014 at 6:15 PM, Dmitriy Lyubimov >> >>> wrote: >> >>> >> I am not sure. There are more rewriting rules than i can remember, >> and >> i did not write an algorithm ( i think) that would involve this >> combination. I guess the best thing is to try in a shell or a unit >> test. if >> it falls thru, perhaps a new plan element needs to be added >> (although I am >> not very sure there isn't already). I know that there are join-based >> multiplicative operators there. >> >> >> On Wed, Jun 18, 2014 at 6:11 PM, Ted Dunning >> wrote: >> >> > On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov < >> dlie...@gmail.com> >> > wrote: >> > >> > > in simple terms, if non-integer row keying is used anywhere, it >> > tries to >> > > rewrite pipelines so that product orientations never require >> non-int >> > keys >> > > to denote columns. In ca
Re: H2O integration - intermediate progress update
Would it not be possible (or even a good idea) to keep row keys completely separate from the DRM, and let DRMs be pure nRow x nCol numbers? None of the operators (so far) care about the keys. At least none of the existing mapBlock() users do anything with the key. I'm not sure if we can do anything meaningful with the key in a mapBlock. It feels like they are tightly coupled when they need not have been.

I must admit I'm new to this, but it feels like keys could be stored in one file and matrix numbers in another. Mahout (should) only care about and operate on matrix numbers: it reads from the "number" file and writes output to a new "number" file, and the user can then use the new number file with the old/original "key" file. That is effectively the same result as loading the keys, moving them around through all the operations, and writing them back.

Am I missing something fundamental?

Thanks
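To make the proposal concrete, here is a minimal sketch, assuming plain Python lists stand in for the "number" file and the "key" file. None of the names below come from Mahout or H2O; they are hypothetical. It shows that the separation works for an operator that preserves rows, and that a transpose changes what the rows are, so the original key file no longer lines up with the output.

```python
# Hypothetical illustration only: plain lists stand in for a DRM.

def map_rows(numbers, fn):
    """Apply fn to every row; row order and row count are preserved."""
    return [fn(row) for row in numbers]


def transpose(numbers):
    """Transpose a dense matrix: the old columns become the new rows."""
    return [list(col) for col in zip(*numbers)]


keys = ["doc-a", "doc-b", "doc-c"]                # the "key" file
numbers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # the "number" file

# Row-preserving operator: the original keys still match the output rows.
doubled = map_rows(numbers, lambda row: [2.0 * x for x in row])
rejoined = dict(zip(keys, doubled))

# Transpose: the output has 2 rows, so the 3 original keys no longer
# describe its rows. They would have to label its columns instead.
transposed = transpose(numbers)
keys_still_fit = len(transposed) == len(keys)
```

Whether this matters in practice depends on which operators a pipeline uses; the sketch only covers these two cases.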
Re: H2O integration - intermediate progress update
Looking at the code, I am still not sure without trying.

But I am now more inclined to think that this specific combination, A'B with both A and B having non-int row keys, is not supported.

As a general principle, we followed where our guinea pigs led us and did not try to fill all possible gaps and holes, in the belief that this would get us to 80/20 capability in the shortest time.

As for the rest, we wait for somebody to ask for it because they need it.

But that example is legal, and a patch to handle this case within this architecture should be fundamentally possible and easy enough.
Re: H2O integration - intermediate progress update
Also, if something is not supported, such as your example (if it is indeed not supported), the optimizer would simply say so by rejecting it. But if it accepts the expression, then I am pretty sure it will do the right job (or at least there's a unit test for that case that is asserted on a trivial example).

Here, by trivial I mean local pipelines for 2-split inputs; that's the general rule I used.
Re: H2O integration - intermediate progress update
A little bit of additional information: in the rewriting-rules stage, the optimizer does 3 passes over the semantic tree, each pass matching a tree fragment using Scala case class matching and rewriting it. This allows us to match and rewrite pretty elaborate tree structure fragments, although at the moment I don't think we dig farther than immediate children, and perhaps some of their known attributes, in most cases.

A more detailed description than that is, I think, only available by reading the source.
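The rewriting idea can be sketched as follows. This is a toy model, not the Mahout optimizer: the node classes (`Leaf`, `Transpose`, `Times`, `AtA`), the two rules, and the choice to run the same rule set three times are all assumptions made for illustration. Python `isinstance` checks stand in for Scala case class matching, and each rule looks only at a node and its immediate children.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    name: str


@dataclass(frozen=True)
class Transpose:
    child: object


@dataclass(frozen=True)
class Times:
    left: object
    right: object


@dataclass(frozen=True)
class AtA:
    child: object  # hypothetical fused operator standing for A' A


def rewrite(node):
    """One bottom-up pass: rewrite the children first, then this node."""
    if isinstance(node, Transpose):
        child = rewrite(node.child)
        if isinstance(child, Transpose):      # (A')' => A
            return child.child
        return Transpose(child)
    if isinstance(node, Times):
        left, right = rewrite(node.left), rewrite(node.right)
        # A' A => AtA(A); only the immediate children are inspected.
        if isinstance(left, Transpose) and left.child == right:
            return AtA(right)
        return Times(left, right)
    if isinstance(node, AtA):
        return AtA(rewrite(node.child))
    return node                               # a Leaf is left unchanged


def optimize(node, passes=3):
    """Run the rewrite pass a fixed number of times."""
    for _ in range(passes):
        node = rewrite(node)
    return node


A = Leaf("A")
plan = optimize(Times(Transpose(A), A))
```

With these rules, `plan` is `AtA(Leaf("A"))`: the explicit transpose has been absorbed into a single operator, so the transposed matrix is never materialized.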
Re: H2O integration - intermediate progress update
E.g. I know for sure A %.% B is legal where A is string-keyed and B is int-keyed.

This is kind of not the point. The point is that you can easily modify rewriting rules and operators to cover misses. (There shouldn't be many, since we've already written quite a few expressions out there.)
Re: H2O integration - intermediate progress update
I am not sure. There are more rewriting rules than I can remember, and I did not write an algorithm (I think) that would involve this combination. I guess the best thing is to try it in a shell or a unit test. If it falls through, perhaps a new plan element needs to be added (although I am not very sure there isn't one already). I know that there are join-based multiplicative operators there.
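For readers wondering what a join-based product could look like, here is a minimal sketch of computing B' A when both operands are keyed by arbitrary row keys. It is not Mahout's operator; the function name and the `{key: row}` dict representation are assumptions. It relies on the identity that B' A is the sum, over matching rows, of the outer product of B's row with A's row, so rows only need to be paired by key and never ordered.

```python
def transpose_times(b_rows, a_rows):
    """Compute B' A for row-keyed matrices given as {key: row} dicts."""
    if b_rows.keys() != a_rows.keys():
        raise ValueError("B and A must have the same set of row keys")
    n_b = len(next(iter(b_rows.values())))   # number of columns in B
    n_a = len(next(iter(a_rows.values())))   # number of columns in A
    result = [[0.0] * n_a for _ in range(n_b)]
    for key, b_row in b_rows.items():        # the "join" on row key
        a_row = a_rows[key]
        for i, b_val in enumerate(b_row):
            for j, a_val in enumerate(a_row):
                result[i][j] += b_val * a_val
    return result


# Keys appear in a different order in the two operands on purpose.
B = {"x": [1.0, 2.0], "y": [3.0, 4.0]}
A = {"y": [10.0], "x": [1.0]}
product = transpose_times(B, A)
```

The result is indexed by B's column ordinals on the rows and A's column ordinals on the columns, so no non-int key ever has to denote a column.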
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 6:07 PM, Dmitriy Lyubimov wrote:

> in simple terms, if non-integer row keying is used anywhere, it tries to
> rewrite pipelines so that product orientations never require non-int keys
> to denote columns. In case pipeline makes it impossible, optimizer will
> refuse to produce a plan.
>
> e.g. suppose A is distributed string-keyed.
>
> (A.t %.% A) collect // ok

What happens with the important case of B.t %.% A where both A and B are string keyed?
Re: H2O integration - intermediate progress update
In simple terms, if non-integer row keying is used anywhere, it tries to rewrite pipelines so that product orientations never require non-int keys to denote columns. If the pipeline makes that impossible, the optimizer will refuse to produce a plan.

E.g. suppose A is distributed and string-keyed.

    (A.t %.% A) collect        // ok
    A.t collect                // optimizer error
    val (U, V, s) = dssvd(A)   // OK, U keyed same way as A
    val (U,V) = dals (A)       // OK too

etc. etc.
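The accept/reject behaviour described above can be modelled with a small checker. This is a toy model under stated assumptions, not the actual optimizer: expressions are hypothetical tuples, `PlanError` and both function names are made up, and the only rules are the ones stated in this thread (A' A is fused, a bare transpose of a non-int-keyed matrix is rejected, and A times an int-keyed B is allowed).

```python
class PlanError(Exception):
    """Raised when no plan avoids non-int keys denoting columns."""


def plan_key_type(expr):
    """Return the row key type of the result, or raise PlanError.

    Expressions are tuples: ("leaf", name, key_type), ("t", child),
    or ("times", left, right).
    """
    kind = expr[0]
    if kind == "leaf":
        return expr[2]
    if kind == "times":
        left, right = expr[1], expr[2]
        # A' A is fused: the transpose is never materialized, and the
        # result rows are indexed by A's (ordinal) columns.
        if left[0] == "t" and left[1] == right:
            plan_key_type(right)
            return "int"
        left_type = plan_key_type(left)
        if plan_key_type(right) != "int":
            raise PlanError("right operand rows must be int-keyed")
        return left_type
    if kind == "t":
        if plan_key_type(expr[1]) != "int":
            raise PlanError("transpose would need non-int column keys")
        return "int"
    raise ValueError("unknown expression kind: %r" % (kind,))


def is_plannable(expr):
    """True if the toy optimizer accepts the expression."""
    try:
        plan_key_type(expr)
        return True
    except PlanError:
        return False


A = ("leaf", "A", "string")
B = ("leaf", "B", "string")
C = ("leaf", "C", "int")
```

In this model `is_plannable(("times", ("t", A), A))` holds and `is_plannable(("t", A))` does not, mirroring the first two examples above. The expression `("times", ("t", B), A)` is rejected only because the toy has no rule for it, which says nothing definite about what the real optimizer does.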
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 6:02 PM, Dmitriy Lyubimov wrote:

> > Does the DSL actually permute the rows to make operations work correctly?
>
> You'd be surprised :)

I might be or not, but I am not surprised by this answer.

What does the DSL actually do?
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 5:58 PM, Ted Dunning wrote:

> On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov wrote:
>
> > > Or simply that rows and columns are labeled?
> >
> > rows are labeled. but they have algebraic significance.
>
> Do they really?
>
> For the in-core system, if I add two matrices with different row labels,
> the row labels are ignored.

The in-core system has always had hard ordinal indexing. The out-of-core system has hard ordinal indexing only for columns, or for rows when they are int-keyed.

> If I multiply two matrices where the column labels of the first matrix are
> in a different order than the row labels of the second, the labels are
> again ignored. If I do the transpose multiplication where the row labels
> aren't in the same order, again, no effect.
>
> Does the DSL actually permute the rows to make operations work correctly?

You'd be surprised :)
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 5:48 PM, Dmitriy Lyubimov wrote:

> > Or simply that rows and columns are labeled?
>
> rows are labeled. but they have algebraic significance.

Do they really?

For the in-core system, if I add two matrices with different row labels, the row labels are ignored. If I multiply two matrices where the column labels of the first matrix are in a different order than the row labels of the second, the labels are again ignored. If I do the transpose multiplication where the row labels aren't in the same order, again, no effect.

Does the DSL actually permute the rows to make operations work correctly?
Re: H2O integration - intermediate progress update
> Are you saying that the values in the matrix are non-numbers?

No, our matrices are real-valued. But Anand was referring to row key support, where the key can be any type with a Writable view bound (in Scala terms; this also holds for their persistence in the Mahout sequence file DRM format).

> Or simply that rows and columns are labeled?

Rows are labeled, but the labels have algebraic significance.

> I was trying to say the latter and add that the core of the matrix is
> entirely numerical. This is certainly true of the in-core math.

True. But again, we were not discussing matrix elements. Just the labels.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 5:39 PM, Dmitriy Lyubimov wrote:

> > Also, note that the row keys in Mahout are not actually stored in the
> > matrices that we manipulate.
>
> They are. I am not sure about DistributedRowMatrix class for mapreduce, but
> in sparkbindings they are. they are intimately relevant to all algebra and
> especially transposition rewrites.
>
> Even in-core matrices support column/row labels, although nobody seems to
> be using it.

Are you saying that the values in the matrix are non-numbers?

Or simply that rows and columns are labeled?

I was trying to say the latter and add that the core of the matrix is entirely numerical. This is certainly true of the in-core math.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 5:35 PM, Ted Dunning wrote:

> Also, note that the row keys in Mahout are not actually stored in the
> matrices that we manipulate.

They are. I am not sure about the DistributedRowMatrix class for MapReduce, but in sparkbindings they are. They are intimately relevant to all the algebra, and especially to transposition rewrites.

Even in-core matrices support column/row labels, although nobody seems to be using them.

> If the keys can be handled separately, outside of the flow for the data in
> a drm, then you should be pretty much good to go.
Re: H2O integration - intermediate progress update
Also, note that the row keys in Mahout are not actually stored in the matrices that we manipulate. If the keys can be handled separately, outside of the flow for the data in a drm, then you should be pretty much good to go.
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 12:03 PM, Dmitriy Lyubimov wrote:

> > How important are the String row keys for the algorithms itself? Would it
> > grossly mess up a workflow if Strings are silently discarded by the
> > backend?
>
> like i said, seq2sparse produces them, and postprocessing for stuff like
> LSA pipelines would not work.

Something as coarse as translating to a dictionary index would probably work. Creating the dictionary in parallel while reading the data should be quite doable.
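A minimal sketch of the dictionary-index idea, assuming each input split can collect its distinct keys independently and the per-split sets are merged afterwards. The function names are hypothetical, and sorting the merged keys is just one way to make the index assignment deterministic.

```python
def build_local_dictionary(keys):
    """Collect the distinct keys seen in one input split."""
    return set(keys)


def merge_dictionaries(local_dicts):
    """Merge per-split key sets and give each key a dense int index."""
    all_keys = sorted(set().union(*local_dicts))
    return {key: index for index, key in enumerate(all_keys)}


# Two splits read independently (in parallel, in a real backend).
splits = [["doc-b", "doc-a"], ["doc-c", "doc-a"]]
dictionary = merge_dictionaries(build_local_dictionary(s) for s in splits)

# The backend stores only ints; the inverse map restores the strings
# for postprocessing.
encoded = [[dictionary[key] for key in split] for split in splits]
inverse = {index: key for key, index in dictionary.items()}
decoded = [[inverse[index] for index in split] for split in encoded]
```

After this, `decoded` equals `splits`, so the string keys survive a round trip through an int-only backend as long as the dictionary is kept alongside the data.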
Re: H2O integration - intermediate progress update
On Wed, Jun 18, 2014 at 11:47 AM, Anand Avati wrote:

> Supporting Int and Long keys are easy, both should be working shortly.
> String is tricky, as H2O stores only numbers. One suggestion has been to
> break up the string into bytes and store them as separate columns (and
> re-assemble them on demand). I'll look into String support after finishing
> the operators.
>
> How important are the String row keys for the algorithms itself? Would it
> grossly mess up a workflow if Strings are silently discarded by the
> backend?

Like I said, seq2sparse produces them, and postprocessing for stuff like LSA pipelines would not work.
Re: H2O integration - intermediate progress update
Supporting Int and Long keys is easy; both should be working shortly. String is tricky, as H2O stores only numbers. One suggestion has been to break up the string into bytes and store them as separate columns (and re-assemble them on demand). I'll look into String support after finishing the operators.

How important are the String row keys for the algorithms themselves? Would it grossly mess up a workflow if Strings are silently discarded by the backend?
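The bytes-as-columns suggestion could look roughly like this. It is a sketch under assumptions, not H2O code: the fixed column width, the zero padding, the use of floats as the stored number type, and both function names are illustrative choices.

```python
def encode_key(key, width):
    """Encode a string as `width` numeric columns (UTF-8, zero-padded)."""
    raw = key.encode("utf-8")
    if len(raw) > width:
        raise ValueError("key is longer than the fixed column width")
    if 0 in raw:
        raise ValueError("NUL bytes are reserved for padding")
    return [float(b) for b in raw] + [0.0] * (width - len(raw))


def decode_key(columns):
    """Re-assemble the string, dropping the zero padding."""
    raw = bytes(int(c) for c in columns if c != 0.0)
    return raw.decode("utf-8")


row_key_columns = encode_key("ab", 4)
round_trip = decode_key(encode_key("héllo", 8))
```

A fixed width keeps every row the same shape at the cost of capping key length, which is the main trade-off of this encoding compared with a dictionary index.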
Re: H2O integration - intermediate progress update
Supporting Int and String keys is perhaps the minimum set (Long is welcome, but a second-class citizen).

Support for DrmLike[Int] is required for a lot of things (e.g. transpose). DrmLike[String] is used in the outputs of popular vectorizations in Mahout such as seq2sparse.
Re: H2O integration - intermediate progress update
This, at first look, is seriously cool.

I took the liberty of opening a preview PR just to be able to track your work in a more visible way. All commits you make will be visible there, and all comments anybody makes will be reflected to JIRA and the mailing list.

-d
Re: H2O integration - intermediate progress update
Very cool to hear that!
Re: H2O integration - intermediate progress update
Very cool, Anand.

Very exciting as it makes the multi-engine story make much more sense.

On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati wrote:

> Still incomplete, everything does NOT work. But lots of progress and end is
> in sight.
>
> - Development happening at
>   https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
>   doing lots of commit --amend and git push --force as this is my private
>   tree.
>
> - Ground level build issues and classloader incompatibilities fixed.
>
> - Can load a matrix into H2O either from in core (through drmParallelize())
>   or HDFS (parser does not support seqfile yet)
>
> - Only Long type support for Row Keys so far.
>
> - mapBlock() works. This was the trickiest, other ops seem trivial in
>   comparison.
>
> Everything else yet to be done. However I will be putting in more time into
> this over the coming days (was working less than part time on this so far.)
>
> Questions/comments welcome.