Re: 0xdata interested in contributing

2014-03-16 Thread SriSatish Ambati
Accuracy and speed are important factors in any experiment because of a simple fact of life: time is a non-renewable resource. Researchers, data modelers and data scientists all have limited time budgets for experimentation. R is slow. R has lots of packages. This is not to be taken as a causal infe

Re: 0xdata interested in contributing

2014-03-16 Thread Ted Dunning
On Sun, Mar 16, 2014 at 9:20 AM, Pat Ferrel wrote: > The last I heard from Ted he was proposing new committers who had lots to > contribute. I did not hear from him that this would mean h2o in the roadmap > for all of Mahout. > Indeed. I was discussing what people thought about the contribution

Re: 0xdata interested in contributing

2014-03-16 Thread Pat Ferrel
The last I heard from Ted he was proposing new committers who had lots to contribute. I did not hear from him that this would mean h2o in the roadmap for all of Mahout. This then leaves the question: *What IS in the roadmap?* I can only vote from relative technical ignorance but with some know

Re: 0xdata interested in contributing

2014-03-16 Thread Pat Ferrel
So your Mahout DRM work was targeted for production at your company and was working well but other parts of the project fell through and it didn’t get deployed. Some of it is almost a year old and pretty mature. --This is very good news. You are also saying that the integration model you used

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
Pat, sorry for going off-topic -- this code is actually about a year old at heart. I was using it to run some custom methods back in my company but I had to largely reshape it to fit Mahout once I got permission to contribute. So this took a while, but the idea is certainly not new. At least parts of this

Re: 0xdata interested in contributing

2014-03-14 Thread Pat Ferrel
Cool, I’m super excited to see RSJ on Spark integrated into the mainline with Dmitriy’s work. I really really hope that it is seen as important and doesn’t get stalled by committers being demotivated. I had no idea that what I consider the heart of Mahout was so close to being real on Spark.

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
http://www.youtube.com/watch?v=CDP6NayO1yM search to 13:20 On Fri, Mar 14, 2014 at 1:24 PM, Dmitriy Lyubimov wrote: > I am sorry for saying this -- I just feel I am entitled to an opinion. > This is exactly the style of api in Hadoop and Mahout that allows Evan > Sparks to make his very convin

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
I am sorry for saying this -- I just feel I am entitled to an opinion. This is exactly the style of api in Hadoop and Mahout that allows Evan Sparks to make his very convincing talk points. On Fri, Mar 14, 2014 at 12:54 PM, SriSatish Ambati wrote: > H2O's unknown only to become known. All of us

Re: 0xdata interested in contributing

2014-03-14 Thread Ted Dunning
Pat I am not suggesting that we walk away from anything. I am suggesting that we welcome new contributors who have some interesting capabilities. I also suggest that those efforts should be made to work well with existing efforts. Sent from my iPhone > On Mar 14, 2014, at 10:58, Pat Fer

Re: 0xdata interested in contributing

2014-03-14 Thread Tom Kraljevic
[ Please allow me to introduce myself to everyone, I'm Tom Kraljevic with 0xdata. ] Hi Pat, You raise some good questions here, please let me add to the discussion. > If Mahout moves to another faster better execution engine it will do so only > once in the immediate future. This conve

Re: 0xdata interested in contributing

2014-03-14 Thread Ted Dunning
I think comparison would be a great thing. Sent from my iPhone > On Mar 14, 2014, at 10:38, Pat Ferrel wrote: > > Why are we talking about resetting this effort without at least comparing > actual running code and that would put the proof on 0xdata. If it really is > 100x faster then that’s

Re: 0xdata interested in contributing

2014-03-14 Thread SriSatish Ambati
H2O's unknown only to become known. All of us have watched every open source phenomenon, including the successful ones, go through that phase. Linux, Apache, Hadoop, even, up to recently, Spark were all targets of fear and uncertainty. I'm a fan of Spark and Matei's relentless pursuit over the years. Quantitati

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
Well, in preliminary discussion I already mentioned how two trends may co-exist at the cost of fragmentation of Mahout. It is definitely possible to have 3 sets of "stuff" -- map reduce only, unified ML environment with mapping to Spark and Stratosphere, and set of solvers tightly coupled to 0xdata eng

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
Praise belongs to Dmitriy, The sparkbindings was his work, not mine. On 03/14/2014 06:58 PM, Pat Ferrel wrote: Not at all. The effect on the community is exactly what I’m most worried about. The effect on community will be far worse if we reset based on architectural talk alone. It involves no

Re: 0xdata interested in contributing

2014-03-14 Thread Pat Ferrel
Not at all. The effect on the community is exactly what I’m most worried about. The effect on community will be far worse if we reset based on architectural talk alone. It involves not just Mahout’s community but Spark’s and 0xdata’s. I think people (including me) have underestimated how much y

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
On Fri, Mar 14, 2014 at 10:38 AM, Pat Ferrel wrote: > So with very little work we could have RSJ, Matrix ops, SSVD+PCA running > on Spark in the mainline of Mahout? Honestly? What makes you doubt? There's a unit test there that runs it in local mode. Good benchmarking is what it lacks of course

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
I think you still miss the point. The performance comparison will be a function of performance comparison of Mahout's in-core algebra to that of something else. Most likely as it stands it will not be better than 0xdata, but it will not be 100x either and it will be entirely a function of the in-c

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
We already have Matrix ops + SSVD running on Spark in the mainline of Mahout. On 03/14/2014 06:38 PM, Pat Ferrel wrote: So with very little work we could have RSJ, Matrix ops, SSVD+PCA running on Spark in the mainline of Mahout? Honestly? I assume this would involve using Dmitriy’s DRM as in

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
We have Matrix ops, SSVD+PCA running on Spark in the mainline of Mahout. On 03/14/2014 06:38 PM, Pat Ferrel wrote: So with very little work we could have RSJ, Matrix ops, SSVD+PCA running on Spark in the mainline of Mahout? Honestly? I assume this would involve using Dmitriy’s DRM as input to

Re: 0xdata interested in contributing

2014-03-14 Thread Pat Ferrel
So with very little work we could have RSJ, Matrix ops, SSVD+PCA running on Spark in the mainline of Mahout? Honestly? I assume this would involve using Dmitriy’s DRM as input to the RSJ? These are the things I use in Mahout (along with kmeans). I know it’s worth very little but I would use th

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
I did a port recently. It doesn't use Dmitriy's DSL, it's just the plain Spark API. https://gist.github.com/sscdotopen/8314254 On 03/14/2014 06:14 PM, Pat Ferrel wrote: Isn’t there some work on RSJ on Spark? Can we compare that to something 0xdata can “knock off”? On Mar 14, 2014, at 10:08

Re: 0xdata interested in contributing

2014-03-14 Thread Pat Ferrel
Isn’t there some work on RSJ on Spark? Can we compare that to something 0xdata can “knock off”? On Mar 14, 2014, at 10:08 AM, Sebastian Schelter wrote: Dmitriy, I share a lot of your concerns expressed here. I hear more complaints about Mahout being too inaccessible and too hard to customize fo

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
On Fri, Mar 14, 2014 at 9:39 AM, Pat Ferrel wrote: > Love the architectural discussion but sometimes the real answers can be > hidden by minutiae. > > Dmitriy, is there enough running on Spark to compare to a DRM > implementation on H2O? 0xdata, go ahead and implement DRM on H2O. If "the > proof

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
Dmitriy, I share a lot of your concerns expressed here. I hear more complaints about Mahout being too inaccessible and too hard to customize for use cases and inputs than complaints about it being too slow. I also concur with your analysis that the clear and accessible programming model is

Re: 0xdata interested in contributing

2014-03-14 Thread Cliff Click
On 3/14/2014 12:13 AM, Sebastian Schelter wrote: (1) Efficient execution of iterative programs. In Hadoop, every iteration must be scheduled as a separate job, rereads invariant data and materializes its result to hdfs. Therefore, iterative programs on Hadoop are an order of magnitude slower

Re: 0xdata interested in contributing

2014-03-14 Thread Pat Ferrel
Love the architectural discussion but sometimes the real answers can be hidden by minutiae. Dmitriy, is there enough running on Spark to compare to a DRM implementation on H2O? 0xdata, go ahead and implement DRM on H2O. If “the proof is in the pudding” why not compare? We really ARE betting

Re: 0xdata interested in contributing

2014-03-14 Thread Dmitriy Lyubimov
> I think that the proposal under discussion involves adding a dependency on > a maven released h2o artifact plus a contribution of Mahout translation > layers. These layers would give a sub-class of Matrix (and Vector) which > allow direct control over life span across multiple jobs but would > o

Re: 0xdata interested in contributing

2014-03-14 Thread Ted Dunning
I can answer some of these. If we need more, others can chime in. On Fri, Mar 14, 2014 at 12:13 AM, Sebastian Schelter wrote: > Hi, > > to me one problem is that I couldn't find documentation that gives a > comprehensive picture of the programming and execution model of h2o. Documentation is

Re: 0xdata interested in contributing

2014-03-14 Thread Sebastian Schelter
Hi, to me one problem is that I couldn't find documentation that gives a comprehensive picture of the programming and execution model of h2o. I'd like to get answers to the following questions: Which operators does it offer, how those are combined to create programs and how are those program

Re: 0xdata interested in contributing

2014-03-13 Thread Sri
Dmitriy, H2O is about bringing better algorithms to big data and now to Mahout. Users will be able to use, access and extend the sophisticated high precision and richly featured algorithms. We are a bit at a loss (puzzled) at the comparisons with general purpose computing platforms - our core vision a

Re: 0xdata interested in contributing

2014-03-13 Thread Andrew Musselman
I think this could truly be the year of Mahout on the desktop! :) As to usability, we're building tools to improve the usability -> help -> fixes workflow, including parsers for the IRC history logs and for the user and dev mailing lists, which feed into jobs to automate the FAQs for the project.

Re: 0xdata interested in contributing

2014-03-13 Thread SriSatish Ambati
Nice to meet you, Andrew. Thanks for your hand wave. We are enjoying the warmth of a passionate community. I only meant it to emphasize the fact that Mahout is the most popular entry point for ML users. There are several sophisticated algorithms in Mahout. Users love them. A fan of quite a few of them: ssv

Re: 0xdata interested in contributing

2014-03-13 Thread Andrew Musselman
Thanks Sri; nice to meet you and thanks for the conversation. When you say "hello world" I presume you're emphasizing that Mahout is a popular entry point for people seeking to join the field, rather than its being simple or easy to pick up. We've been talking about ways to make Mahout easier to

Re: 0xdata interested in contributing

2014-03-13 Thread SriSatish Ambati
Mahout is the hello world of Machine Learning. It's still the first place many new users get exposed to algorithms on big data. Making that experience beautiful, accessible and value-driven will make machine learning ubiquitous and Mahout a movement to rival the success & utility of say, lucene and

Re: 0xdata interested in contributing

2014-03-13 Thread Dmitriy Lyubimov
yes, sorry, i misread it. On Thu, Mar 13, 2014 at 7:01 PM, Severs, Chris wrote: > To be clear, I meant what Ted said here. Mahout has some great stuff in > it that I would love to see merged with H2O. > > > On Mar 13, 2014, at 6:24 PM, "Ted Dunning" > wrote: > > > >> On Thu, Mar 13, 2014 at 6

Re: 0xdata interested in contributing

2014-03-13 Thread Severs, Chris
To be clear, I meant what Ted said here. Mahout has some great stuff in it that I would love to see merged with H2O. > On Mar 13, 2014, at 6:24 PM, "Ted Dunning" wrote: > >> On Thu, Mar 13, 2014 at 6:08 PM, Dmitriy Lyubimov wrote: >> >> At least one of your users here reported it does not m

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 6:08 PM, Dmitriy Lyubimov wrote: > At least one of your users here reported it does not make sense to run > Mahout on all this, and at least two of us have trouble seeing how such > disassembly and reassembly might take place. What are your thoughts on > this? > How clearl

Re: 0xdata interested in contributing

2014-03-13 Thread Dmitriy Lyubimov
PS and of course it all sounds like a well rounded project that exceeds current Mahout capabilities (in mapreduce world anyway). So not the least question is why are you seeking integration with Mahout. Clearly that would involve significant effort to do some things Mahout way. So what's the motiva

Re: 0xdata interested in contributing

2014-03-13 Thread Dmitriy Lyubimov
Thank you, Cliff. Those things are pretty much clear. Most of the questions were more along the lines of which of those wonderful things you intend to port to Mahout, and how you see these stitching in with the existing Mahout architecture. At least one of your users here reported it does not make sense

Re: 0xdata interested in contributing

2014-03-13 Thread Cliff Click
There have been a lot of questions on the H2O architecture; I hope to answer the top-level ones here. H2O is a fast & flexible engine. We talk about the MapReduce execution flavor, because it's easy to explain, because it covers a lot of ground, and because we've implemented a bunch of dense

Re: 0xdata interested in contributing

2014-03-13 Thread Dmitriy Lyubimov
> > (b) Mahout provides in-core support for matrices, and perhaps, data > > frames, > > to run both in front and back as needed. > > > > Sounds good. And in accord with the 0xdata proposal. > Is there a written form of such proposal touching concrete architectural details? It would help to cut of

Re: 0xdata interested in contributing

2014-03-13 Thread Sebastian Schelter
On 03/13/2014 09:49 PM, Ted Dunning wrote: > >(4) Couple days of work to throw in Stratosphere primitives. > Likewise. If the stratosphere community would like to step up to help with this, I would champion that contribution as well. I'm sure this is well received in the Stratosphere commun

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 1:09 PM, Dmitriy Lyubimov wrote: > I guess current philosophy, as I have been seeing it till this moment: > > (a) Mahout is nothing but a translation layer with respect to backend > primitives. > Well, there ought to be some implementation above this translation layer. W

Re: 0xdata interested in contributing

2014-03-13 Thread Severs, Chris
Hi Pat, I've tried Mahout previously both through the MR apis and later picking and choosing pieces to use inside Scalding. I've also used H2O a bit recently. As an end user, I don't see any reason to use Mahout over H2O right now aside from the nice Matrix and Vector primitives that Mahout prov

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 1:09 PM, Dmitriy Lyubimov wrote: > > The proof is in the pudding, I think. The 0xdata team think that they > can > knock out a Mahout matrix and vector data type pretty quickly. They also > think that the SSVD algorithm will follow from that pretty > straightforwardly. >

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 1:09 PM, Dmitriy Lyubimov wrote: > > with numerical computing in Mahout because the problems are different. > (To > > my mind, the key problems for numerical computing include: > > > > a) efficient, very fine-grained parallelism (think microseconds) > > > > b) efficient i

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
I don't think that it is Spark vs h2o. They do different things. Spark (and Drill and Tez and Impala and Stratosphere) do what map-reduce wanted to do. H2o does math. And I don't think we are betting our future. I think we are letting some contributors show us their chops and hopefully make Ma

Re: 0xdata interested in contributing

2014-03-13 Thread Dmitriy Lyubimov
Ted, This isn't going to fix what is wrong > with numerical computing in Mahout because the problems are different. (To > my mind, the key problems for numerical computing include: > > a) efficient, very fine-grained parallelism (think microseconds) > > b) efficient in-memory mutable storage

Re: 0xdata interested in contributing

2014-03-13 Thread Pat Ferrel
Has anyone used 0xdata before? They are new to me. If this is betting Mahout’s future on h2o vs Spark, is everyone convinced that’s the right choice? Does Mahout warrant or need its own next gen fast parallel platform? Does this mean supporting something akin to Spark as part of Mahout or is h2o

Re: 0xdata interested in contributing

2014-03-13 Thread Grant Ingersoll
+1. Happy to help get it migrated! On Mar 12, 2014, at 8:44 PM, Ted Dunning wrote: > I have been working with a company named 0xdata to help them contribute > some new software to Mahout. This software will give Mahout the ability to > do highly iterative in-memory mathematical computations o

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 2:10 AM, Sebastian Schelter wrote: > @All > > I have one big question regarding h2o (maybe SriSatish can help me with > that). I haven't been able to find a detailed writeup about the execution > model yet, but on first sight it seems like a big aggregation tree to me: > D

Re: 0xdata interested in contributing

2014-03-13 Thread Ted Dunning
On Thu, Mar 13, 2014 at 2:10 AM, Sebastian Schelter wrote: > One idea that I had would be to take h2o's super optimized operators and > marry them with a dataflow system. What do people think of that? > That is exactly what I was suggesting when I said that these systems are complementary.

Re: 0xdata interested in contributing

2014-03-13 Thread Sebastian Schelter
Hi, Lots of questions from my side here. @Ted At first a comment about your point that dataflow systems cannot do efficient in-memory mutable storage: Stratosphere's delta-iterate operator [1] supports iterative dataflows where the solution is held in an in-memory index and updated in every

Re: 0xdata interested in contributing

2014-03-12 Thread Ted Dunning
There have been several comments here and elsewhere that allude to whether providing h2o based components would conflict with Spark or other bases for systems. As I see it, Spark, Stratosphere, Tez and Drill are trying to fix Hadoop's map-reduce layer in various ways. This isn't going to fix what

Re: 0xdata interested in contributing

2014-03-12 Thread Manoj Awasthi
This sounds very exciting and enriching! Look forward to the developments on this front. On Thu, Mar 13, 2014 at 11:16 AM, SriSatish Ambati wrote: > We are excited at the possibilities of this convergence. > > A fan of Mahout's vision and how it captured the imagination of machine > learning e

Re: 0xdata interested in contributing

2014-03-12 Thread SriSatish Ambati
We are excited at the possibilities of this convergence. A fan of Mahout's vision and how it captured the imagination of machine learning enthusiasts over the years. (Still fondly recollect Isabel's spirited talk at ApacheCon years ago!) We found that a real product, hacker and an open source d

Re: 0xdata interested in contributing

2014-03-12 Thread Ted Dunning
Inline. Sent from my iPhone > On Mar 12, 2014, at 18:14, Ray Melton wrote: > >> On 03/12/2014 05:44 PM, Ted Dunning wrote: >> I have been working with a company named 0xdata to help them contribute > ... >> *Required Additional Work* >> >> Sparse matrices >> Linear algebra bindings >> Class-f

Re: 0xdata interested in contributing

2014-03-12 Thread Andrew Musselman
Sounds like a large positive step; looking forward to hearing more! > On Mar 12, 2014, at 5:44 PM, Ted Dunning wrote: > > I have been working with a company named 0xdata to help them contribute > some new software to Mahout. This software will give Mahout the ability to > do highly iterative in

Re: 0xdata interested in contributing

2014-03-12 Thread Ray Melton
On 03/12/2014 05:44 PM, Ted Dunning wrote: I have been working with a company named 0xdata to help them contribute ... *Required Additional Work* Sparse matrices Linear algebra bindings Class-file magic to allow off-the-cuff function definitions What kind of additional work is required in t

0xdata interested in contributing

2014-03-12 Thread Ted Dunning
I have been working with a company named 0xdata to help them contribute some new software to Mahout. This software will give Mahout the ability to do highly iterative in-memory mathematical computations on a cluster or a single machine. This software also comes with high performance distributed im