Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
I don't know why. I said I didn't see either as a problem, as far as I am concerned. I had encountered both needs in the past and did not even notice it was a problem. Neither is relevant to this thread. Not sure. I'd suggest starting a separate thread. Speaking of my priorities, two biggest problems i

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
OK, well perhaps those two lines of code (actually I agree, there’s not much more) can also be applied to TF-IDF and several other algorithms to get a much higher level of interoperability and keep us from reinventing things when not necessary. Funny we have type conversions for so many things *

Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
The conversion from DRM to RDD of vectors for k-means is one line. Applying k-means and converting back is another line. I actually did that some time ago. I am sure you can figure out the details. As for whether it is worth retaining some commonality: no, it is not worth it until there's commonality acro

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
I completely understand that MLlib lacks anything like the completeness of Mahout's DSL; I know of no other scalable solution to match it. I don’t know how many times this has to be said. This is something we can all get behind as *unique* to Mahout. But I stand by the statement that there should

Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
Pat, I *just* made a case in this thread explaining that MLlib does not have a single distributed matrix type and that its own methodologies do not interoperate within itself for that reason. Therefore, it is fundamentally impossible to be interoperable with MLlib since nobody really can define w

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
Why aren’t we using linalg.Vector and its siblings? The same could be asked for linalg.Matrix. If we want to prune dependencies this would help and would also significantly increase interoperability. Case-now: I have a real need to cluster items in a CF type input matrix. The input matrix A’ ha

Re: Codebase refactoring proposal

2015-02-05 Thread Dmitriy Lyubimov
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan wrote: > What I am saying is that for certain algorithms including both > engine-specific (such as aggregation) and DSL stuff, what is the best way > of handling them? > > i) should we add the distributed operations to Mahout codebase as it is > propos

Re: Codebase refactoring proposal

2015-02-05 Thread Pat Ferrel
From my own perspective: I’m not aware of any rule to make all operations agnostic. In fact, several engine-specific exceptions are discussed in this long email. We’ve talked about reduce or join operations that would be difficult to make agnostic without a lot of knowledge of ALL other engines.

Re: Codebase refactoring proposal

2015-02-05 Thread Gokhan Capan
What I am saying is that for certain algorithms including both engine-specific (such as aggregation) and DSL stuff, what is the best way of handling them? i) should we add the distributed operations to Mahout codebase as it is proposed in #62? ii) should we have [engine]-ml modules (like spark-bi

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
I took it that Gokhan had objections himself, based on his comments, if we are talking about #62. He also expressed concerns about computing GSGD, but I suspect it can still be computed algebraically. On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel wrote: > BTW Ted and Andrew have both expressed interest

Re: Codebase refactoring proposal

2015-02-04 Thread Pat Ferrel
BTW Ted and Andrew have both expressed interest in the distributed aggregation stuff. It sounds like we are agreeing that non-algebra—computation method type things can be engine specific. So does anyone have an objection to Gokhan pushing his PR? On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov w

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo wrote: > > > > My thought was not to bring primitive engine-specific aggregators, > combiners, etc. into math-scala. > Yeah. +1. I would like to support that as an experiment, see where it goes. Clearly some distributed use cases are simple enough

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
But also keep in mind that Flink folks are eager to allocate resources for ML work. So maybe that's the way to work it -- create a DataFrame-based seq2sparse port and then just hand it off to them to add to either Flink directly (but with DRM output), or as a part of Mahout. On Wed, Feb 4, 2015 at

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
Spark's DataFrame is obviously not agnostic. Unfortunately, I don't believe there's a good way to abstract it. I think getting too much into distributed operation abstraction is a bit dangerous. I think MLI was one project that attempted to do that, but it did not take off, I guess, or at least

Re: Codebase refactoring proposal

2015-02-04 Thread Andrew Palumbo
On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote: Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there since they go beyond the scope of that PR's work to chase the root of the issue. On quasi-algebraic methods: what is the dilemma here? d

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there since they go beyond the scope of that PR's work to chase the root of the issue. On quasi-algebraic methods: what is the dilemma here? I don't see any. I already explained that no more tha

Re: Codebase refactoring proposal

2015-02-04 Thread Pat Ferrel
>>>>>>> org.glassfish.external:management-api:jar:3.0.0-b012:compile >>>>>>>>>> [INFO] | | | | | | +- >>>>>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile >>>>>>>

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
>> > >>>>>>> org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile >> > >>>>>>> [INFO] | | | | | | | \- >> > >>>>>>> org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile >>

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
[INFO] | | | | | | | \- stax:stax-api:jar:1.0.1:compile > > >>>>>>> [INFO] | | | | | | +- > > >>>>> com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile > > >>>>>>> [INFO] | | | | | | | \- > > >>>

Re: Codebase refactoring proposal

2015-02-04 Thread Suneel Marthi
>>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile > >>>>>>> [INFO] | | | | \- > >>>>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile > >>>>>>> [INFO] | | | \- >

Re: Codebase refactoring proposal

2015-02-04 Thread Pat Ferrel
] | | | \- >>>>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile >>>>>>> [INFO] | | +- >>>>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile >>>>>>> [INFO] | | \- >> org.apa

Re: Codebase refactoring proposal

2015-02-04 Thread Andrew Palumbo
xedDataset uses Guava. Can’t tell for sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: -- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 A

Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
pile > >>>>> [INFO] | | | \- > org.apache.curator:curator-client:jar:2.4.0:compile > >>>>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile > >>>>> [INFO] | | \- jline:jline:jar:0.9.94:compile > >>>>>

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
r:8.1.14.v20131031:compile >>>>> [INFO] | | +- >>>>> >>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile >>>>> [INFO] | | +- >>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile >>>>

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
rote: IndexedDataset uses Guava. Can’t tell for sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: ---------- Forwarded message ------ From: "Pat Ferrel" Date: Jan 25, 2015 9:39

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
ataset uses Guava. Can’t tell for sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: -- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subje

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
31031:compile >>> [INFO] | | \- > org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile >>> [INFO] | | \- > org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile >>> [INFO] | +- com.google.guava:guava:jar:16.0:compile >>> d >>

Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
out:mahout-hadoop:jar:1.0-SNAPSHOT > >> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* > >> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile > >> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* > >> [INFO] | \- com.tdunni

Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
> > [INFO] | | \- > org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile > > > [INFO] | +- com.google.guava:guava:jar:16.0:compile > > > d > > > > > > On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov > > > wrote: > > > >

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
-- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: When you get a chance a PR would be good. Yes, it would. And not just for that. As I understand it you are putting some class jars somewhere in the classpath. Where? How? /bi

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
0-SNAPSHOT:test >> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> >> >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel > wrote: >> >>> Looks like Guava is in Spark. >>

Re: Codebase refactoring proposal

2015-02-02 Thread Dmitriy Lyubimov
test > > [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile > > [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile > > > > > > On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel > wrote: > > > >> Looks like Guava is in Spark. > >>

Re: Codebase refactoring proposal

2015-01-31 Thread Pat Ferrel
ould not be included since I think it was taken from the mrlegacy jar. >> >> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: >> >> -- Forwarded message -- >> From: "Pat Ferrel" >> Date: Jan 25, 2015 9:39 AM >> Subject

Re: Codebase refactoring proposal

2015-01-30 Thread Dmitriy Lyubimov
would not be included since I think it was taken from the mrlegacy jar. >> >> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: >> >> -- Forwarded message -- >> From: "Pat Ferrel" >> Date: Jan 25, 2015 9:39 AM >> Subject: Re: Code

Re: Codebase refactoring proposal

2015-01-30 Thread Dmitriy Lyubimov
aset uses Guava. Can’t tell for sure but it sounds like this > would not be included since I think it was taken from the mrlegacy jar. > > On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: > > -- Forwarded message -- > From: "Pat Ferrel" > Date

Re: Codebase refactoring proposal

2015-01-30 Thread Pat Ferrel
: -- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: > When you get a chance a PR would be good. Yes, it would. And not just for that. > As I understand it you are putting some class jars somewhere in the class

Re: Codebase refactoring proposal

2015-01-29 Thread Pat Ferrel
AM Subject: Re: Codebase refactoring proposal To: Cc: > When you get a chance a PR would be good. Yes, it would. And not just for that. > As I understand it you are putting some class jars somewhere in the classpath. Where? How? > /bin/mahout (Computes 2 different classpaths. See

Re: Codebase refactoring proposal

2015-01-25 Thread Dmitriy Lyubimov
-- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: > When you get a chance a PR would be good. Yes, it would. And not just for that. >As I understand it you are putting some class jars

Re: Codebase refactoring proposal

2015-01-25 Thread Pat Ferrel
When you get a chance a PR would be good. As I understand it you are putting some class jars somewhere in the classpath. Where? How? The same solution may apply to adding external dependencies and removing the assembly in the Spark module. Which would leave only one major build issue afaik. On

Re: Codebase refactoring proposal

2015-01-24 Thread Dmitriy Lyubimov
No, no PR. Only an experiment on private. But I believe I sufficiently defined what I want to do in order to gauge if we may want to advance it some time later. The goal is a much lighter dependency for Spark code: eliminate everything that is not a compile-time dependency. (And a lot of it is through legacy MR c

Re: Codebase refactoring proposal

2015-01-24 Thread Pat Ferrel
+1 Is there a PR? You mention a “tiny mahout-hadoop” module. It would be nice to see how you’ve structured that in case we can use the same model to solve the two remaining refactoring issues. 1) external dependencies in the spark module 2) no spark or h2o in the release artifacts. On Jan 23, 2

Re: Codebase refactoring proposal

2015-01-23 Thread Shannon Quinn
Also +1 iPhone'd > On Jan 23, 2015, at 18:38, Andrew Palumbo wrote: > > +1 > > > Sent from my Verizon Wireless 4G LTE smartphone > > Original message From: Dmitriy Lyubimov > Date:01/23/2015 6:06 PM (GMT-05:00) > To: dev@mahout.apache.org Subject: Codebase refactoring

RE: Codebase refactoring proposal

2015-01-23 Thread Andrew Palumbo
+1 Sent from my Verizon Wireless 4G LTE smartphone Original message From: Dmitriy Lyubimov Date:01/23/2015 6:06 PM (GMT-05:00) To: dev@mahout.apache.org Subject: Codebase refactoring proposal So right now mahout-spark depends on mr-legacy. I did a quick refactoring and it

Re: Codebase refactoring proposal

2015-01-23 Thread Dmitriy Lyubimov
Sorry, I meant _without_ mrlegacy on classpath. On Fri, Jan 23, 2015 at 3:31 PM, Dmitriy Lyubimov wrote: > And in case anyone wonders yes shell starts and runs test script totally > fine with mrlegacy dependency on classpath (startup script modified to use > mahout-hadoop instead) -- both in loc

Re: Codebase refactoring proposal

2015-01-23 Thread Dmitriy Lyubimov
And in case anyone wonders yes shell starts and runs test script totally fine with mrlegacy dependency on classpath (startup script modified to use mahout-hadoop instead) -- both in local and distributed (standalone) mode: $ MASTER=spark://localho

Re: Codebase refactoring proposal

2015-01-23 Thread Suneel Marthi
+1 On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov wrote: > So right now mahout-spark depends on mr-legacy. > I did a quick refactoring and it turns out it only _irrevocably_ depends on > the following classes there: > > MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...