I don't know why. I said I didn't see either as a problem; as far as I am
concerned, I had encountered both needs in the past and did not even notice
they were a problem. Neither is relevant to this thread. Not sure. I'd suggest
starting a separate thread.
Speaking of my priorities, the two biggest problems I
OK, well perhaps those two lines of code (actually I agree, there's not much
more) can also be applied to TF-IDF and several other algorithms to get a much
higher level of interoperability and keep us from reinventing things when not
necessary. Funny we have type conversions for so many things *
The conversion from a DRM to an RDD of vectors for k-means is one line. Applying
k-means and converting back is another line. I actually did that some time ago;
I am sure you can figure out the details.
As to whether it is worth retaining some commonality: no, it is not worth it
until there's commonality acro
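To make the shape of that round trip concrete, here is a minimal sketch. A local Seq of keyed rows stands in for the distributed DRM/RDD so it runs without Spark; the object and helper names are hypothetical illustrations, not Mahout or MLlib API.

```scala
// Hedged sketch: DRM -> plain row vectors -> (clustering would happen
// here) -> keyed rows again. Seq[(Int, Array[Double])] stands in for a
// DRM / RDD[(Int, Vector)]; all names below are made up for illustration.
object DrmRoundTripSketch {
  type Row = Array[Double]

  // the "one line" out: drop the keys, keep the row vectors
  def drmToRows(drm: Seq[(Int, Row)]): Seq[Row] =
    drm.map { case (_, row) => row }

  // the "one line" back: re-key the rows (cluster assignments in practice)
  def rowsToDrm(rows: Seq[Row]): Seq[(Int, Row)] =
    rows.zipWithIndex.map { case (row, i) => (i, row) }
}
```

With Spark, the same two lines would be maps over the DRM's underlying RDD instead of a Seq, with MLlib's clustering call in between.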
I completely understand that MLlib lacks anything like the completeness of
Mahout's DSL; I know of no other scalable solution to match it. I don’t know how
many times this has to be said. This is something we can all get behind as
*unique* to Mahout.
But I stand by the statement that there should
Pat,
I *just* made a case in this thread explaining that mllib does not have a
single distributed matrix type and that its own methods do not interoperate
with each other for that reason. Therefore, it is fundamentally impossible
to be interoperable with mllib, since nobody really can define
w
Why aren’t we using linalg.Vector and its siblings? The same could be asked for
linalg.Matrix. If we want to prune dependencies this would help and would also
significantly increase interoperability.
Case in point: I have a real need to cluster items in a CF-type input matrix. The
input matrix A’ ha
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan wrote:
> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to Mahout codebase as it is
> propos
From my own perspective:
I’m not aware of any rule to make all operations agnostic. In fact, several
engine-specific exceptions are discussed in this long email. We’ve talked about
reduce or join operations that would be difficult to make agnostic without a
lot of knowledge of ALL other engines.
What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?
i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?
ii) should we have [engine]-ml modules (like spark-bi
I took it that Gokhan had objections himself, based on his comments, if we are
talking about #62.
He also expressed concerns about computing GSGD, but I suspect it can still
be computed algebraically.
On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel wrote:
> BTW Ted and Andrew have both expressed interest
BTW Ted and Andrew have both expressed interest in the distributed aggregation
stuff. It sounds like we are agreeing that non-algebra, computation-method-type
things can be engine-specific.
So does anyone have an objection to Gokhan pushing his PR?
On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov w
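For readers skimming the thread, the "distributed aggregation stuff" being debated is roughly this shape: a non-algebraic operation whose contract could live in math-scala while each engine supplies its own implementation. A hedged sketch with a trivial local backend; the trait and names are hypothetical, not Mahout API.

```scala
// Hedged sketch of an engine-specific aggregator behind a common trait.
// Seq stands in for a distributed collection; names are made up.
trait DistributedAggregator[K, V, A] {
  def aggregate(data: Seq[(K, V)])(zero: A)(seq: (A, V) => A, comb: (A, A) => A): A
}

// A local, single-partition backend: folds values with `seq` only.
// A real engine would fold per partition and merge partials with `comb`.
object LocalAggregator extends DistributedAggregator[Int, Double, Double] {
  def aggregate(data: Seq[(Int, Double)])(zero: Double)(
      seq: (Double, Double) => Double, comb: (Double, Double) => Double): Double =
    data.map(_._2).foldLeft(zero)(seq)
}
```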
On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo wrote:
>
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
>
Yeah. +1. I would like to support that as an experiment, see where it goes.
Clearly some distributed use cases are simple enough
But also keep in mind that the Flink folks are eager to allocate resources for
ML work. So maybe that's the way to work it -- create a DataFrame-based
seq2sparse port and then just hand it off to them to add to either Flink
directly (but with DRM output), or as a part of Mahout.
On Wed, Feb 4, 2015 at
Spark's DataFrame is obviously not agnostic.
I don't believe there's a good way to abstract it. Unfortunately. I think
getting too much into distributed operation abstraction is a bit dangerous.
I think MLI was one project that attempted to do that -- but it did not
take off, I guess. Or at least
On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
there since they go beyond the scope of that PR's work to chase the
root of the issue.
On quasi-algebraic methods:
What is the dilemma here? I don't see any.
I already explained that no more tha
IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this
would not be included since I think it was taken from the mrlegacy jar.
On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote:
---------- Forwarded message ----------
From: "Pat Ferrel"
Date: Jan 25, 2015 9:39 AM
org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
[INFO] |  \- com.tdunni
---------- Forwarded message ----------
From: "Pat Ferrel"
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To:
Cc:
> When you get a chance a PR would be good.
Yes, it would. And not just for that.
> As I understand it you are putting some class jars somewhere in the
> classpath. Where? How?
/bin/mahout
(Computes 2 different classpaths. See
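On the question of where the class jars land: a launcher script typically assembles its classpath by listing the jars in a lib directory. Here is a minimal sketch of that pattern; the directory layout and jar names are made up for illustration and this is not Mahout's actual /bin/mahout logic.

```scala
// Sketch: build a colon-separated classpath from every jar in a lib
// directory, the way a launcher script might. Names are hypothetical.
import java.io.File
import java.nio.file.Files

object ClasspathSketch {
  // collect *.jar files, sort for determinism, join with ':'
  def classpathOf(libdir: File): String =
    libdir.listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)
      .sorted
      .mkString(":")

  def main(args: Array[String]): Unit = {
    val libdir = Files.createTempDirectory("lib").toFile
    new File(libdir, "mahout-math-1.0-SNAPSHOT.jar").createNewFile()
    new File(libdir, "mahout-hadoop-1.0-SNAPSHOT.jar").createNewFile()
    println(classpathOf(libdir))
  }
}
```

The "2 different classpaths" mentioned above would presumably just be two such lists built from different jar sets, one per launch mode.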
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile

On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote:
> Looks like Guava is in Spark.
When you get a chance a PR would be good. As I understand it you are putting
some class jars somewhere in the classpath. Where? How? The same solution may
apply to adding external dependencies and removing the assembly in the Spark
module, which would leave only one major build issue afaik.
On
No, no PR. Only an experiment in private. But I believe I sufficiently defined
what I want to do in order to gauge whether we may want to advance it some time
later. The goal is a much lighter dependency footprint for the Spark code:
eliminate everything that is not compile-time dependent (and a lot of it is
through legacy MR c
+1
Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice to
see how you’ve structured that in case we can use the same model to solve the
two remaining refactoring issues.
1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.
On Jan 23, 2
Also +1
iPhone'd
> On Jan 23, 2015, at 18:38, Andrew Palumbo wrote:
>
> +1
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Dmitriy Lyubimov
> Date: 01/23/2015 6:06 PM (GMT-05:00)
> To: dev@mahout.apache.org
> Subject: Codebase refactoring proposal
+1
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
From: Dmitriy Lyubimov
Date: 01/23/2015 6:06 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Codebase refactoring proposal
So right now mahout-spark depends on mr-legacy.
I did a quick refactoring and it
Sorry, I meant _without_ mrlegacy on the classpath.
On Fri, Jan 23, 2015 at 3:31 PM, Dmitriy Lyubimov wrote:
> And in case anyone wonders yes shell starts and runs test script totally
> fine with mrlegacy dependency on classpath (startup script modified to use
> mahout-hadoop instead) -- both in loc
And in case anyone wonders yes shell starts and runs test script totally
fine with mrlegacy dependency on classpath (startup script modified to use
mahout-hadoop instead) -- both in local and distributed (standalone) mode:
$ MASTER=spark://localho
+1
On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov wrote:
> So right now mahout-spark depends on mr-legacy.
> I did a quick refactoring and it turns out it only _irrevocably_ depends on
> the following classes there:
>
> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...