On 02/03/2015 12:22 PM, Pat Ferrel wrote:
Some issues WRT lower level Spark integration:
1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers, since they have an abundance of them.
2) wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix work. Maybe a half-way measure is some implicit conversions (ugh, I know); see the first sketch below. If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of those "why would you want to do that?" features. MLlib seems to be algorithms, not math.
3) integration of Streaming. DStreams support most of the RDD interface, so a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs (second sketch below). This isn't the same as incremental updates on streaming, but it's a start.
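
For (2), here is roughly what I mean by implicit conversions, as a sketch only. It assumes Mahout's drmWrap and the MLlib 1.1-era RowMatrix; names and wiring are illustrative and untested:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object MLlibInterop {
  // Hypothetical implicit: RowMatrix -> Int-keyed DRM (transpose wants Int keys).
  // Assumes the row count fits in an Int; densifies rows for simplicity.
  implicit def rowMatrixToDrm(m: RowMatrix): DrmLike[Int] = {
    val rows = m.rows.zipWithIndex().map { case (v, i) =>
      i.toInt -> (new DenseVector(v.toArray): MahoutVector)
    }
    drmWrap(rows)
  }
}

// With that in scope, the "why would you want to do that?" ops are one-liners:
//   val ata = rowMatrix.t %*% rowMatrix   // A'A
//   val atb = rowMatrixA.t %*% drmB       // A'B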
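
And for (3), the moving-window recalc idea, again just a sketch with made-up wiring; it assumes the stream already carries Int-keyed Mahout vectors:

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream
import org.apache.mahout.math.{Vector => MahoutVector}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

def evergreenAtA(stream: DStream[(Int, MahoutVector)]): Unit = {
  // Recompute A'A over the last 30 minutes of rows, sliding every 5 minutes.
  stream.window(Minutes(30), Minutes(5)).foreachRDD { rdd =>
    if (rdd.count() > 0) {
      val drmA = drmWrap(rdd)
      val ata = (drmA.t %*% drmA).checkpoint()
      // swap in the recomputed ("evergreen") model here
    }
  }
}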

Last year we were looking at Hadoop MapReduce vs. the faster compute engines: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. It seems like we need to address this.

Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.

   I will put a PR up soon.

On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

But first I need to do massive fixes and improvements to the distributed
optimizer itself. Still waiting on green light for that.
On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
BTW what level of difficulty would making the DSL run on MLlib Vectors and RowMatrix be? Looking at using their hashing TF-IDF, but it raises an impedance mismatch between DRM and MLlib RowMatrix. This would further reduce artifact size by a bunch.

Short answer: if it were possible, I'd not bother with the Mahout code base at all. The problem is it lacks sufficient flexibility, semantics and abstraction. Breeze is infinitely better in that department, but at the time it was significantly worse at abstracting interoperability of matrices with different structures. And MLlib does not expose Breeze.

Looking forward to the hardware-accelerated bolt-on work, I just must say that after reading Breeze code for some time, I still have a much clearer plan for how such back-end hybridization and cost calibration might work with the current Mahout math abstractions than with Breeze. It is also more in line with my current work tasks.

Also: backing something like a DRM with DStreams. Periodic model recalc with streams is maybe the first step towards truly streaming algos. Looking at DStream -> DRM conversion for A'A, A'B, and AA' in item and row similarity. Attach Kafka and get evergreen models, if not incrementally updating models.
On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Bottom line: compile-time dependencies are satisfied with no extra stuff from mr-legacy or its transitives. This is proven by virtue of successful compilation with no dependency on mr-legacy in the tree.

Runtime sufficiency without extra dependencies is proven by running the shell or embedded tests (unit tests), which are successful too. This covers both the embedding and shell APIs.

The Guava issue is a typical one. If it were a problem, I wouldn't be able to compile and/or run stuff. Now, the question is what we do if drivers want extra stuff that is not found in Spark.

It is so nice not to depend on anything extra that I am hesitant to offer anything here. Either shading or a lib with an opt-in dependency policy would suffice though, since it doesn't look like we'd have to have tons of extras for drivers.



On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
I vaguely remember there being a Guava version problem where the version had to be rolled back in one of the hadoop modules. The math-scala IndexedDataset shouldn't care about the version.

BTW it seems pretty easy to take out the option parser and replace it with match and tuples, especially if we can extend the Scala App class. It might actually simplify things, since I can then use several case classes to hold options (scopt needed one object), which in turn takes out all those ugly casts; something like the sketch below. I'll take a look next time I'm in there.
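
Very roughly, and purely illustrative (driver and option names are made up):

// One case class per option group instead of scopt's single options object.
case class IoOpts(input: String = "", output: String = "")
case class JobOpts(io: IoOpts = IoOpts(), maxPrefs: Int = 500)

object SomeDriver extends App {
  // Recursive match over the arg list; no casts, just typed copies.
  @annotation.tailrec
  def parse(args: List[String], opts: JobOpts): JobOpts = args match {
    case "--input" :: path :: rest  => parse(rest, opts.copy(io = opts.io.copy(input = path)))
    case "--output" :: path :: rest => parse(rest, opts.copy(io = opts.io.copy(output = path)))
    case "--maxPrefs" :: n :: rest  => parse(rest, opts.copy(maxPrefs = n.toInt))
    case Nil                        => opts
    case unknown :: _               => sys.error(s"Unrecognized option: $unknown")
  }
  val opts = parse(args.toList, JobOpts())
  // typed access afterwards: opts.io.input, opts.maxPrefs, ...
}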

On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
In the 'spark' module it is overwritten by the spark dependency, which happens to come at the same version, so it should be fine with 1.1.x:

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
[INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
[INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
[INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  |     \- jline:jline:jar:0.9.94:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] |  +- com.google.guava:guava:jar:16.0:compile

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

Looks like it is also requested by mahout-math; wonder what is using it there.

At the very least, it needs to be synchronized to the one currently used by spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
[INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile


On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
Looks like Guava is in Spark.

On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com>
wrote:
IndexedDataset uses Guava. Can't tell for sure, but it sounds like this would not be included, since I think it was taken from the mrlegacy jar.
On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
---------- Forwarded message ----------
From: "Pat Ferrel" <p...@occamsmachete.com>
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: <dev@mahout.apache.org>
Cc:

When you get a chance a PR would be good.
Yes, it would. And not just for that.

As I understand it you are putting some class jars somewhere in the classpath. Where? How?

/bin/mahout

(Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.)

If I interpret the current shell code there correctly, the legacy path tries to use the examples assemblies if not packaged, or /lib if packaged. The true motivation of that significantly predates 2010, and I suspect only Benson knows the whole intent there.

The spark path, which is really a quick hack of the script, tries to get only selected mahout jars and the locally installed spark classpath, which I guess is just the shaded spark jar in recent spark releases. It also apparently tries to include /libs/*, which is never compiled in the unpackaged version; I now think including it is a bug, because /libs/* is apparently legacy packaging and shouldn't be used in spark jobs with a wildcard. I can't believe how lazy I am; I still did not find time to understand the mahout build in all cases.

I am not even sure if packaged mahout will work with spark, honestly, because of the /lib. Never tried that, since I mostly use application embedding techniques.

The same solution may apply to adding external dependencies and removing the assembly in the Spark module, which would leave only one major build issue afaik.
On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
No, no PR. Only an experiment in private, but I believe I sufficiently defined what I want to do in order to gauge whether we may want to advance it some time later. The goal is a much lighter dependency footprint for spark code: eliminate everything that is not compile-time dependent (and a lot of that comes through legacy MR code which we of course don't use).

Can't say I understand the remaining issues you are talking about, though. If you are talking about compiling lib or a shaded assembly, no, this doesn't do anything about that. Although the point is, as it stands, the algebra and shell don't have any external dependencies but spark and these 4 (5?) mahout jars, so they technically don't even need an assembly (as demonstrated).

As I said, it seems driver code is the only one that may need some external dependencies, but that's a different scenario from the ones I am talking about. I am relatively happy with having the first two working nicely at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
+1

Is there a PR? You mention a "tiny mahout-hadoop" module. It would be nice to see how you've structured that, in case we can use the same model to solve the two remaining refactoring issues:
1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu>
wrote:
Also +1

iPhone'd

On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com>
wrote:
+1


Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Dmitriy Lyubimov <dlie...@gmail.com>
Date: 01/23/2015 6:06 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Codebase refactoring proposal
So right now mahout-spark depends on mr-legacy. I did a quick refactoring, and it turns out it only _irrevocably_ depends on the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ... *sigh* o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop module (to signify stuff that is directly relevant to serializing things to the DFS API) and completely removed mr-legacy and its transitives from the spark and spark-shell dependencies.
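
For context, the kind of use that pins those classes; just an illustration of the serialization surface, not code from the patch:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile}
import org.apache.mahout.math.{DenseVector, VectorWritable}

// Writes one (key, vector) pair the way DRM rows get serialized to DFS.
val conf = new Configuration()
val writer = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("/tmp/drm-part-00000")),
  SequenceFile.Writer.keyClass(classOf[IntWritable]),
  SequenceFile.Writer.valueClass(classOf[VectorWritable]))
writer.append(new IntWritable(0), new VectorWritable(new DenseVector(Array(1.0, 2.0))))
writer.close()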

So non-cli applications (shell scripts and embedded api use) actually only need the spark dependencies (which come from the SPARK_HOME classpath, of course) and the mahout jars: mahout-spark, mahout-math(-scala), mahout-hadoop, and optionally mahout-spark-shell (for running the shell). A minimal embedded app would then look something like the sketch below.
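
Something like this, untested and only to show the shape of it (assumes the sparkbindings context helper):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object EmbeddedAlgebra extends App {
  // Mahout-aware Spark context; classpath is just spark + the mahout jars above.
  implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "embedded-algebra")
  val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)), numPartitions = 2)
  val ata = (drmA.t %*% drmA).collect  // in-core result of A'A
  println(ata)
}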

This of course still doesn't address the driver problem of wanting to throw more stuff onto the front-end classpath (such as a cli parser), but at least it renders the transitive luggage of mr-legacy (and the size of worker-shipped jars) much more tolerable.

How does that sound?




