BTW what level of difficulty would making the DSL run on MLlib Vectors and 
RowMatrix be? I'm looking at using their hashing TF-IDF, but it raises an 
impedance mismatch between DRM and MLlib RowMatrix. This would further 
reduce artifact size by a good deal.
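For what it's worth, the core of the mismatch is that a RowMatrix is a bag of unkeyed rows while a DRM is keyed by row. A minimal sketch of the per-row mapping, with the MLlib and DRM row types stubbed out as plain Scala stand-ins (all names here are hypothetical, not the real APIs; real code would map over the backing RDDs):

```scala
// Sketch only: stand-ins for an MLlib dense vector row and a keyed DRM row.
case class MllibRow(values: Array[Double])          // stand-in for mllib Vector
case class DrmRow(key: Long, values: Array[Double]) // stand-in for a keyed DRM row

// RowMatrix has no row keys, so going to a DRM means synthesizing Long keys,
// e.g. via zipWithIndex on the backing RDD.
def rowMatrixToDrm(rows: Seq[MllibRow]): Seq[DrmRow] =
  rows.zipWithIndex.map { case (r, i) => DrmRow(i.toLong, r.values.clone()) }

// Going back just drops the keys, ordering by key first since a DRM's
// partitions need not arrive sorted.
def drmToRowMatrix(drm: Seq[DrmRow]): Seq[MllibRow] =
  drm.sortBy(_.key).map(r => MllibRow(r.values.clone()))
```

The key-synthesis step is the lossy part: round-tripping DRM -> RowMatrix -> DRM preserves row order but not arbitrary row keys.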

Also, backing something like a DRM with DStreams. Periodic model recalculation 
over streams is maybe the first step towards truly streaming algorithms. I'm 
looking at DStream -> DRM conversion for A’A, A’B, and AA’ in item and row 
similarity. Attach Kafka and you get evergreen models, if not incrementally 
updating ones.
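The reason A’A fits streaming well: it decomposes into a sum of per-row outer products, so each micro-batch can simply be added into a running Gramian rather than recomputing the model. A sketch of just that update rule in plain Scala (no DStream machinery; assumes dense rows of a fixed width):

```scala
// A'A = sum over rows a_i of the outer product a_i * a_i'.
// Each micro-batch contributes its rows' outer products, so the Gramian
// can be kept as running state and never recomputed from scratch.
def addBatchToGramian(gram: Array[Array[Double]],
                      batch: Seq[Array[Double]]): Array[Array[Double]] = {
  for (row <- batch; i <- row.indices; j <- row.indices)
    gram(i)(j) += row(i) * row(j)
  gram
}

var gram = Array.fill(2, 2)(0.0)
// Two "micro-batches" of rows of A = [[1,2],[3,4]]:
gram = addBatchToGramian(gram, Seq(Array(1.0, 2.0)))
gram = addBatchToGramian(gram, Seq(Array(3.0, 4.0)))
// gram now holds A'A
```

In a DStream setting the same accumulation would run per RDD in the stream; A’B works the same way with the cross outer product of co-keyed rows.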

On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Bottom line: compile-time dependencies are satisfied with no extra stuff
from mr-legacy or its transitives. This is proven by successful
compilation with no dependency on mr-legacy in the tree.

Runtime sufficiency without extra dependencies is proven by running the shell
or embedded tests (unit tests), which also succeed. This covers the
embedding and shell APIs.

The issue with Guava is a typical one: if it were a real problem, I wouldn't
be able to compile and/or run anything. Now the question is what we do if
drivers want extra stuff that is not found in Spark.

It is so nice not to depend on anything extra that I am hesitant to offer
anything here. Either shading or a lib with an opt-in dependency policy
would suffice, though, since it doesn't look like we'd need tons of extras
for the drivers.



On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I vaguely remember there being a Guava version problem where the version
> had to be rolled back in one of the hadoop modules. The math-scala
> IndexedDataset shouldn’t care about version.
> 
> BTW it seems pretty easy to take out the option parser and replace it with
> match and tuples, especially if we can extend the Scala App class. It might
> actually simplify things, since I can then use several case classes to hold
> options (scopt needed one object), which in turn takes out all those ugly
> casts. I’ll take a look next time I’m in there.
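A rough sketch of what that scopt replacement could look like: one case class per driver for the options, and a recursive match over the raw args (the flags and fields here are made up for illustration, not the actual driver options):

```scala
// Hypothetical options holder replacing scopt's single config object;
// each driver could define its own case class like this.
case class DriverOpts(input: String = "", output: String = "", threshold: Double = 0.5)

// Recursive pattern match over the raw args; unknown flags fail fast.
def parseArgs(args: List[String], opts: DriverOpts = DriverOpts()): DriverOpts =
  args match {
    case Nil                        => opts
    case "--input" :: v :: rest     => parseArgs(rest, opts.copy(input = v))
    case "--output" :: v :: rest    => parseArgs(rest, opts.copy(output = v))
    case "--threshold" :: v :: rest => parseArgs(rest, opts.copy(threshold = v.toDouble))
    case flag :: _                  => sys.error(s"Unrecognized option: $flag")
  }
```

Because each driver gets its own typed case class, the fields come out already typed and the casts scopt's shared option map required go away.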
> 
> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> In the 'spark' module it is overridden by the Spark dependency, which, as it
> happens, comes at the same version, so it should be fine with 1.1.x.
> 
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> 
> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
> 
>> Looks like it is also requested by mahout-math; I wonder what is using it
>> there.
>> 
>> At the very least, it needs to be synchronized to the version currently
>> used by Spark.
>> 
>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>> 
>> 
>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com>
> wrote:
>> 
>>> Looks like Guava is in Spark.
>>> 
>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> 
>>> IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this
>>> would not be included, since I think it was taken from the mr-legacy jar.
>>> 
>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>>> 
>>> ---------- Forwarded message ----------
>>> From: "Pat Ferrel" <p...@occamsmachete.com>
>>> Date: Jan 25, 2015 9:39 AM
>>> Subject: Re: Codebase refactoring proposal
>>> To: <dev@mahout.apache.org>
>>> Cc:
>>> 
>>>> When you get a chance a PR would be good.
>>> 
>>> Yes, it would. And not just for that.
>>> 
>>>> As I understand it you are putting some class jars somewhere in the
>>> classpath. Where? How?
>>>> 
>>> 
>>> /bin/mahout
>>> 
>>> (It computes two different classpaths. See 'bin/mahout classpath' vs.
>>> 'bin/mahout -spark'.)
>>> 
>>> If I interpret the current shell code there correctly, the legacy path
>>> tries to use the examples assemblies if not packaged, or /lib if packaged.
>>> The true motivation for that significantly predates 2010, and I suspect
>>> only Benson knows the whole intent there.
>>> 
>>> The Spark path, which is really a quick hack of the script, tries to pick
>>> up only selected Mahout jars and the locally installed Spark classpath,
>>> which I guess is just the shaded Spark jar in recent Spark releases. It
>>> also apparently tries to include /libs/*, which is never compiled in the
>>> unpackaged version; I now think including it is a bug, because /libs/* is
>>> apparently legacy packaging and shouldn't be used in Spark jobs with a
>>> wildcard. I can't believe how lazy I am; I still have not found time to
>>> understand the Mahout build in all cases.
>>> 
>>> I am not even sure if packaged Mahout will work with Spark, honestly,
>>> because of the /lib. I never tried that, since I mostly use application
>>> embedding techniques.
>>> 
>>> The same solution may apply to adding external dependencies and removing
>>> the assembly in the Spark module, which would leave only one major build
>>> issue AFAIK.
>>>> 
>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>> 
>>>> No, no PR; only an experiment on a private branch. But I believe I
>>>> sufficiently defined what I want to do in order to gauge whether we may
>>>> want to advance it some time later. The goal is a much lighter dependency
>>>> footprint for the Spark code: eliminate everything that is not
>>>> compile-time dependent (and a lot of it comes through legacy MR code,
>>>> which we of course don't use).
>>>> 
>>>> Can't say I understand the remaining issues you are talking about,
>>>> though.
>>>> 
>>>> If you are talking about compiling a lib or shaded assembly, no, this
>>>> doesn't do anything about it. Although the point is that, as it stands,
>>>> the algebra and shell don't have any external dependencies but Spark and
>>>> these 4 (5?) Mahout jars, so they technically don't even need an assembly
>>>> (as demonstrated).
>>>> 
>>>> As I said, it seems the driver code is the only one that may need some
>>>> external dependencies, but that's a different scenario from the ones I am
>>>> talking about. I am relatively happy with having the first two working
>>>> nicely at this point.
>>>> 
>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> +1
>>>>> 
>>>>> Is there a PR? You mention a “tiny mahout-hadoop” module. It would be
>>> nice
>>>>> to see how you’ve structured that in case we can use the same model to
>>>>> solve the two remaining refactoring issues.
>>>>> 1) external dependencies in the spark module
>>>>> 2) no spark or h2o in the release artifacts.
>>>>> 
>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>>>>> 
>>>>> Also +1
>>>>> 
>>>>> iPhone'd
>>>>> 
>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com>
> wrote:
>>>>>> 
>>>>>> +1
>>>>>> 
>>>>>> 
>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>> 
>>>>>> -------- Original message --------
>>>>>> From: Dmitriy Lyubimov <dlie...@gmail.com>
>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: Codebase refactoring proposal
>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>> I did a quick refactoring and it turns out it only _irrevocably_ depends
>>>>>> on the following classes there:
>>>>>> 
>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
>>>>>> ... *sigh* o.a.m.common.Pair
>>>>>> 
>>>>>> So I just dropped those five classes into a new tiny mahout-hadoop
>>>>>> module (to signify stuff that is directly relevant to serializing things
>>>>>> to the DFS API) and completely removed mr-legacy and its transitives
>>>>>> from the spark and spark-shell dependencies.
>>>>>> 
>>>>>> So non-CLI applications (shell scripts and embedded API use) actually
>>>>>> only need Spark dependencies (which come from the SPARK_HOME classpath,
>>>>>> of course) and Mahout jars (mahout-spark, mahout-math(-scala),
>>>>>> mahout-hadoop, and optionally mahout-spark-shell for running the shell).
>>>>>> 
>>>>>> This of course still doesn't address the driver problem of wanting to
>>>>>> throw more stuff into the front-end classpath (such as a CLI parser),
>>>>>> but at least it renders the transitive luggage of mr-legacy (and the
>>>>>> size of worker-shipped jars) much more tolerable.
>>>>>> 
>>>>>> How does that sound?
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
