This sounds like a great idea, but I wonder if we can get rid of the Mahout DRM 
as a native format. If we have DataFrame-backed DRMs (have they actually renamed 
SchemaRDD?), we ideally don't need Mahout-native DRMs or IndexedDatasets, right? 
This would be a huge step! If we get data interchangeability with MLlib, it's a 
win. If we get general row and column IDs that follow the data through the math, 
it's a win. We still need to think through how to use a DataFrame in a streaming 
case, probably through some checkpointing of the windowed DStream; a rough 
sketch of that idea follows.
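
A minimal sketch of that windowed-recalc idea, assuming plain Spark Streaming; 
the socket source, window sizes, and checkpoint path are placeholders, and the 
DRM wrapping is left as a comment since that conversion doesn't exist yet:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EvergreenModel extends App {
  val conf = new SparkConf().setAppName("evergreen-model")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint("/tmp/mahout-stream-ckpt") // keeps the window lineage bounded

  val interactions = ssc.socketTextStream("localhost", 9999)

  // Re-run the batch math over a sliding 10-minute window, every minute.
  interactions.window(Seconds(600), Seconds(60)).foreachRDD { rdd =>
    if (rdd.take(1).nonEmpty) {
      // Wrap the window's contents as a DRM here and redo the batch
      // computation, e.g. A'A for item similarity (drmA.t %*% drmA).
    }
  }

  ssc.start()
  ssc.awaitTermination()
}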

On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap....@outlook.com> wrote:


On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest considering this: remember all that talk about
> language-integrated Spark QL being basically a DataFrame-manipulation DSL?
> 
> Now the Spark devs are noticing this generality as well and are actually
> proposing to rename SchemaRDD to DataFrame and make it a mainstream data
> structure. (my "told you so" moment of sorts :)
> 
> What I am getting at: I'd suggest making DRM and Spark's newly renamed
> DataFrame our two major structures. In particular, standardize on using
> DataFrame for things that may include non-numerical data and require more
> grace about column naming and manipulation. Maybe relevant to the TF-IDF
> work when it deals with non-matrix content.
Sounds like a worthy effort to me. We'd basically be implementing an API at the 
math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> Seems like seq2sparse would be really easy to replace, since it starts from
>> text files; the whole pipeline could then be kept in RDDs. The dictionaries
>> and counts could be either in-memory maps or RDDs for use with joins. This
>> would get rid of sequence files completely from the pipeline. Item
>> similarity uses in-memory maps, but the plan is to make it more scalable
>> using joins as an alternative behind the same API, letting the user trade
>> off footprint for speed.

I think you're right; it should be relatively easy. I've been looking at porting 
seq2sparse to the DSL for a bit now, and the blocker at the DSL level is that we 
don't have a distributed data structure for strings. Getting a DataFrame 
implemented, as Dmitriy mentioned above, would take care of this problem.
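
To make Pat's suggestion concrete, here is a minimal sketch of a seq2sparse-style 
pipeline kept entirely in plain Spark RDDs (all names illustrative): the 
dictionary stays distributed rather than in an in-memory map, and vectorization 
happens via a join, so no sequence files are involved.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// docs: (docId, text) pairs; returns (docId, sparse (termIndex, count) list).
def vectorize(docs: RDD[(String, String)]): RDD[(String, Seq[(Long, Int)])] = {
  // Naive tokenization into (docId, term) pairs.
  val tokens = docs.flatMap { case (docId, text) =>
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(term => (docId, term))
  }

  // The dictionary kept as an RDD: term -> index.
  val dictionary: RDD[(String, Long)] = tokens.map(_._2).distinct().zipWithIndex()

  // Per-document term counts, joined against the dictionary.
  tokens.map { case (docId, term) => ((docId, term), 1) }
    .reduceByKey(_ + _)
    .map { case ((docId, term), tf) => (term, (docId, tf)) }
    .join(dictionary)
    .map { case (_, ((docId, tf), termIndex)) => (docId, (termIndex, tf)) }
    .groupByKey()
    .mapValues(_.toSeq)
}

Swapping the dictionary between an RDD join and a broadcast in-memory map is 
exactly the footprint-vs-speed trade-off mentioned above.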

The other issue I'm a little fuzzy on is the distributed collocation mapping; 
it's a part of the seq2sparse code that I've not spent much time in.

I think this would be a very worthy effort as well; I believe seq2sparse is a 
particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring 
proposal.
>> 
>> My use for TF-IDF is for row similarity: it would take a DRM (actually an
>> IndexedDataset) and calculate row/doc similarities. It works now, but only
>> using LLR. This is OK when thinking of the items as tags or metadata, but
>> for text tokens something like cosine may be better.
>> 
>> I’d imagine a downsampling phase using LLR that would precede TF-IDF, a lot
>> like how CF preferences are downsampled. This would produce a sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search, but
>> it should produce much better similarities than Solr’s “moreLikeThis” and
>> does it for all pairs rather than one at a time.
>> 
>> In any case it can be used to create a personalized content-based
>> recommender, or to augment a CF recommender with one more indicator type.
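
A rough sketch of the cosine step Pat describes, in the Mahout DSL; it assumes 
the LLR downsampling and TF-IDF reweighting have already produced the sparsified 
all-docs DRM (drmA below), so cosine reduces to AA' over row-normalized rows:

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

def rowCosineSimilarity(drmA: DrmLike[Int]): DrmLike[Int] = {
  // Scale each row to unit length so the entries of AA' are cosines.
  val drmNorm = drmA.mapBlock() { case (keys, block) =>
    for (r <- 0 until block.nrow) {
      val v = block(r, ::)
      val n = v.norm(2)
      if (n > 0) block(r, ::) := (v / n)
    }
    keys -> block
  }
  drmNorm %*% drmNorm.t // AA': all-pairs doc similarity
}
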
>> 
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>> 
>> 
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>> Some issues WRT lower level Spark integration:
>>>> 1) interoperability with Spark data. TF-IDF is one example I actually
>> looked at. There may be other things we can pick up from their committers
>> since they have an abundance.
>>>> 2) wider acceptance of the Mahout DSL. The DSL’s power was illustrated to
>> me when someone on the Spark list asked about matrix transpose and an MLlib
>> committer’s answer was something like “why would you want to do that?”.
>> Usually you don’t actually execute the transpose, but they don’t even
>> support A’A, AA’, or A’B, which are core to what I work on (see the sketch
>> after this list). At present you pretty much have to choose between MLlib
>> and Mahout for sparse matrix work. Maybe a half-way measure is some implicit
>> conversions (ugh, I know). If the DSL could interchange datasets with MLlib,
>> people would be pointed to the DSL for a bunch of those “why would you want
>> to do that?” features. MLlib seems to be algorithms, not math.
>>>> 3) integration of Streaming. DStreams support most of the RDD
>> interface. Doing a batch recalc on a moving time window would nearly fall
>> out of DStream-backed DRMs. This isn’t the same as incremental updates on
>> streaming but it’s a start.
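
The expressions from point 2 really are one-liners in the DSL; a quick 
illustration (the optimizer plans A'A and friends with dedicated physical 
operators, so no transpose is materialized):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

def gramians(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
  val ata = drmA.t %*% drmA // A'A
  val aat = drmA %*% drmA.t // AA'
  val atb = drmA.t %*% drmB // A'B
  (ata, aat, atb)           // lazy plans; nothing runs until an action
}
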
>>>> Last year we were looking at Hadoop MapReduce vs. the faster compute
>> engines: Spark, H2O, Flink. So we jumped. Now the need is for streaming,
>> and especially incrementally updated streaming. Seems like we need to
>> address this.
>>>> Andrew, regardless of the above, having TF-IDF would be super
>> helpful; row similarity for content/text would benefit greatly.
>>>   I will put a PR up soon.
>> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and
>> the Weight interface over from mr-legacy to math-scala. They're available
>> now in spark-shell but won't be after this refactoring. These still require
>> a dictionary and frequency-count maps to vectorize incoming text, so
>> they're more for use with the old MR seq2sparse; I don't think they can be
>> used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully
>> they'll be of some use.
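
For reference, a sketch of how the ported weighting might be used, assuming the 
Weight interface keeps its legacy shape; the maps and the helper below are 
illustrative, not the actual ported API:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

class TFIDF extends Weight {
  // Lucene-style weighting, as in the legacy class: sqrt(tf) * smoothed idf.
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    math.sqrt(tf) * (math.log(numDocs.toDouble / (df + 1)) + 1.0)
}

// Vectorize one document given dictionary and document-frequency maps.
def tfidfVector(termCounts: Map[String, Int],
                dictionary: Map[String, Int],
                dfMap: Map[String, Int],
                numDocs: Int,
                weight: Weight): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  val docLength = termCounts.values.sum
  for ((term, tf) <- termCounts; idx <- dictionary.get(term))
    v.setQuick(idx, weight.calculate(tf, dfMap.getOrElse(term, 1), docLength, numDocs))
  v
}
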
>> 
>> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>> But first I need to do massive fixes and improvements to the distributed
>>>> optimizer itself. Still waiting on green light for that.
>>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
>>>> 
>>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
>>>>>> BTW, how difficult would it be to make the DSL run on MLlib Vectors
>>>>> and RowMatrix? I'm looking at using their hashing TF-IDF, but it raises
>>>>> an impedance mismatch between DRM and MLlib's RowMatrix. This would
>>>>> further reduce artifact size by a bunch.
>>>>> 
>>>>> Short answer: if it were possible, I'd not bother with the Mahout code
>>>>> base at all. The problem is that it lacks sufficiently flexible
>>>>> semantics and abstraction. Breeze is infinitely better in that
>>>>> department, but at the time it was sufficiently worse at abstracting
>>>>> interoperability of matrices with different structures. And MLlib does
>>>>> not expose Breeze.
>>>>> 
>>>>> Looking forward to the hardware-accelerated bolt-on work, I must say
>>>>> that after reading the Breeze code for some time, I still have a much
>>>>> clearer plan for how such backend hybridization and cost calibration
>>>>> might work with the current Mahout math abstractions than with Breeze.
>>>>> It is also more in line with my current work tasks.
>>>>> 
>>>>>> Also: backing something like a DRM with DStreams. Periodic model recalc
>>>>> over streams is maybe the first step towards truly streaming algos. I'm
>>>>> looking at DStream -> DRM conversion for A’A, A’B, and AA’ in item and
>>>>> row similarity. Attach Kafka and you get evergreen models, if not
>>>>> incrementally updating ones.
>>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>>>>>> Bottom line: compile-time dependencies are satisfied with no extra
>>>>>> stuff from mr-legacy or its transitives. This is proven by successful
>>>>>> compilation with no dependency on mr-legacy in the tree.
>>>>>> 
>>>>>> Runtime sufficiency without extra dependencies is proven by running the
>>>>>> shell or the embedded tests (unit tests), which are successful too.
>>>>>> This covers the embedding and shell APIs.
>>>>>> 
>>>>>> The issue with Guava is a typical one: if it were a real problem, I
>>>>>> wouldn't be able to compile and/or run anything. Now the question is
>>>>>> what we do if drivers want extra stuff that is not found in Spark.
>>>>>> 
>>>>>> It is so nice not to depend on anything extra that I am hesitant to
>>>>>> offer anything here. Either shading or a lib with an opt-in dependency
>>>>>> policy would suffice, though, since it doesn't look like we'd need tons
>>>>>> of extras for drivers.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com>
>>>>> wrote:
>>>>>>> I vaguely remember there being a Guava version problem where the
>>>>> version
>>>>>>> had to be rolled back in one of the hadoop modules. The math-scala
>>>>>>> IndexedDataset shouldn’t care about version.
>>>>>>> 
>>>>>>> BTW, it seems pretty easy to take out the option parser and replace it
>>>>>>> with match and tuples, especially if we can extend the Scala App
>>>>>>> class. It might actually simplify things, since I could then use
>>>>>>> several case classes to hold options (scopt needed one object), which
>>>>>>> in turn takes out all those ugly casts. I'll take a look next time I'm
>>>>>>> in there.
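
A minimal sketch of the scopt-free approach Pat describes (option names and 
case classes are illustrative): one case class per option group, and a 
recursive match over the argument list, usable from an object extending 
scala.App.

case class IOOptions(input: String = "", output: String = "")
case class AlgoOptions(maxSimilaritiesPerItem: Int = 100)

def parse(args: List[String], io: IOOptions, algo: AlgoOptions): (IOOptions, AlgoOptions) =
  args match {
    case "--input" :: path :: rest  => parse(rest, io.copy(input = path), algo)
    case "--output" :: path :: rest => parse(rest, io.copy(output = path), algo)
    case "--maxSims" :: n :: rest   => parse(rest, io, algo.copy(maxSimilaritiesPerItem = n.toInt))
    case Nil                        => (io, algo)
    case unknown :: _               => sys.error("Unrecognized option: " + unknown)
  }

// Usage: val (io, algo) = parse(args.toList, IOOptions(), AlgoOptions())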
>>>>>>> 
>>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>> wrote:
>>>>>>> In the 'spark' module it is overridden by the Spark dependency, which
>>>>>>> happens to come at the same version, so it should be fine with 1.1.x.
>>>>>>> 
>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
>>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
>>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
>>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
>>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
>>>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
>>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
>>>>>>> [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
>>>>>>> [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
>>>>>>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
>>>>>>> [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
>>>>>>> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
>>>>>>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
>>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
>>>>>>> [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
>>>>>>> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
>>>>>>> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
>>>>>>> [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
>>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
>>>>>>> [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
>>>>>>> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
>>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
>>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
>>>>>>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
>>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
>>>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
>>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>>>>>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
>>>>>>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
>>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
>>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
>>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
>>>>>>> [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
>>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
>>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>>>>>>> 
>>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Looks like it is also requested by mahout-math; I wonder what is
>>>>>>>> using it there.
>>>>>>>> 
>>>>>>>> At the very least, it needs to be synchronized with the one currently
>>>>>>>> used by Spark.
>>>>>>>> 
>>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
>>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
>>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
>>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
>>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
>>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
>>>>>>>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
>>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
>>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com>
>>>>>>> wrote:
>>>>>>>>> Looks like Guava is in Spark.
>>>>>>>>> 
>>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com>
>>>>> wrote:
>>>>>>>>> IndexedDataset uses Guava. Can’t tell for sure, but it sounds like
>>>>>>>>> this would not be included, since I think it was taken from the
>>>>>>>>> mr-legacy jar.
>>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: "Pat Ferrel" <p...@occamsmachete.com>
>>>>>>>>> Date: Jan 25, 2015 9:39 AM
>>>>>>>>> Subject: Re: Codebase refactoring proposal
>>>>>>>>> To: <dev@mahout.apache.org>
>>>>>>>>> Cc:
>>>>>>>>> 
>>>>>>>>>> When you get a chance a PR would be good.
>>>>>>>>> Yes, it would. And not just for that.
>>>>>>>>> 
>>>>>>>>>> As I understand it you are putting some class jars somewhere in
>> the
>>>>>>>>> classpath. Where? How?
>>>>>>>>> /bin/mahout
>>>>>>>>> 
>>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>>>>>>>>> 'bin/mahout -spark'.)
>>>>>>>>> 
>>>>>>>>> If I interpret the current shell code there correctly, the legacy
>>>>>>>>> path tries to use the examples assemblies if not packaged, or /lib
>>>>>>>>> if packaged. The true motivation for that significantly predates
>>>>>>>>> 2010, and I suspect only Benson knows the whole intent there.
>>>>>>>>> 
>>>>>>>>> The Spark path, which is really a quick hack of the script, tries to
>>>>>>>>> get only selected Mahout jars plus the locally installed Spark
>>>>>>>>> classpath, which I guess is just the shaded Spark jar in recent
>>>>>>>>> Spark releases. It also apparently tries to include /libs/*, which
>>>>>>>>> is never compiled in the unpackaged version; I now think including
>>>>>>>>> it is a bug, because /libs/* is apparently legacy packaging and
>>>>>>>>> shouldn't be used in Spark jobs with a wildcard. I can't believe how
>>>>>>>>> lazy I am; I still haven't found time to understand the Mahout build
>>>>>>>>> in all cases.
>>>>>>>>> 
>>>>>>>>> Honestly, I am not even sure if packaged Mahout will work with
>>>>>>>>> Spark, because of the /lib. I've never tried it, since I mostly use
>>>>>>>>> application-embedding techniques.
>>>>>>>>> 
>>>>>>>>> The same solution may apply to adding external dependencies and
>>>>>>>>> removing the assembly in the Spark module, which would leave only
>>>>>>>>> one major build issue AFAIK.
>>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>> No, no PR. Only a private experiment. But I believe I defined what
>>>>>>>>>> I want to do sufficiently well to gauge whether we may want to
>>>>>>>>>> advance it some time later. The goal is a much lighter dependency
>>>>>>>>>> set for the Spark code: eliminate everything that is not a
>>>>>>>>>> compile-time dependency (and a lot of that comes in through legacy
>>>>>>>>>> MR code which we of course don't use).
>>>>>>>>>> 
>>>>>>>>>> Can't say I understand the remaining issues you are talking about,
>>>>>>>>>> though. If you are talking about compiling a lib or shaded
>>>>>>>>>> assembly, no, this doesn't do anything about that. Although the
>>>>>>>>>> point is, as it stands, the algebra and shell have no external
>>>>>>>>>> dependencies except Spark and these 4 (5?) Mahout jars, so they
>>>>>>>>>> technically don't even need an assembly (as demonstrated).
>>>>>>>>>> 
>>>>>>>>>> As I said, it seems driver code is the only piece that may need
>>>>>>>>>> some external dependencies, but that's a different scenario from
>>>>>>>>>> the ones I am talking about. I am relatively happy with having the
>>>>>>>>>> first two working nicely at this point.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
>> p...@occamsmachete.com>
>>>>>>>>> wrote:
>>>>>>>>>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It
>> would
>>>>> be
>>>>>>>>> nice
>>>>>>>>>>> to see how you’ve structured that in case we can use the same
>>>>> model to
>>>>>>>>>>> solve the two remaining refactoring issues.
>>>>>>>>>>> 1) external dependencies in the spark module
>>>>>>>>>>> 2) no spark or h2o in the release artifacts.
>>>>>>>>>>> 
>>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu>
>>>>> wrote:
>>>>>>>>>>> Also +1
>>>>>>>>>>> 
>>>>>>>>>>> iPhone'd
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com>
>>>>>>> wrote:
>>>>>>>>>>>> +1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>>>>>>>> 
>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>> From: Dmitriy Lyubimov <dlie...@gmail.com>
>>>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
>>>>>>>>>>>> To: dev@mahout.apache.org
>>>>>>>>>>>> Subject: Codebase refactoring proposal
>>>>>>>>>>>> So right now mahout-spark depends on mr-legacy.
>>>>>>>>>>>> I did a quick refactoring, and it turns out it only _irrevocably_
>>>>>>>>>>>> depends on the following classes there:
>>>>>>>>>>>>
>>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
>>>>>>>>>>>> and ... *sigh* ... o.a.m.common.Pair
>>>>>>>>>>>>
>>>>>>>>>>>> So I just dropped those five classes into a new tiny mahout-hadoop
>>>>>>>>>>>> module (to signify stuff that is directly relevant to serializing
>>>>>>>>>>>> things to the DFS API) and completely removed mr-legacy and its
>>>>>>>>>>>> transitives from the spark and spark-shell dependencies.
>>>>>>>>>>>> 
>>>>>>>>>>>> So non-CLI applications (shell scripts and embedded API use)
>>>>>>>>>>>> actually only need the Spark dependencies (which come from the
>>>>>>>>>>>> SPARK_HOME classpath, of course) and the Mahout jars: mahout-spark,
>>>>>>>>>>>> mahout-math(-scala), mahout-hadoop, and optionally
>>>>>>>>>>>> mahout-spark-shell (for running the shell).
>>>>>>>>>>>> 
>>>>>>>>>>>> This of course still doesn't address drivers that want to throw
>>>>>>>>>>>> more stuff onto the front-end classpath (such as a CLI parser),
>>>>>>>>>>>> but at least it renders the transitive baggage of mr-legacy (and
>>>>>>>>>>>> the size of worker-shipped jars) much more tolerable.
>>>>>>>>>>>> 
>>>>>>>>>>>> How does that sound?
>>>>>>>>> 
>> 
>> 

