I believe they are still debating the SchemaRDD -> DataFrame rename. I must
admit Dmitriy suggested this to me a few months ago: reuse SchemaRDD if
possible. Dmitriy was right; "you told us so."

On Wed, Feb 4, 2015 at 11:09 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> This sounds like a great idea, but I wonder if we can get rid of the Mahout
> DRM as a native format. If we have DataFrame-backed DRMs (have they actually
> renamed SchemaRDD?), we ideally don't need Mahout-native DRMs or
> IndexedDatasets, right? This would be a huge step! If we get data
> interchangeability with MLlib, it's a win. If we get general row and column
> IDs that follow the data through the math, it's a win. We still need to think
> through how to use a DataFrame in a streaming case, probably through some
> checkpointing of the windowed DStream.
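A rough sketch of what a DataFrame-backed DRM bridge might look like, assuming only the existing drmWrap entry point in the Spark bindings; the step that turns DataFrame/SchemaRDD rows into keyed Mahout vectors is left out and purely hypothetical:

    import org.apache.mahout.math.drm.CheckpointedDrm
    import org.apache.mahout.sparkbindings._   // drmWrap and the DrmRdd type alias

    // rows: (rowKey, mahoutVector) pairs extracted from the DataFrame/SchemaRDD
    def toDrm(rows: DrmRdd[Int]): CheckpointedDrm[Int] =
      drmWrap(rows)   // row keys survive, so IDs can follow the data through the math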
>
> On Feb 4, 2015, at 7:37 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>
>
> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> > I'd suggest considering this: remember all that talk about
> > language-integrated Spark QL being basically a dataframe-manipulation DSL?
> >
> > So now the Spark devs are noticing this generality as well and are actually
> > proposing to rename SchemaRDD to DataFrame and make it a mainstream data
> > structure (my "told you so" moment of sorts :).
> >
> > What I am getting at: I'd suggest making the DRM and Spark's newly renamed
> > DataFrame our two major structures. In particular, standardize on DataFrame
> > for things that may include non-numerical data and require more grace about
> > column naming and manipulation. That is probably relevant to the TF-IDF work
> > when it deals with non-matrix content.
> Sounds like a worthy effort to me. We'd basically be implementing an API at
> the math-scala level for the SchemaRDD/DataFrame data structure, correct?
>
> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >> Seems like seq2sparse would be really easy to replace since it starts from
> >> text files; the whole pipeline could then be kept in RDDs. The dictionaries
> >> and counts could be either in-memory maps or RDDs used with joins. This
> >> would get rid of sequence files completely from the pipeline.
> >> Item similarity uses in-memory maps, but the plan is to make it more
> >> scalable by offering joins as an alternative behind the same API, letting
> >> the user trade off footprint for speed.
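Something like the following is a bare-bones sketch of that RDD-only pipeline (tokenize, build the dictionary, vectorize against a broadcast map); all names here are illustrative, not the actual seq2sparse port:

    import org.apache.mahout.math.RandomAccessSparseVector
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits on pre-1.3 Spark

    def vectorize(sc: SparkContext, path: String) = {
      val docs = sc.textFile(path).map(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

      // term -> column index; built once and broadcast here, but it could just as
      // well stay an RDD and be joined against for a very large vocabulary
      val dictionary = docs.flatMap(tokens => tokens).distinct().zipWithIndex()
        .mapValues(_.toInt).collectAsMap()
      val dictBc = sc.broadcast(dictionary)

      // one term-frequency vector per document; no sequence files anywhere
      val tfVectors = docs.map { tokens =>
        val v = new RandomAccessSparseVector(dictBc.value.size)
        tokens.groupBy(identity).foreach { case (term, occurrences) =>
          dictBc.value.get(term).foreach(idx => v.setQuick(idx, occurrences.size))
        }
        v
      }
      (dictionary, tfVectors)
    }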
>
> I think you're right; it should be relatively easy. I've been looking at
> porting seq2sparse to the DSL for a bit now, and the blocker at the DSL level
> is that we don't have a distributed data structure for strings. Getting a
> DataFrame implemented as Dmitriy describes above would take care of this
> problem.
>
> The other issue I'm a little fuzzy on is the distributed collocation
> mapping; it's a part of the seq2sparse code that I've not spent much time in.
>
> I think this would be a very worthy effort as well; I believe seq2sparse is a
> particularly strong Mahout feature.
>
> I'll start another thread since we're now way off topic from the
> refactoring proposal.
> >>
> >> My use for TF-IDF is for row similarity: it would take a DRM (actually an
> >> IndexedDataset) and calculate row/doc similarities. It works now, but only
> >> using LLR. This is OK when thinking of the items as tags or metadata, but
> >> for text tokens something like cosine may be better.
> >>
> >> I'd imagine a downsampling phase preceding TF-IDF that uses LLR, a lot like
> >> how CF preferences are downsampled. This would produce a sparsified
> >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
> >> terms before row similarity uses cosine. This is not so good for search,
> >> but it should produce much better similarities than Solr's "moreLikeThis"
> >> and does it for all pairs rather than one at a time.
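For reference, the cosine mentioned here is just the normalized dot product of two re-weighted row vectors; with Mahout vectors a trivial helper would be:

    import org.apache.mahout.math.Vector

    // cosine similarity of two tf-idf weighted document rows
    def cosine(a: Vector, b: Vector): Double = a.dot(b) / (a.norm(2) * b.norm(2))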
> >>
> >> In any case it can be used to create a personalized content-based
> >> recommender, or to augment a CF recommender with one more indicator type.
> >>
> >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:
> >>
> >>
> >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
> >>>> Some issues WRT lower level Spark integration:
> >>>> 1) Interoperability with Spark data. TF-IDF is one example I actually
> >>>> looked at. There may be other things we can pick up from their
> >>>> committers, since they have an abundance of them.
> >>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to
> >>>> me when someone on the Spark list asked about matrix transpose and an
> >>>> MLlib committer's answer was something like "why would you want to do
> >>>> that?". Usually you don't actually execute the transpose, but they don't
> >>>> even support A'A, AA', or A'B, which are core to what I work on (a short
> >>>> DSL sketch follows at the end of this message). At present you pretty
> >>>> much have to choose between MLlib and Mahout for sparse matrix work.
> >>>> Maybe a half-way measure is some implicit conversions (ugh, I know). If
> >>>> the DSL could interchange datasets with MLlib, people would be pointed to
> >>>> the DSL for a whole bunch of "why would you want to do that?" features.
> >>>> MLlib seems to be algorithms, not math.
> >>>> 3) Integration of streaming. DStreams support most of the RDD interface,
> >>>> so doing a batch recalc on a moving time window would nearly fall out of
> >>>> DStream-backed DRMs. This isn't the same as incremental updates on a
> >>>> stream, but it's a start.
> >>>> Last year we were looking at Hadoop MapReduce vs. the faster compute
> >>>> engines: Spark, H2O, Flink. So we jumped. Now the need is for streaming,
> >>>> especially incrementally updated streaming. Seems like we need to address
> >>>> this.
> >>>> Andrew, regardless of the above, having TF-IDF would be super helpful;
> >>>> row similarity for content/text would benefit greatly.
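For readers who haven't seen the DSL, the products mentioned in point 2 are one-liners in math-scala, and the optimizer recognizes these forms rather than materializing a physical transpose. A minimal sketch (drmA and drmB stand for any two already-loaded DRMs):

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    def gramians(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
      val ata = drmA.t %*% drmA   // A'A
      val aat = drmA %*% drmA.t   // AA'
      val atb = drmA.t %*% drmB   // A'B
      (ata, aat, atb)
    }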
> >>>   I will put a PR up soon.
> >> Just to clarify: I'll be porting the (very simple) TF and TFIDF classes
> >> and the Weight interface over from mr-legacy to math-scala. They're
> >> available now in the spark-shell but won't be after this refactoring.
> >> These still require a dictionary map and a frequency-count map to
> >> vectorize incoming text, so they're more for use with the old MR
> >> seq2sparse, and I don't think they can be used with Spark's HashingTF and
> >> IDF. I'll put them up soon; hopefully they'll be of some use.
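Since those classes need the dictionary and document-frequency maps anyway, the weighting itself is tiny; a sketch of a standard tf-idf weight (not necessarily the exact formula the ported classes use):

    import scala.math.log

    // tf: term count in the document, df: number of documents containing the term,
    // numDocs: total number of documents in the corpus
    def tfIdf(tf: Double, df: Double, numDocs: Double): Double =
      tf * log(numDocs / df)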
> >>
> >> On Feb 3, 2015, at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>> But first I need to do massive fixes and improvements to the distributed
> >>>> optimizer itself. Still waiting on green light for that.
> >>>> On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> >>>>
> >>>>> On Feb 3, 2015 7:20 AM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
> >>>>>> BTW, how hard would it be to make the DSL run on MLlib Vectors and
> >>>>>> RowMatrix? I'm looking at using their hashing TF-IDF, but it raises an
> >>>>>> impedance mismatch between the DRM and MLlib's RowMatrix. This would
> >>>>>> further reduce artifact size by a bunch.
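To make the impedance mismatch concrete: any bridge ends up copying element by element between the two vector types, since neither library exposes the other's. A hedged sketch, dense case only:

    import org.apache.mahout.math.{DenseVector => MahoutDenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors}

    def mllibToMahout(v: MLlibVector): MahoutVector = new MahoutDenseVector(v.toArray)

    def mahoutToMllib(v: MahoutVector): MLlibVector =
      Vectors.dense((0 until v.size).map(v.getQuick).toArray)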
> >>>>>
> >>>>> Short answer: if it were possible, I'd not bother with the Mahout code
> >>>>> base at all. The problem is that it lacks sufficiently flexible
> >>>>> semantics and abstraction. Breeze is far better in that department, but
> >>>>> at the time it was considerably worse at abstracting interoperability of
> >>>>> matrices with different structures. And MLlib does not expose Breeze.
> >>>>>
> >>>>> Looking forward to the hardware-accelerated bolt-on work, I must say
> >>>>> that after reading the Breeze code for some time I still have a much
> >>>>> clearer plan for how such backend hybridization and cost calibration
> >>>>> might work with the current Mahout math abstractions than with Breeze.
> >>>>> It is also more in line with my current work tasks.
> >>>>>
> >>>>>> Also: backing something like a DRM with DStreams. Periodic model
> >>>>>> recalc from streams is maybe the first step towards truly streaming
> >>>>>> algos. I'm looking at DStream -> DRM conversion for A'A, A'B, and AA'
> >>>>>> in item and row similarity. Attach Kafka and you get evergreen models,
> >>>>>> if not incrementally updating models.
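A rough sketch of that periodic-recalc idea, assuming only the existing drmWrap entry point; the window sizes and the upstream step that produces keyed Mahout vectors are hypothetical:

    import org.apache.mahout.math.{Vector => MahoutVector}
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // docs: a DStream of already-vectorized rows, keyed by row id
    def periodicAtA(docs: DStream[(Int, MahoutVector)]): Unit =
      docs.window(Seconds(3600), Seconds(600)).foreachRDD { rdd =>
        val drmA = drmWrap(rdd)
        val ata = (drmA.t %*% drmA).checkpoint()   // recompute the model on each sliding window
        // ... hand ata off to whatever consumes the refreshed model
      }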
> >>>>>> On Feb 2, 2015, at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >>>>>> Bottom line: compile-time dependencies are satisfied with no extra
> >>>>>> stuff from mr-legacy or its transitives. This is proven by successful
> >>>>>> compilation with no dependency on mr-legacy in the tree.
> >>>>>>
> >>>>>> Runtime sufficiency without extra dependencies is proven by running the
> >>>>>> shell or the embedded (unit) tests, which are successful too. This
> >>>>>> covers the embedding and shell APIs.
> >>>>>>
> >>>>>> The Guava issue is a typical one: if it were a real problem, I wouldn't
> >>>>>> be able to compile and/or run stuff. Now, the question is what we do if
> >>>>>> drivers want extra stuff that is not found in Spark.
> >>>>>>
> >>>>>> It is so nice not to depend on anything extra that I am hesitant to
> >>>>>> offer anything here. Either shading or a lib directory with an opt-in
> >>>>>> dependency policy would suffice, though, since it doesn't look like
> >>>>>> we'd need tons of extras for the drivers.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Jan 31, 2015 at 10:17 AM, Pat Ferrel <p...@occamsmachete.com
> >
> >>>>> wrote:
> >>>>>>> I vaguely remember there being a Guava version problem where the
> >>>>>>> version had to be rolled back in one of the Hadoop modules. The
> >>>>>>> math-scala IndexedDataset shouldn't care about the version.
> >>>>>>>
> >>>>>>> BTW, it seems pretty easy to take out the option parser and replace it
> >>>>>>> with match and tuples, especially if we can extend the Scala App
> >>>>>>> class. It might actually simplify things, since I could then use
> >>>>>>> several case classes to hold options (scopt needed a single object),
> >>>>>>> which in turn takes out all those ugly casts. I'll take a look next
> >>>>>>> time I'm in there.
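Not the actual driver code, just an illustration of the suggested pattern: one case class per driver's options plus a tail-recursive match over the argument list, with no casts (the option names are hypothetical):

    // hypothetical option holder; each driver would define its own
    case class ItemSimilarityOpts(input: String = "", output: String = "", maxSims: Int = 100)

    @annotation.tailrec
    def parse(args: List[String], opts: ItemSimilarityOpts = ItemSimilarityOpts()): ItemSimilarityOpts =
      args match {
        case "--input"   :: v :: rest => parse(rest, opts.copy(input = v))
        case "--output"  :: v :: rest => parse(rest, opts.copy(output = v))
        case "--maxSims" :: v :: rest => parse(rest, opts.copy(maxSims = v.toInt))
        case Nil                      => opts
        case unknown :: _             => sys.error(s"Unrecognized option: $unknown")
      }

    // usage: val opts = parse(args.toList)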
> >>>>>>>
> >>>>>>> On Jan 30, 2015, at 4:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> >>>>> wrote:
> >>>>>>> In the 'spark' module it is overridden by the Spark dependency, which
> >>>>>>> as it happens comes in at the same version, so it should be fine with
> >>>>>>> 1.1.x.
> >>>>>>>
> >>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-spark_2.10 ---
> >>>>>>> [INFO] org.apache.mahout:mahout-spark_2.10:jar:1.0-SNAPSHOT
> >>>>>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0:compile
> >>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  +- commons-cli:commons-cli:jar:1.2:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.commons:commons-math:jar:2.1:compile
> >>>>>>> [INFO] |  |  |  +- commons-io:commons-io:jar:2.4:compile
> >>>>>>> [INFO] |  |  |  +- commons-logging:commons-logging:jar:1.1.3:compile
> >>>>>>> [INFO] |  |  |  +- commons-lang:commons-lang:jar:2.6:compile
> >>>>>>> [INFO] |  |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> >>>>>>> [INFO] |  |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
> >>>>>>> [INFO] |  |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> >>>>>>> [INFO] |  |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> >>>>>>> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.avro:avro:jar:1.7.4:compile
> >>>>>>> [INFO] |  |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
> >>>>>>> [INFO] |  |  |     \- org.tukaani:xz:jar:1.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> >>>>>>> [INFO] |  |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
> >>>>>>> [INFO] |  |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> >>>>>>> [INFO] |  |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> >>>>>>> [INFO] |  |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> >>>>>>> [INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> >>>>>>> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> >>>>>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> >>>>>>> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> >>>>>>> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> >>>>>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> >>>>>>> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> >>>>>>> [INFO] |  |     \- jline:jline:jar:0.9.94:compile
> >>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> >>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> >>>>>>> [INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> >>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> >>>>>>> [INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> >>>>>>> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
> >>>>>>>
> >>>>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Looks like it is also requested by mahout-math; I wonder what is
> >>>>>>>> using it there.
> >>>>>>>>
> >>>>>>>> At the very least, it needs to be synchronized with the version
> >>>>>>>> currently used by Spark.
> >>>>>>>>
> >>>>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
> >>>>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> >>>>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> >>>>>>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> >>>>>>>> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> >>>>>>>> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> >>>>>>>> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> >>>>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> >>>>>>>> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>>>>>>> Looks like Guava is in Spark.
> >>>>>>>>>
> >>>>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel <p...@occamsmachete.com>
> >>>>> wrote:
> >>>>>>>>> IndexedDataset uses Guava. Can't tell for sure, but it sounds like
> >>>>>>>>> this would not be included, since I think it was taken from the
> >>>>>>>>> mr-legacy jar.
> >>>>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov <
> dlie...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>> From: "Pat Ferrel" <p...@occamsmachete.com>
> >>>>>>>>> Date: Jan 25, 2015 9:39 AM
> >>>>>>>>> Subject: Re: Codebase refactoring proposal
> >>>>>>>>> To: <dev@mahout.apache.org>
> >>>>>>>>> Cc:
> >>>>>>>>>
> >>>>>>>>>> When you get a chance a PR would be good.
> >>>>>>>>> Yes, it would. And not just for that.
> >>>>>>>>>
> >>>>>>>>>> As I understand it you are putting some class jars somewhere in
> >> the
> >>>>>>>>> classpath. Where? How?
> >>>>>>>>> /bin/mahout
> >>>>>>>>>
> >>>>>>>>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> >>>>>>>>> 'bin/mahout -spark'.)
> >>>>>>>>>
> >>>>>>>>> If I interpret the current shell code there correctly, the legacy
> >>>>>>>>> path tries to use the examples assemblies if not packaged, or /lib
> >>>>>>>>> if packaged. The true motivation for that significantly predates
> >>>>>>>>> 2010, and I suspect only Benson knows the whole intent there.
> >>>>>>>>>
> >>>>>>>>> The spark path, which is really a quick hack of the script, tries to
> >>>>>>>>> pick up only selected Mahout jars plus the locally installed Spark
> >>>>>>>>> classpath, which I guess is just the shaded Spark jar in recent
> >>>>>>>>> Spark releases. It also apparently tries to include /libs/*, which
> >>>>>>>>> is never compiled in the unpackaged version, and I now think
> >>>>>>>>> including it is a bug, because /libs/* is apparently legacy
> >>>>>>>>> packaging and shouldn't be used in Spark jobs with a wildcard. I
> >>>>>>>>> can't believe how lazy I am; I still haven't found the time to
> >>>>>>>>> understand the Mahout build in all cases.
> >>>>>>>>>
> >>>>>>>>> I am not even sure packaged Mahout will work with Spark, honestly,
> >>>>>>>>> because of the /lib. I've never tried it, since I mostly use
> >>>>>>>>> application-embedding techniques.
> >>>>>>>>>
> >>>>>>>>> The same solution may apply to adding external dependencies and
> >>>>>>>>> removing the assembly in the Spark module, which would leave only
> >>>>>>>>> one major build issue AFAIK.
> >>>>>>>>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <
> dlie...@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>> No, no PR. Only a private experiment. But I believe I have
> >>>>>>>>>> sufficiently defined what I want to do in order to gauge whether we
> >>>>>>>>>> may want to advance it some time later. The goal is a much lighter
> >>>>>>>>>> dependency set for the Spark code: eliminate everything that is not
> >>>>>>>>>> a compile-time dependency (and a lot of it comes in through legacy
> >>>>>>>>>> MR code, which we of course don't use).
> >>>>>>>>>>
> >>>>>>>>>> Can't say I understand the remaining issues you are talking about,
> >>>>>>>>>> though. If you are talking about compiling lib or a shaded
> >>>>>>>>>> assembly, no, this doesn't do anything about that. Although the
> >>>>>>>>>> point is, as it stands, the algebra and shell don't have any
> >>>>>>>>>> external dependencies other than Spark and these 4 (5?) Mahout
> >>>>>>>>>> jars, so they technically don't even need an assembly (as
> >>>>>>>>>> demonstrated).
> >>>>>>>>>>
> >>>>>>>>>> As I said, it seems the driver code is the only piece that may need
> >>>>>>>>>> some external dependencies, but that's a different scenario from
> >>>>>>>>>> the ones I am talking about. I am relatively happy with having the
> >>>>>>>>>> first two working nicely at this point.
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <
> >> p...@occamsmachete.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>> +1
> >>>>>>>>>>>
> >>>>>>>>>>> Is there a PR? You mention a "tiny mahout-hadoop" module. It would
> >>>>>>>>>>> be nice to see how you've structured that, in case we can use the
> >>>>>>>>>>> same model to solve the two remaining refactoring issues:
> >>>>>>>>>>> 1) external dependencies in the spark module
> >>>>>>>>>>> 2) no spark or h2o in the release artifacts.
> >>>>>>>>>>>
> >>>>>>>>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu>
> >>>>> wrote:
> >>>>>>>>>>> Also +1
> >>>>>>>>>>>
> >>>>>>>>>>> iPhone'd
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com
> >
> >>>>>>> wrote:
> >>>>>>>>>>>> +1
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Sent from my Verizon Wireless 4G LTE smartphone
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------- Original message --------
> >>>>>>>>>>>> From: Dmitriy Lyubimov <dlie...@gmail.com>
> >>>>>>>>>>>> Date: 01/23/2015 6:06 PM (GMT-05:00)
> >>>>>>>>>>>> To: dev@mahout.apache.org
> >>>>>>>>>>>> Subject: Codebase refactoring proposal
> >>>>>>>>>>>> So right now mahout-spark depends on mr-legacy. I did a quick
> >>>>>>>>>>>> refactoring, and it turns out it only _irrevocably_ depends on the
> >>>>>>>>>>>> following classes there:
> >>>>>>>>>>>>
> >>>>>>>>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
> >>>>>>>>>>>> and ... *sigh* ... o.a.m.common.Pair
> >>>>>>>>>>>>
> >>>>>>>>>>>> So I just dropped those five classes into a new tiny mahout-hadoop
> >>>>>>>>>>>> module (to signify stuff that is directly relevant to serializing
> >>>>>>>>>>>> things to the DFS API) and completely removed mr-legacy and its
> >>>>>>>>>>>> transitives from the spark and spark-shell dependencies.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So non-CLI applications (shell scripts and embedded API use)
> >>>>>>>>>>>> actually only need the Spark dependencies (which come from the
> >>>>>>>>>>>> SPARK_HOME classpath, of course) and the Mahout jars:
> >>>>>>>>>>>> mahout-spark, mahout-math(-scala), mahout-hadoop, and optionally
> >>>>>>>>>>>> mahout-spark-shell (for running the shell).
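As an illustration of the embedded-API case, a sketch with only Spark (from SPARK_HOME) and the Mahout jars listed above on the classpath; the entry points mahoutSparkContext and drmParallelize are my recollection of the current bindings, so treat the exact names as an assumption:

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    // distributed context backed by a plain SparkContext
    implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "embedded-samsara")

    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)), numPartitions = 2)
    val ata = (drmA.t %*% drmA).collect   // runs on Spark, returns an in-core Matrix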
> >>>>>>>>>>>>
> >>>>>>>>>>>> This of course still doesn't address the driver problem of
> >>>>>>>>>>>> wanting to throw more stuff onto the front-end classpath (such as
> >>>>>>>>>>>> a CLI parser), but at least it makes the transitive luggage of
> >>>>>>>>>>>> mr-legacy (and the size of the worker-shipped jars) much more
> >>>>>>>>>>>> tolerable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> How does that sound?
> >>>>>>>>>
> >>
> >>
>
>
>
