---------- Forwarded message ----------
From: "Pat Ferrel" <p...@occamsmachete.com>
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: <dev@mahout.apache.org>
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it, you are putting some class jars somewhere in the
> classpath. Where? How?
>

/bin/mahout

(It computes two different classpaths; compare 'bin/mahout classpath' with
'bin/mahout -spark'.)

If I interpret the current shell code there correctly, the legacy path
tries to use the examples assemblies when Mahout is not packaged, or /lib
when it is. The true motivation for that significantly predates 2010, and I
suspect only Benson knows the whole intent there.
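
Roughly, how I read that legacy branch (a schematic shell sketch, not the
actual script):

  # Sketch of the legacy classpath branch as I read it (schematic).
  if [ -d "$MAHOUT_HOME/lib" ]; then
    # Packaged distribution: take everything under lib/.
    CLASSPATH="$CLASSPATH:$MAHOUT_HOME/lib/*"
  else
    # Unpackaged source tree: fall back to the examples job assemblies.
    for f in "$MAHOUT_HOME"/examples/target/mahout-examples-*-job.jar; do
      CLASSPATH="$CLASSPATH:$f"
    done
  fi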

The spark path, which is really a quick hack of the script, tries to pick
up only selected Mahout jars plus the locally installed Spark classpath,
which I guess is just the shaded Spark jar in recent Spark releases. It
also apparently tries to include /libs/*, which is never compiled in the
unpackaged version. I now think including it is a bug, because /libs/* is
apparently legacy packaging and shouldn't be pulled into Spark jobs with a
wildcard. I can't believe how lazy I am; I still haven't found time to
understand the Mahout build in all cases.
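
Schematically, this is what I think the -spark branch amounts to (again a
sketch, not the actual script; compute-classpath.sh is the helper that
older Spark releases shipped):

  # Sketch of the -spark classpath branch as I read it (schematic).
  # Only selected mahout jars...
  for f in "$MAHOUT_HOME"/math/target/mahout-math-*.jar \
           "$MAHOUT_HOME"/math-scala/target/mahout-math-scala_*.jar \
           "$MAHOUT_HOME"/spark/target/mahout-spark_*.jar; do
    CLASSPATH="$CLASSPATH:$f"
  done
  # ...plus the locally installed Spark classpath...
  CLASSPATH="$CLASSPATH:$("$SPARK_HOME"/bin/compute-classpath.sh)"
  # ...and, apparently, /libs/* -- the part I suspect is a bug, since
  # that is legacy packaging and shouldn't be wildcarded into Spark jobs.
  CLASSPATH="$CLASSPATH:$MAHOUT_HOME/libs/*"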

I am not even sure the packaged Mahout will work with Spark at all,
honestly, because of /lib. I have never tried it, since I mostly use
application-embedding techniques.

The same solution may apply to adding external dependencies and removing
the assembly in the Spark module, which would leave only one major build
issue, AFAIK.
>
> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> No, no PR. Only an experiment on a private branch. But I believe I have
> sufficiently defined what I want to do in order to gauge whether we may
> want to advance it some time later. The goal is a much lighter dependency
> footprint for the Spark code: eliminate everything that is not a
> compile-time dependency (and a lot of it comes in through legacy MR code,
> which we of course don't use).
>
> Can't say I understand the remaining issues you are talking about, though.
>
> If you are talking about compiling lib or the shaded assembly, no, this
> doesn't do anything about that. Although the point is, as it stands, the
> algebra and shell don't have any external dependencies but Spark and these
> 4 (5?) Mahout jars, so they technically don't even need an assembly (as
> demonstrated).
>
> As I said, it seems the driver code is the only one that may need some
> external dependencies, but that's a different scenario from the ones I am
> talking about. But I am relatively happy with having the first two working
> nicely at this point.
>
> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > +1
> >
> > Is there a PR? You mention a “tiny mahout-hadoop” module. It would be
> > nice to see how you’ve structured that, in case we can use the same
> > model to solve the two remaining refactoring issues:
> > 1) external dependencies in the spark module
> > 2) no spark or h2o in the release artifacts.
> >
> > On Jan 23, 2015, at 6:45 PM, Shannon Quinn <squ...@gatech.edu> wrote:
> >
> > Also +1
> >
> >> On Jan 23, 2015, at 18:38, Andrew Palumbo <ap....@outlook.com> wrote:
> >>
> >> +1
> >>
> >> -------- Original message --------
> >> From: Dmitriy Lyubimov <dlie...@gmail.com>
> >> Date: 01/23/2015 6:06 PM (GMT-05:00)
> >> To: dev@mahout.apache.org
> >> Subject: Codebase refactoring proposal
> >> So right now mahout-spark depends on mr-legacy.
> >> I did a quick refactoring, and it turns out it only _irrevocably_
> >> depends on the following classes there:
> >>
> >> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable,
> >> and... *sigh* ...o.a.m.common.Pair.
> >>
> >> So I just dropped those five classes into a new tiny mahout-hadoop
> >> module (to signify stuff that is directly relevant to serializing
> >> things to the DFS API) and completely removed mr-legacy and its
> >> transitive dependencies from the spark and spark-shell dependencies.
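> >>
> >> Roughly, the new module layout looks like this (paths from memory, so
> >> approximate):
> >>
> >>   mahout-hadoop/
> >>     src/main/java/org/apache/mahout/math/MatrixWritable.java
> >>     src/main/java/org/apache/mahout/math/VectorWritable.java
> >>     src/main/java/org/apache/mahout/math/Varint.java
> >>     src/main/java/org/apache/mahout/math/VarintWritable.java
> >>     src/main/java/org/apache/mahout/common/Pair.java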
> >>
> >> So non-CLI applications (shell scripts and embedded API use) actually
> >> only need the Spark dependencies (which come from the SPARK_HOME
> >> classpath, of course) and the Mahout jars: mahout-spark,
> >> mahout-math(-scala), mahout-hadoop, and optionally mahout-spark-shell
> >> (for running the shell).
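> >>
> >> For example, something like this should be enough to bring up the shell
> >> against a local Spark install (jar names, versions, and the main class
> >> here are illustrative, not gospel):
> >>
> >>   export SPARK_HOME=/path/to/spark
> >>   # Spark's own jars plus the handful of mahout jars -- no assembly.
> >>   M="math/target/mahout-math-0.10.0-SNAPSHOT.jar"
> >>   M="$M:math-scala/target/mahout-math-scala_2.10-0.10.0-SNAPSHOT.jar"
> >>   M="$M:hadoop/target/mahout-hadoop-0.10.0-SNAPSHOT.jar"
> >>   M="$M:spark/target/mahout-spark_2.10-0.10.0-SNAPSHOT.jar"
> >>   M="$M:spark-shell/target/mahout-spark-shell_2.10-0.10.0-SNAPSHOT.jar"
> >>   java -cp "$SPARK_HOME/lib/*:$M" org.apache.mahout.sparkbindings.shell.Main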
> >>
> >> This of course still doesn't address the driver problem of wanting to
> >> throw more stuff onto the front-end classpath (such as a CLI parser),
> >> but at least it makes the transitive luggage of mr-legacy (and the size
> >> of the worker-shipped jars) much more tolerable.
> >>
> >> How does that sound?
> >
> >
>
