Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
I don't know why. I said I didn't see either as a problem. As far as I am
concerned, I had encountered both needs in the past and did not even notice
they were a problem. Both are not relevant to this thread, though; I'd suggest
starting a separate thread.

Speaking of my priorities, the two biggest problems I see are in-core
performance and tons of archaic dependencies, but only one belongs here. The
third biggest problem is general bugs and code tidiness.
On Feb 8, 2015 8:22 PM, "Pat Ferrel"  wrote:

> OK, well perhaps those two lines of code (actually I agree, there’s not
> much more) can also be applied to TF-IDF and several other algorithms to
> get a much higher level of interoperability and keep us from reinventing
> things when not necessary. Funny we have type conversions for so many
> things *but* MLlib. I’ve been arguing about what an uneven state MLlib is
> in, but it does solve problems we don’t need to reinvent. Frankly, adopting
> the best of MLlib makes Mahout a superset along with all its other virtues.
>
> And yes, I forgot to also praise the DSL’s optimizer—now rectified.
>
> Why do we spend more time on engine-agnostic decisions than on these more
> pragmatic ones?
>
>
> On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov  wrote:
>
> The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
> application and conversion back is another line. I actually did that some
> time ago. I am sure you can figure the details.
>
> As to whether it is worth retaining some commonality: no, it is not worth
> it until there's commonality across mllib.
>
> At which point we may just include conversions for those who are
> interested. Until then all we can do is maintain commonality with mllib
> kmeans specifically, but not mllib as a whole.
> On Feb 8, 2015 7:45 PM, "Pat Ferrel"  wrote:
>
> > I completely understand that MLlib lacks anything like the completeness
> > of Mahout's DSL; I know of no other scalable solution to match. I don’t
> > know how many times this has to be said. This is something we can all get
> > behind as *unique* to Mahout.
> >
> > But I stand by the statement that there should also be some lower level
> > data commonality. There is too much similarity to dismiss and go
> > completely non-overlapping ways. Even if you can argue for maintaining
> > separate parallel ways, let’s have some type conversions (I hesitate to
> > say easy to use). They shouldn’t be all that hard.
> >
> > A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> > solve my Kmeans use case. You know MLlib better than I so choose the best
> > level to perform type conversions or inheritance splicing. The point is
> > to make the two as seamless as possible. Doesn’t this seem a worthy goal?
> >
> > On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov  wrote:
> >
> > Pat,
> >
> > I *just* made a case in this thread explaining that mllib does not have a
> > single distributed matrix type and that its own methodologies do not
> > interoperate within itself for that reason. Therefore, it is fundamentally
> > impossible to be interoperable with mllib, since nobody can really define
> > what that means in terms of distributed types.
> >
> > You are in fact referring to their in-core type, not a distributed type.
> > But there's no linear algebra operation support to speak of there either.
> > It is, simply, not algebra at the moment. The types in this hierarchy are
> > just memory storage models, and private-scope converters to breeze storage
> > models, but they are not true linalg APIs nor providers of such.
> >
> > One might conceivably want to standardize on Breeze APIs, since those are
> > both linalg API and providers, but not the type you've been mentioning.
> >
> > However, it is not a very happy path either. Breeze is a somewhat more
> > interesting substrate to build in-core operations on, but if you read the
> > spark forum of late, even spark developers express a whiff of
> > dissatisfaction with it in favor of BIDMat (me too, btw). But while they
> > say BIDMat would be a better choice for in-core operators, they also
> > recognize the fact that they are too invested in the breeze API by now and
> > such a move would not be cheap across the board.
> >
> > And that demonstrates another problem in the in-core mllib architecture:
> > on one side, they don't have a sufficient public in-core DSL or API to
> > speak of; but they also do not have a sufficiently abstract API for
> > in-core BLAS plugins either, to be truly agnostic of the available in-core
> > methodologies.
> >
> > So what you are talking about is simply not possible with the current
> > state of things there. But if it were, I'd just suggest you try to port
> > the algebraic things you like in Mahout to mllib.
> >
> > My guess, however, is that you'd find that porting the algebraic optimizer
> > with a proper level of consistency with in-core operations will not be
> > easy, for reasons including, but not limited to, the ones I just
> > mentioned; although …

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
OK, well perhaps those two lines of code (actually I agree, there’s not much
more) can also be applied to TF-IDF and several other algorithms to get a much
higher level of interoperability and keep us from reinventing things when not
necessary. Funny we have type conversions for so many things *but* MLlib. I’ve
been arguing about what an uneven state MLlib is in, but it does solve problems
we don’t need to reinvent. Frankly, adopting the best of MLlib makes Mahout a
superset along with all its other virtues.

And yes, I forgot to also praise the DSL’s optimizer—now rectified.

Why do we spend more time on engine-agnostic decisions than on these more
pragmatic ones?

 
On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov  wrote:

The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
application and conversion back is another line. I actually did that some
time ago. I am sure you can figure the details.

As to whether it is worth retaining some commonality: no, it is not worth it
until there's commonality across mllib.

At which point we may just include conversions for those who are interested.
Until then all we can do is maintain commonality with mllib kmeans
specifically, but not mllib as a whole.
On Feb 8, 2015 7:45 PM, "Pat Ferrel"  wrote:

> I completely understand that MLlib lacks anything like the completeness of
> Mahout's DSL; I know of no other scalable solution to match. I don’t know
> how many times this has to be said. This is something we can all get behind
> as *unique* to Mahout.
> 
> But I stand by the statement that there should also be some lower level
> data commonality. There is too much similarity to dismiss and go completely
> non-overlapping ways. Even if you can argue for maintaining separate
> parallel ways, let’s have some type conversions (I hesitate to say easy to
> use). They shouldn’t be all that hard.
> 
> A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> solve my Kmeans use case. You know MLlib better than I so choose the best
> level to perform type conversions or inheritance splicing. The point is to
> make the two as seamless as possible. Doesn’t this seem a worthy goal?
> 
> On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov  wrote:
> 
> Pat,
> 
> I *just* made a case in this thread explaining that mllib does not have a
> single distributed matrix type and that its own methodologies do not
> interoperate within itself for that reason. Therefore, it is fundamentally
> impossible to be interoperable with mllib, since nobody can really define
> what that means in terms of distributed types.
>
> You are in fact referring to their in-core type, not a distributed type.
> But there's no linear algebra operation support to speak of there either.
> It is, simply, not algebra at the moment. The types in this hierarchy are
> just memory storage models, and private-scope converters to breeze storage
> models, but they are not true linalg APIs nor providers of such.
>
> One might conceivably want to standardize on Breeze APIs, since those are
> both linalg API and providers, but not the type you've been mentioning.
>
> However, it is not a very happy path either. Breeze is a somewhat more
> interesting substrate to build in-core operations on, but if you read the
> spark forum of late, even spark developers express a whiff of
> dissatisfaction with it in favor of BIDMat (me too, btw). But while they
> say BIDMat would be a better choice for in-core operators, they also
> recognize the fact that they are too invested in the breeze API by now and
> such a move would not be cheap across the board.
>
> And that demonstrates another problem in the in-core mllib architecture:
> on one side, they don't have a sufficient public in-core DSL or API to
> speak of; but they also do not have a sufficiently abstract API for
> in-core BLAS plugins either, to be truly agnostic of the available in-core
> methodologies.
>
> So what you are talking about is simply not possible with the current
> state of things there. But if it were, I'd just suggest you try to port
> the algebraic things you like in Mahout to mllib.
>
> My guess, however, is that you'd find that porting the algebraic optimizer
> with a proper level of consistency with in-core operations will not be
> easy, for reasons including, but not limited to, the ones I just
> mentioned; although individual BLAS ops like the matrix square you've
> mentioned would be fairly easy to do for one of the distributed matrix
> types in mllib. But that of course would be neither an R-like environment
> nor an optimizer.
>
> I like BIDMat a lot, though; but it is not a truly hybrid and
> self-adjusting environment for in-core operations either (and its DSL is
> neither R-like nor Matlab-like, so it takes a bit of adjusting to). For
> that reason even BIDMat linalg types and DSL are not truly versatile
> enough for our (well, my anyway) purposes (which are to find the best
> hardware or software subroutine automatically, given the current hardware
> and software platform architecture and the parameters of the requested
> operation).

Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
The conversion from DRM to rdd of vectors for kmeans is one line. Kmeans
application and conversion back is another line. I actually did that some
time ago. I am sure you can figure the details.
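
For illustration, here is a minimal sketch of those two lines, assuming a
Spark-era MLlib and access to the DRM's underlying row RDD (exact Mahout
binding names vary by version; treat everything here as an assumption, not
the canonical implementation):

    // Hedged sketch only: DRM rows -> RDD[mllib Vector] and back.
    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.rdd.RDD

    // one line out: densify each Mahout row vector into an MLlib vector
    def toMllib(drmRows: RDD[(Int, MahoutVector)]): RDD[org.apache.spark.mllib.linalg.Vector] =
      drmRows.map { case (_, v) => Vectors.dense(Array.tabulate(v.size)(i => v.get(i))) }

    // one line back: MLlib vectors become rows of a DRM-shaped (key, vector) RDD
    def toDrmRows(vs: RDD[org.apache.spark.mllib.linalg.Vector]): RDD[(Int, MahoutVector)] =
      vs.zipWithIndex.map { case (v, i) => i.toInt -> (new DenseVector(v.toArray): MahoutVector) }

    // usage, assuming drmA.rdd exposes the row RDD in the Spark bindings:
    // val model = KMeans.train(toMllib(drmA.rdd), k = 10, maxIterations = 20)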

As to whether it is worth retaining some commonality: no, it is not worth it
until there's commonality across mllib.

At which point we may just include conversions for those who are interested.
Until then all we can do is maintain commonality with mllib kmeans
specifically, but not mllib as a whole.
On Feb 8, 2015 7:45 PM, "Pat Ferrel"  wrote:

> I completely understand that MLlib lacks anything like the completeness of
> Mahout's DSL; I know of no other scalable solution to match. I don’t know
> how many times this has to be said. This is something we can all get behind
> as *unique* to Mahout.
>
> But I stand by the statement that there should also be some lower level
> data commonality. There is too much similarity to dismiss and go completely
> non-overlapping ways. Even if you can argue for maintaining separate
> parallel ways, let’s have some type conversions (I hesitate to say easy to
> use). They shouldn’t be all that hard.
>
> A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would
> solve my Kmeans use case. You know MLlib better than I so choose the best
> level to perform type conversions or inheritance splicing. The point is to
> make the two as seamless as possible. Doesn’t this seem a worthy goal?
>
> On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov  wrote:
>
> Pat,
>
> I *just* made a case in this thread explaining that mllib does not have a
> single distributed matrix type and that its own methodologies do not
> interoperate within itself for that reason. Therefore, it is fundamentally
> impossible to be interoperable with mllib, since nobody can really define
> what that means in terms of distributed types.
>
> You are in fact referring to their in-core type, not a distributed type.
> But there's no linear algebra operation support to speak of there either.
> It is, simply, not algebra at the moment. The types in this hierarchy are
> just memory storage models, and private-scope converters to breeze storage
> models, but they are not true linalg APIs nor providers of such.
>
> One might conceivably want to standardize on Breeze APIs, since those are
> both linalg API and providers, but not the type you've been mentioning.
>
> However, it is not a very happy path either. Breeze is a somewhat more
> interesting substrate to build in-core operations on, but if you read the
> spark forum of late, even spark developers express a whiff of
> dissatisfaction with it in favor of BIDMat (me too, btw). But while they
> say BIDMat would be a better choice for in-core operators, they also
> recognize the fact that they are too invested in the breeze API by now and
> such a move would not be cheap across the board.
>
> And that demonstrates another problem in the in-core mllib architecture:
> on one side, they don't have a sufficient public in-core DSL or API to
> speak of; but they also do not have a sufficiently abstract API for
> in-core BLAS plugins either, to be truly agnostic of the available in-core
> methodologies.
>
> So what you are talking about is simply not possible with the current
> state of things there. But if it were, I'd just suggest you try to port
> the algebraic things you like in Mahout to mllib.
>
> My guess, however, is that you'd find that porting the algebraic optimizer
> with a proper level of consistency with in-core operations will not be
> easy, for reasons including, but not limited to, the ones I just
> mentioned; although individual BLAS ops like the matrix square you've
> mentioned would be fairly easy to do for one of the distributed matrix
> types in mllib. But that of course would be neither an R-like environment
> nor an optimizer.
>
> I like BIDMat a lot, though; but it is not a truly hybrid and
> self-adjusting environment for in-core operations either (and its DSL is
> neither R-like nor Matlab-like, so it takes a bit of adjusting to). For
> that reason even BIDMat linalg types and DSL are not truly versatile
> enough for our (well, my anyway) purposes (which are to find the best
> hardware or software subroutine automatically, given the current hardware
> and software platform architecture and the parameters of the requested
> operation).
> On Feb 8, 2015 9:05 AM, "Pat Ferrel"  wrote:
>
> > Why aren’t we using linalg.Vector and its siblings? The same could be
> > asked for linalg.Matrix. If we want to prune dependencies this would help
> > and would also significantly increase interoperability.
> >
> > Case-now: I have a real need to cluster items in a CF type input matrix.
> > The input matrix A’ has rows of items. I need to drop this into a sequence
> > file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> > RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> > and maybe could be helped with some implicit conversions mahout.Vector <->
> > linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> > Kmeans).

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
I completely understand that MLlib lacks anything like the completeness of
Mahout's DSL; I know of no other scalable solution to match. I don’t know how
many times this has to be said. This is something we can all get behind as 
*unique* to Mahout.

But I stand by the statement that there should also be some lower level data 
commonality. There is too much similarity to dismiss and go completely 
non-overlapping ways. Even if you can argue for maintaining separate parallel
ways, let’s have some type conversions (I hesitate to say easy to use). They
shouldn’t be all that hard.

A conversion of DRM of o.a.m.Vector to rdd of MLlib Vector and back would solve 
my Kmeans use case. You know MLlib better than I so choose the best level to 
perform type conversions or inheritance splicing. The point is to make the two 
as seamless as possible. Doesn’t this seem a worthy goal?
 
On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov  wrote:

Pat,

I *just* made a case in this thread explaining that mllib does not have a
single distributed matrix type and that its own methodologies do not
interoperate within itself for that reason. Therefore, it is fundamentally
impossible to be interoperable with mllib, since nobody can really define
what that means in terms of distributed types.

You are in fact referring to their in-core type, not a distributed type.
But there's no linear algebra operation support to speak of there either.
It is, simply, not algebra at the moment. The types in this hierarchy are
just memory storage models, and private-scope converters to breeze storage
models, but they are not true linalg APIs nor providers of such.

One might conceivably want to standardize on Breeze APIs, since those are
both linalg API and providers, but not the type you've been mentioning.

However, it is not a very happy path either. Breeze is a somewhat more
interesting substrate to build in-core operations on, but if you read the
spark forum of late, even spark developers express a whiff of dissatisfaction
with it in favor of BIDMat (me too, btw). But while they say BIDMat would be
a better choice for in-core operators, they also recognize the fact that they
are too invested in the breeze API by now and such a move would not be cheap
across the board.

And that demonstrates another problem in the in-core mllib architecture: on
one side, they don't have a sufficient public in-core DSL or API to speak of;
but they also do not have a sufficiently abstract API for in-core BLAS
plugins either, to be truly agnostic of the available in-core methodologies.

So what you are talking about is simply not possible with the current state
of things there. But if it were, I'd just suggest you try to port the
algebraic things you like in Mahout to mllib.

My guess, however, is that you'd find that porting the algebraic optimizer
with a proper level of consistency with in-core operations will not be easy,
for reasons including, but not limited to, the ones I just mentioned;
although individual BLAS ops like the matrix square you've mentioned would be
fairly easy to do for one of the distributed matrix types in mllib. But that
of course would be neither an R-like environment nor an optimizer.

I like BIDMat a lot, though; but it is not a truly hybrid and self-adjusting
environment for in-core operations either (and its DSL is neither R-like nor
Matlab-like, so it takes a bit of adjusting to). For that reason even BIDMat
linalg types and DSL are not truly versatile enough for our (well, my anyway)
purposes (which are to find the best hardware or software subroutine
automatically, given the current hardware and software platform architecture
and the parameters of the requested operation).
On Feb 8, 2015 9:05 AM, "Pat Ferrel"  wrote:

> Why aren’t we using linalg.Vector and its siblings? The same could be
> asked for linalg.Matrix. If we want to prune dependencies this would help
> and would also significantly increase interoperability.
> 
> Case-now: I have a real need to cluster items in a CF type input matrix.
> The input matrix A’ has rows of items. I need to drop this into a sequence
> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> and maybe could be helped with some implicit conversions mahout.Vector <->
> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> Kmeans).
> 
> Case-possible: If we adopted linalg.Vector as the native format and
> perhaps even linalg.Matrix this would give immediate interoperability in
> some areas including my specific need. It would significantly pare down
> dependencies not provided by the environment (Mahout-math). It would also
> support creating distributed computation methods that would work on MLlib
> and Mahout datasets addressing Gokhan’s question.
> 
> I looked at another “Case-now” possibility, which was to go all MLlib with
> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> why would you want to do that?” …

Re: Codebase refactoring proposal

2015-02-08 Thread Dmitriy Lyubimov
Pat,

I *just* made a case in this thread explaining that mllib does not have a
single distributed matrix type and that its own methodologies do not
interoperate within itself for that reason. Therefore, it is fundamentally
impossible to be interoperable with mllib, since nobody can really define
what that means in terms of distributed types.

You are in fact referring to their in-core type, not a distributed type.
But there's no linear algebra operation support to speak of there either.
It is, simply, not algebra at the moment. The types in this hierarchy are
just memory storage models, and private-scope converters to breeze storage
models, but they are not true linalg APIs nor providers of such.

One might conceivably want to standardize on Breeze APIs, since those are
both linalg API and providers, but not the type you've been mentioning.

However, it is not a very happy path either. Breeze is a somewhat more
interesting substrate to build in-core operations on, but if you read the
spark forum of late, even spark developers express a whiff of dissatisfaction
with it in favor of BIDMat (me too, btw). But while they say BIDMat would be
a better choice for in-core operators, they also recognize the fact that they
are too invested in the breeze API by now and such a move would not be cheap
across the board.

And that demonstrates another problem in the in-core mllib architecture: on
one side, they don't have a sufficient public in-core DSL or API to speak of;
but they also do not have a sufficiently abstract API for in-core BLAS
plugins either, to be truly agnostic of the available in-core methodologies.

So what you are talking about is simply not possible with the current state
of things there. But if it were, I'd just suggest you try to port the
algebraic things you like in Mahout to mllib.

My guess, however, is that you'd find that porting the algebraic optimizer
with a proper level of consistency with in-core operations will not be easy,
for reasons including, but not limited to, the ones I just mentioned;
although individual BLAS ops like the matrix square you've mentioned would be
fairly easy to do for one of the distributed matrix types in mllib. But that
of course would be neither an R-like environment nor an optimizer.
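
For what it's worth, that particular one-off is indeed small against MLlib's
RowMatrix (a sketch against the Spark 1.x APIs):

    // "matrix square" A'A for one specific MLlib distributed type
    import org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def square(rows: RDD[org.apache.spark.mllib.linalg.Vector]): Matrix =
      new RowMatrix(rows).computeGramianMatrix()   // A'A as a local in-core Matrix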

I like BIDMat a lot, though; but it is not a truly hybrid and self-adjusting
environment for in-core operations either (and its DSL is neither R-like nor
Matlab-like, so it takes a bit of adjusting to). For that reason even BIDMat
linalg types and DSL are not truly versatile enough for our (well, my anyway)
purposes (which are to find the best hardware or software subroutine
automatically, given the current hardware and software platform architecture
and the parameters of the requested operation).
On Feb 8, 2015 9:05 AM, "Pat Ferrel"  wrote:

> Why aren’t we using linalg.Vector and its siblings? The same could be
> asked for linalg.Matrix. If we want to prune dependencies this would help
> and would also significantly increase interoperability.
>
> Case-now: I have a real need to cluster items in a CF type input matrix.
> The input matrix A’ has rows of items. I need to drop this into a sequence
> file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an
> RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad
> and maybe could be helped with some implicit conversions mahout.Vector <->
> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> Kmeans).
>
> Case-possible: If we adopted linalg.Vector as the native format and
> perhaps even linalg.Matrix this would give immediate interoperability in
> some areas including my specific need. It would significantly pare down
> dependencies not provided by the environment (Mahout-math). It would also
> support creating distributed computation methods that would work on MLlib
> and Mahout datasets addressing Gokhan’s question.
>
> I looked at another “Case-now” possibility, which was to go all MLlib with
> item similarity. I found that MLlib doesn’t have a transpose—“transpose,
> why would you want to do that?” Not even in the multiply form A’A, A’B,
> AA’, all used in item and row similarity. That stopped me from looking
> deeper.
>
> The strength and unique value of Mahout is the completeness of its
> generalized linear algebra DSL. But insistence on using Mahout specific
> data types is also a barrier for Spark people adopting the DSL. Not having
> lower level interoperability is a barrier both ways to mixing Mahout and
> MLlib—creating unnecessary either/or choices for devs.
>
> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov  wrote:
>
> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan  wrote:
>
> > What I am saying is that for certain algorithms including both
> > engine-specific (such as aggregation) and DSL stuff, what is the best way
> > of handling them?
> >
> > i) should we add the distributed operations to Mahout codebase as it is
> > proposed in #62?
> >
>
> IMO this can't go very well and very far (because of the engine specifics) …

Re: Codebase refactoring proposal

2015-02-08 Thread Pat Ferrel
Why aren’t we using linalg.Vector and its siblings? The same could be asked for 
linalg.Matrix. If we want to prune dependencies this would help and would also 
significantly increase interoperability.

Case-now: I have a real need to cluster items in a CF type input matrix. The 
input matrix A’ has rows of items. I need to drop this into a sequence file and
use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of 
linalg.Vectors and use MLlib Kmeans. The conversion is not too bad and maybe 
could be helped with some implicit conversions mahout.Vector <-> linalg.Vector 
(maybe mahout.DRM <-> linalg.Matrix, though not needed for Kmeans).
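
A minimal sketch of what those implicit conversions could look like (the only
APIs assumed are Mahout's Vector iteration and MLlib's Vectors factory;
everything else here is hypothetical):

    import org.apache.mahout.math.{DenseVector, RandomAccessSparseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{SparseVector => MllibSparse, Vector => MllibVector, Vectors}
    import scala.collection.JavaConverters._
    import scala.language.implicitConversions

    object VectorConversions {
      // mahout.Vector -> mllib Vector, preserving sparsity
      implicit def mahout2mllib(v: MahoutVector): MllibVector =
        if (v.isDense) Vectors.dense(Array.tabulate(v.size)(i => v.get(i)))
        else Vectors.sparse(v.size, v.nonZeroes().asScala.map(e => (e.index, e.get)).toSeq)

      // mllib Vector -> mahout.Vector
      implicit def mllib2mahout(v: MllibVector): MahoutVector = v match {
        case s: MllibSparse =>
          val out = new RandomAccessSparseVector(s.size)
          s.indices.zip(s.values).foreach { case (i, x) => out.setQuick(i, x) }
          out
        case d => new DenseVector(d.toArray)
      }
    }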

Case-possible: If we adopted linalg.Vector as the native format and perhaps 
even linalg.Matrix this would give immediate interoperability in some areas 
including my specific need. It would significantly pare down dependencies not 
provided by the environment (Mahout-math). It would also support creating 
distributed computation methods that would work on MLlib and Mahout datasets 
addressing Gokhan’s question.

I looked at another “Case-now” possibility, which was to go all MLlib with item 
similarity. I found that MLlib doesn’t have a transpose—“transpose, why would 
you want to do that?” Not even in the multiply form A’A, A’B, AA’, all used in 
item and row similarity. That stopped me from looking deeper.
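
For contrast, these are one-liners in Mahout's DSL, where the optimizer
handles the transpose-products without physically forming the transpose
(drmA and drmB here stand for any existing DRMs):

    val ata = drmA.t %*% drmA   // A'A
    val atb = drmA.t %*% drmB   // A'B
    val aat = drmA %*% drmA.t   // AA'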

The strength and unique value of Mahout is the completeness of its generalized 
linear algebra DSL. But insistence on using Mahout specific data types is also 
a barrier for Spark people adopting the DSL. Not having lower level 
interoperability is a barrier both ways to mixing Mahout and MLlib—creating 
unnecessary either/or choices for devs.

On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov  wrote:

On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan  wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
> 
> i) should we add the distributed operations to Mahout codebase as it is
> proposed in #62?
> 

IMO this can't go very well and very far (because of the engine specifics),
but I'd be willing to see an experiment with simple things like map and
reduce.

Bigger questions are where exactly we'll have to stop (we can't abstract
all capabilities out there because of "common denominator" issues), and
what percentage of methods will it truly allow to migrate to full backend
portability.

And if after doing all this, we will still find ourselves writing engine
specific mixes, why bother. Wouldn't it be better to find a good,
easy-to-replicate, incrementally-developed pattern to register and apply
engine-specific strategies for every method?


> 
> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> 

This is not quite what I am proposing. Rather, engine-ml modules holding
engine-specific _parts_ of an algorithm.

However, this really needs a POC over a guinea pig (similarly to how we
POC'd algebra in the first place with ssvd and spca).


> 
> 



Re: Codebase refactoring proposal

2015-02-05 Thread Dmitriy Lyubimov
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan  wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to Mahout codebase as it is
> proposed in #62?
>

IMO this can't go very well and very far (because of the engine specifics),
but I'd be willing to see an experiment with simple things like map and
reduce.

Bigger questions are where exactly we'll have to stop (we can't abstract
all capabilities out there because of "common denominator" issues), and
what percentage of methods will it truly allow to migrate to full backend
portability.

And if after doing all this, we will still find ourselves writing engine
specific mixes, why bother. Wouldn't it be better to find a good,
easy-to-replicate, incrementally-developed pattern to register and apply
engine-specific strategies for every method?


>
> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?
>

This is not quite what I am proposing. Rather, engine-ml modules holding
engine-specific _parts_ of an algorithm.

However, this really needs a POC over a guinea pig (similarly to how we
POC'd algebra in the first place with ssvd and spca).


>
>


Re: Codebase refactoring proposal

2015-02-05 Thread Pat Ferrel
From my own perspective:

I’m not aware of any rule to make all operations agnostic. In fact several 
engine specific exceptions are discussed in this long email. We’ve talked about 
reduce or join operations that would be difficult to make agnostic without a 
lot of knowledge of ALL other engines. Unless or until we get contributors from 
those engines reviewing commits, why put this burden on all of us?

An agnostic DSL was for linear algebra ops, not all distributed computation
methods. We aren’t doing a generic engine, only engine-agnostic algebra.

You have added stubs in H2O for the distributed aggregations. This seems fine,
but I wouldn’t vote to require that. If GSGD requires further use of
Spark-specific operations, so be it. This means that GSGD may live in the Spark
module with any algebra bits required added to math-scala. Does anyone have a
problem with that?

My vote on #62—ship it.

On the point of interoperability with MLlib: we still need to talk about that,
but in another email.


On Feb 5, 2015, at 1:14 AM, Gokhan Capan  wrote:

What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov  wrote:

> I took it Gokhan had objections himself, based on his comments, if we are
> talking about #62.
>
> He also expressed concerns about computing GSGD, but I suspect it can still
> be algebraically computed.
> 
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel  wrote:
> 
>> BTW Ted and Andrew have both expressed interest in the distributed
>> aggregation stuff. It sounds like we are agreeing that non-algebra,
>> computation-method-type things can be engine-specific.
>> 
>> So does anyone have an objection to Gokhan pushing his PR?
>> 
>> On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:
>> 
>> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo 
> wrote:
>> 
>>> 
>>> 
>>> 
>>> My thought was not to bring primitive engine-specific aggregators,
>>> combiners, etc. into math-scala.
>>> 
>> 
>> Yeah. +1. I would like to support that as an experiment, see where it
>> goes. Clearly some distributed use cases are simple enough while also
>> pervasive enough.
>> 
>> 
> 



Re: Codebase refactoring proposal

2015-02-05 Thread Gokhan Capan
What I am saying is that for certain algorithms including both
engine-specific (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.
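
A concrete reading of the two options, sketched below with all names
hypothetical (option i puts the operation behind the engine-agnostic facade;
option ii leaves it inside each engine module):

    import org.apache.mahout.math.{Vector => MahoutVector}
    import org.apache.mahout.math.drm.DrmLike

    // option (i): lives in math-scala; each backend supplies an implementation
    trait DistributedOps {
      def aggregateRows[K, U](drm: DrmLike[K])(zero: U)(
          seq: (U, MahoutVector) => U, comb: (U, U) => U): U
    }
    // option (ii): the same method written directly against Spark (or h2o)
    // primitives inside spark-bindings, unavailable to other engines.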

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov  wrote:

> I took it Gokhan had objections himself, based on his comments, if we are
> talking about #62.
>
> He also expressed concerns about computing GSGD, but I suspect it can still
> be algebraically computed.
>
> On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel  wrote:
>
> > BTW Ted and Andrew have both expressed interest in the distributed
> > aggregation stuff. It sounds like we are agreeing that non-algebra,
> > computation-method-type things can be engine-specific.
> >
> > So does anyone have an objection to Gokhan pushing his PR?
> >
> > On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:
> >
> > On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo 
> wrote:
> >
> > >
> > >
> > >
> > > My thought was not to bring primitive engine-specific aggregators,
> > > combiners, etc. into math-scala.
> > >
> >
> > Yeah. +1. I would like to support that as an experiment, see where it
> > goes. Clearly some distributed use cases are simple enough while also
> > pervasive enough.
> >
> >
>


Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
I took it Gokhan had objections himself, based on his comments, if we are
talking about #62.

He also expressed concerns about computing GSGD, but I suspect it can still
be algebraically computed.

On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel  wrote:

> BTW Ted and Andrew have both expressed interest in the distributed
> aggregation stuff. It sounds like we are agreeing that non-algebra,
> computation-method-type things can be engine-specific.
>
> So does anyone have an objection to Gokhan pushing his PR?
>
> On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo  wrote:
>
> >
> >
> >
> > My thought was not to bring primitive engine-specific aggregators,
> > combiners, etc. into math-scala.
> >
>
> Yeah. +1. I would like to support that as an experiment, see where it goes.
> Clearly some distributed use cases are simple enough while also pervasive
> enough.
>
>


Re: Codebase refactoring proposal

2015-02-04 Thread Pat Ferrel
BTW Ted and Andrew have both expressed interest in the distributed aggregation 
stuff. It sounds like we are agreeing that non-algebra, computation-method-type
things can be engine-specific.

So does anyone have an objection to Gokhan pushing his PR?

On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov  wrote:

On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo  wrote:

> 
> 
> 
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
> 

Yeah. +1. I would like to support that as an experiment, see where it goes.
Clearly some distributed use cases are simple enough while also pervasive
enough.



Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo  wrote:

>
>
>
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
>

Yeah. +1. I would like to support that as an experiment, see where it goes.
Clearly some distributed use cases are simple enough while also pervasive
enough.


Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
But also keep in mind that Flink folks are eager to allocate resources for
ML work. So maybe that's the way to work it -- create a DataFrame-based
seq2sparse port and then just hand it off to them to add to either Flink
directly (but with DRM output), or as a part of Mahout.

On Wed, Feb 4, 2015 at 2:07 PM, Dmitriy Lyubimov  wrote:

> Spark's DataFrame is obviously not agnostic.
>
> I don't believe there's a good way to abstract it. Unfortunately. I think
> getting too much into distributed operation abstraction is a bit dangerous.
>
> I think MLI was one project that attempted to do that -- but it did not
> take off, I guess. Or at least there were 0 commits in like 18 months
> there, if I am not mistaken, and it never made it into the spark tree.
>
> So it is a good question: if we need a dataframe in flink, what do we do?
> I am open to suggestions. I very much don't want to do "yet another
> abstract language-integrated Spark SQL" feature.
>
> Given resources, IMO it'd be better to take on fewer goals but make them
> shine. So I'd do a spark-based seq2sparse version first and that'd give some
> ideas how to create ports/abstractions of that work to Flink.
>
>
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo  wrote:
>
>>
>> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>>
>>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post
>>> them there since they go beyond the scope of that PR's work, to chase the
>>> root of the issue.
>>>
>>> on quasi-algebraic methods
>>> 
>>>
>>> What is the dilemma here? I don't see any.
>>>
>>> I already explained that no more than 25% of algorithms are truly 100%
>>> algebraic. But about 80% cannot avoid using some algebra and close to 95%
>>> could benefit from using algebra (even stochastic and monte carlo stuff).
>>>
>>> So we are building a system that allows us to cut developer's work by at
>>> least 60% and make his work also more readable by 3000%. As far as I am
>>> concerned, that fulfills the goal. And I am perfectly happy writing a mix
>>> of engine-specific primitives and algebra.
>>>
>>> That's why I am a bit skeptical about attempts to abstract non-algebraic
>>> primitives such as row-wise aggregators in one of the pull requests.
>>> Engine-specific primitives and algebra can perfectly co-exist in the
>>> guts. And that's how I am doing my stuff in practice, except I now can
>>> skip 80% effort on algebra and bridging incompatible inputs-outputs.
>>>
>> I am **definitely** not advocating messing with the algebraic optimizer.
>> That was what I saw as the plus side to Gokhan's PR: a separate engine
>> abstraction for quasi/non-algebraic distributed methods. I didn't comment
>> on the PR either because admittedly I did not have a chance to spend a lot
>> of time on it. But my quick takeaway was that we could take some very
>> useful and hopefully (close to) ubiquitous distributed operators and pass
>> them through to the engine "guts".
>>
>> I briefly looked through some of the flink and h2o code and noticed
>> Flink's aggregateOperator [1] and h2o's MapReduce API [2]; my thought was
>> that we could write pass-through operators for some of the more useful
>> operations from math-scala and then implement them fully in their
>> respective packages. Though I am not sure how this would work in either
>> case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or flink
>> for that matter. Again, I haven't had a lot of time to look at these and
>> see if this would work at all.
>>
>> My thought was not to bring primitive engine-specific aggregators,
>> combiners, etc. into math-scala.
>>
>> I had thought, though, that we were trying to develop a fully
>> engine-agnostic algorithm library on top of the R-like distributed BLAS.
>>
>>
>> So would the idea be to implement, e.g., seq2sparse fully in the spark
>> module? It would seem to fracture the project a bit.
>>
>>
>> Or to implement algorithms sequentially if mapBlock() will not suffice
>> and then optimize them in their respective modules?
>>
>>
>>
>>
>>> None of that means that R-like algebra cannot be engine agnostic. So
>>> people are unhappy about not being able to write the whole in a totally
>>> agnostic way? And so they (falsely) infer the pieces of their work cannot
>>> be helped by agnosticism individually, or that the tools are not as good
>>> as they might be without backend agnosticism? Sorry, but I fail to see
>>> the logic there.
>>>
>>> We proved algebra can be agnostic. I don't think this notion should be
>>> disputed.
>>>
>>> And even if there were a shred of real benefit by making algebra tools
>>> un-agnostic, it would not ever outweigh tons of good we could get for the
>>> project by integrating with e.g. Flink folks. This is one of the points
>>> MLlib will never be able to overcome -- to be a truly shared ML platform
>>> where people could create and share ML, but not just a bunch of ad-hoc
>>> spaghetti of distributed API calls and Spark-nailed black boxes. …

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
Spark's DataFrame is obviously not agnostic.

I don't believe there's a good way to abstract it. Unfortunately. I think
getting too much into distributed operation abstraction is a bit dangerous.

I think MLI was one project that attempted to do that -- but it did not
take off, I guess. Or at least there were 0 commits in like 18 months there,
if I am not mistaken, and it never made it into the spark tree.

So it is a good question: if we need a dataframe in flink, what do we do? I
am open to suggestions. I very much don't want to do "yet another abstract
language-integrated Spark SQL" feature.

Given resources, IMO it'd be better to take on fewer goals but make them
shine. So I'd do a spark-based seq2sparse version first and that'd give some
ideas how to create ports/abstractions of that work to Flink.



On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo  wrote:

>
> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>
>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
>> there since they go beyond the scope of that PR's work, to chase the
>> root of the issue.
>>
>> on quasi-algebraic methods
>> 
>>
>> What is the dilemma here? I don't see any.
>>
>> I already explained that no more than 25% of algorithms are truly 100%
>> algebraic. But about 80% cannot avoid using some algebra and close to 95%
>> could benefit from using algebra (even stochastic and monte carlo stuff).
>>
>> So we are building a system that allows us to cut developer's work by at
>> least 60% and make his work also more readable by 3000%. As far as I am
>> concerned, that fulfills the goal. And I am perfectly happy writing a mix
>> of engine-specific primitives and algebra.
>>
>> That's why I am a bit skeptical about attempts to abstract non-algebraic
>> primitives such as row-wise aggregators in one of the pull requests.
>> Engine-specific primitives and algebra can perfectly co-exist in the guts.
>> And that's how I am doing my stuff in practice, except I now can skip 80%
>> effort on algebra and bridging incompatible inputs-outputs.
>>
> I am **definitely** not advocating messing with the algebraic optimizer.
> That was what I saw as the plus side to Gokhan's PR: a separate engine
> abstraction for quasi/non-algebraic distributed methods. I didn't comment
> on the PR either because admittedly I did not have a chance to spend a lot
> of time on it. But my quick takeaway was that we could take some very
> useful and hopefully (close to) ubiquitous distributed operators and pass
> them through to the engine "guts".
>
> I briefly looked through some of the flink and h2o code and noticed
> Flink's aggregateOperator [1] and h2o's MapReduce API [2]; my thought was
> that we could write pass-through operators for some of the more useful
> operations from math-scala and then implement them fully in their
> respective packages. Though I am not sure how this would work in either
> case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or flink
> for that matter. Again, I haven't had a lot of time to look at these and
> see if this would work at all.
>
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.
>
> I had thought, though, that we were trying to develop a fully
> engine-agnostic algorithm library on top of the R-like distributed BLAS.
>
>
> So would the idea be to implement, e.g., seq2sparse fully in the spark
> module? It would seem to fracture the project a bit.
>
>
> Or to implement algorithms sequentially if mapBlock() will not suffice and
> then optimize them in their respective modules?
>
>
>
>
>> None of that means that R-like algebra cannot be engine agnostic. So
>> people are unhappy about not being able to write the whole in a totally
>> agnostic way? And so they (falsely) infer the pieces of their work cannot
>> be helped by agnosticism individually, or that the tools are not as good
>> as they might be without backend agnosticism? Sorry, but I fail to see
>> the logic there.
>>
>> We proved algebra can be agnostic. I don't think this notion should be
>> disputed.
>>
>> And even if there were a shred of real benefit by making algebra tools
>> un-agnostic, it would not ever outweigh tons of good we could get for the
>> project by integrating with e.g. Flink folks. This is one of the points
>> MLlib will never be able to overcome -- to be a truly shared ML platform
>> where people could create and share ML, but not just a bunch of ad-hoc
>> spaghetti of distributed API calls and Spark-nailed black boxes.
>>
>> Well yes, methodology implementations will still have native distributed
>> calls. Just not nearly as many as they otherwise would, and they will be
>> much easier to support on another back-end using Strategy patterns. E.g.
>> the implicit feedback problem that I originally wrote as a quasi-method
>> for Spark only would've taken just an hour or so to add a strategy for
>> Flink, since it retains all in-core and distributed algebra work as is. …

Re: Codebase refactoring proposal

2015-02-04 Thread Andrew Palumbo


On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:

Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
there since they go beyond the scope of that PR's work, to chase the
root of the issue.

on quasi-algebraic methods


What is the dilemma here? I don't see any.

I already explained that no more than 25% of algorithms are truly 100%
algebraic. But about 80% cannot avoid using some algebra and close to 95%
could benefit from using algebra (even stochastic and monte carlo stuff).

So we are building a system that allows us to cut developer's work by at
least 60% and make his work also more readable by 3000%. As far as I am
concerned, that fulfills the goal. And I am perfectly happy writing a mix
of engine-specific primitives and algebra.

That's why I am a bit skeptical about attempts to abstract non-algebraic
primitives such as row-wise aggregators in one of the pull requests.
Engine-specific primitives and algebra can perfectly co-exist in the guts.
And that's how I am doing my stuff in practice, except I now can skip 80%
effort on algebra and bridging incompatible inputs-outputs.
I am **definitely** not advocating messing with the algebraic
optimizer. That was what I saw as the plus side to Gokhan's PR: a
separate engine abstraction for quasi/non-algebraic distributed methods.
I didn't comment on the PR either because admittedly I did not have a
chance to spend a lot of time on it. But my quick takeaway was that we
could take some very useful and hopefully (close to) ubiquitous
distributed operators and pass them through to the engine "guts".


I briefly looked through some of the flink and h2o code and noticed
Flink's aggregateOperator [1] and h2o's MapReduce API [2]; my thought was
that we could write pass-through operators for some of the more useful
operations from math-scala and then implement them fully in their
respective packages. Though I am not sure how this would work in either
case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or flink
for that matter. Again, I haven't had a lot of time to look at these and
see if this would work at all.


My thought was not to bring primitive engine-specific aggregators,
combiners, etc. into math-scala.


I had thought, though, that we were trying to develop a fully engine-agnostic
algorithm library on top of the R-like distributed BLAS.



So would the idea be to implement, e.g., seq2sparse fully in the spark
module? It would seem to fracture the project a bit.



Or to implement algorithms sequentially if mapBlock() will not suffice 
and then optimize them in their respective modules?
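
For reference, mapBlock() is the engine-agnostic block-wise hook in question
in the math-scala DSL; a typical use, assuming drmA is a given DRM and the
in-place functional-assignment form from the Scala bindings is available:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // square every element, block by block, on whatever engine backs drmA
    val drmSquared = drmA.mapBlock() { case (keys, block) =>
      block := ((r, c, x) => x * x)
      keys -> block
    }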





None of that means that R-like algebra cannot be engine agnostic. So people
are unhappy about not being able to write the whole in a totally agnostic way?
And so they (falsely) infer the pieces of their work cannot be helped by
agnosticism individually, or that the tools are not as good as they might
be without backend agnosticism? Sorry, but I fail to see the logic there.

We proved algebra can be agnostic. I don't think this notion should be
disputed.

And even if there were a shred of real benefit by making algebra tools
un-agnostic, it would not ever outweigh tons of good we could get for the
project by integrating with e.g. Flink folks. This is one of the points MLlib
will never be able to overcome -- to be a truly shared ML platform where
people could create and share ML, but not just a bunch of ad-hoc spaghetti
of distributed API calls and Spark-nailed black boxes.

Well yes, methodology implementations will still have native distributed
calls. Just not nearly as many as they otherwise would, and they will be much
easier to support on another back-end using Strategy patterns. E.g. the
implicit feedback problem that I originally wrote as a quasi-method for Spark
only would've taken just an hour or so to add a strategy for Flink, since it
retains all in-core and distributed algebra work as is.

Not to mention benefit of single type pipelining.

And once we add hardware-accelerated bindings for in-core stuff, all these
methods would immediately benefit from it.

On MLlib interoperability issues

Well, let me ask you this: what does it mean to be MLlib-interoperable? Is
MLlib even interoperable within itself?

E.g., I remember there was one most frequent request on the list here: how
can we cluster dimensionally-reduced data?

Let's look what it takes to do this in MLLib: First, we run tf-idf, which
produces collection of vectors (and where did our document ids go? not
sure); then we'd have to run svd or pca, both of which would accept
RowMatrix (bummer! but we have collection of vectors); which would produce
RowMatrix as well but kmeans training takes RDD of vectors (bummer again!).

Not directly pluggable, although semi-trivially or trivially convertible.
Plus it strips off information that we potentially have already computed
earlier in the pipeline, so we'd need to compute it again. …

Re: Codebase refactoring proposal

2015-02-04 Thread Dmitriy Lyubimov
Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
there since they go beyond the scope of that PR's work, to chase the
root of the issue.

on quasi-algebraic methods


What is the dilemma here? I don't see any.

I already explained that no more than 25% of algorithms are truly 100%
algebraic. But about 80% cannot avoid using some algebra and close to 95%
could benefit from using algebra (even stochastic and monte carlo stuff).

So we are building a system that allows us to cut developer's work by at
least 60% and make his work also more readable by 3000%. As far as I am
concerned, that fulfills the goal. And I am perfectly happy writing a mix
of engine-specific primitives and algebra.

That's why I am a bit skeptical about attempts to abstract non-algebraic
primitives such as row-wise aggregators in one of the pull requests.
Engine-specific primitives and algebra can perfectly co-exist in the guts.
And that's how I am doing my stuff in practice, except I now can skip 80%
effort on algebra and bridging incompatible inputs-outputs.

None of that means that R-like algebra cannot be engine agnostic. So people
are unhappy about not being able to write the whole in a totally agnostic way?
And so they (falsely) infer the pieces of their work cannot be helped by
agnosticism individually, or that the tools are not as good as they might
be without backend agnosticism? Sorry, but I fail to see the logic there.

We proved algebra can be agnostic. I don't think this notion should be
disputed.

And even if there were a shred of real benefit by making algebra tools
un-agnostic, it would not ever outweigh tons of good we could get for the
project by integrating with e.g. Flink folks. This is one of the points MLlib
will never be able to overcome -- to be a truly shared ML platform where
people could create and share ML, but not just a bunch of ad-hoc spaghetti
of distributed API calls and Spark-nailed black boxes.

Well yes, methodology implementations will still have native distributed
calls. Just not nearly as many as they otherwise would, and they will be much
easier to support on another back-end using Strategy patterns. E.g. the
implicit feedback problem that I originally wrote as a quasi-method for Spark
only would've taken just an hour or so to add a strategy for Flink, since it
retains all in-core and distributed algebra work as is.
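
A sketch of that Strategy shape (all names hypothetical): the algebraic body
stays shared, and only the engine-specific step is swapped per backend.

    import org.apache.mahout.math.drm.DrmLike

    trait EngineStrategy {
      // the one genuinely engine-specific step of an otherwise algebraic method
      def specificStep[K](drm: DrmLike[K]): DrmLike[K]
    }

    object ImplicitFeedbackSolver {
      // shared in-core/distributed algebra; only `strategy` differs per engine
      def run[K](drmA: DrmLike[K])(implicit strategy: EngineStrategy): DrmLike[K] =
        strategy.specificStep(drmA)   // ... followed by the shared algebra
    }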

Not to mention benefit of single type pipelining.

And once we add hardware-accelerated bindings for in-core stuff, all these
methods would immediately benefit from it.

On MLlib interoperability issues

Well, let me ask you this: what does it mean to be MLlib-interoperable? Is
MLlib even interoperable within itself?

E.g., I remember there was one most frequent request on the list here: how
can we cluster dimensionally-reduced data?

Let's look what it takes to do this in MLLib: First, we run tf-idf, which
produces collection of vectors (and where did our document ids go? not
sure); then we'd have to run svd or pca, both of which would accept
RowMatrix (bummer! but we have collection of vectors); which would produce
RowMatrix as well but kmeans training takes RDD of vectors (bummer again!).

Not directly pluggable, although semi-trivially or trivially convertible.
Plus it strips off information that we potentially have already computed
earlier in the pipeline, so we'd need to compute it again. I think the
problem is well demonstrated.
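
Spelled out against the Spark 1.x MLlib APIs, that pipeline reads roughly
like this (a sketch; note the wrap/unwrap churn, and that document ids are
already gone after the first step):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def clusterReduced(docs: RDD[Seq[String]], k: Int) = {
      val tf: RDD[Vector]    = new HashingTF().transform(docs)  // ids already lost
      val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)  // collection of vectors
      val mat = new RowMatrix(tfidf)                  // wrap: pca wants RowMatrix
      val pc  = mat.computePrincipalComponents(50)    // local Matrix
      val reduced = mat.multiply(pc).rows             // unwrap: kmeans wants RDD
      KMeans.train(reduced, k, 20)
    }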

Or, say, ALS stuff (implicit als in particular) is really an algebraic
problem. It should be taking input in the form of matrices (that my feature
extraction algebraic pipeline perhaps has just prepared) but it really takes
POJOs. Bummer again.
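
Concretely, MLlib's ALS entry point wants Rating POJOs rather than a matrix,
so an algebraically prepared DRM has to be flattened into triples first; a
hedged sketch (row-RDD access assumed):

    import org.apache.mahout.math.{Vector => MahoutVector}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD
    import scala.collection.JavaConverters._

    def alsFromDrmRows(rows: RDD[(Int, MahoutVector)]) = {
      val ratings = rows.flatMap { case (user, v) =>
        v.nonZeroes().asScala.map(e => Rating(user, e.index, e.get))
      }
      ALS.trainImplicit(ratings, rank = 10, iterations = 10)
    }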

So what it is exactly we should be interoperable with in this picture if
MLLib itself is not consistent?

Let's look at the type system in flux there:

we have
(1) collection of vectors,
(2) matrix of known dimensions for collection of vectors (row matrix),
(3) IndexedRowMatrix, which is a matrix of known dimension with keys that can
be _only_ long; and
(4) an unknown but not infinitesimal number of POJO-oriented approaches.

But ok, let's constrain ourselves to matrix types only.

The multitude of matrix types creates problems for tasks that require
consistent key propagation (like SVD or PCA or tf-idf, well demonstrated
in the case of mllib). In the aforementioned case of dimensionality
reduction over a document collection, there's simply no way to propagate
document ids to the rows of dimensionally-reduced data. As in none at all;
as in a hard, no-workaround-exists stop.

So. There's truly no need for multiple incompatible matrix types. There has
to be just a single matrix type. Just a flexible one. And everything
algebraic needs to use it.

And if geometry is needed, it can be either already known or lazily computed;
if it is not needed, nobody bothers to compute it. And that knowledge should
not be lost just because we have to convert between types.
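
A hypothetical wrapper illustrating the point (not Mahout's actual class,
though the checkpointed DRMs behave along these lines):

import org.apache.spark.rdd.RDD
import org.apache.mahout.math.Vector

// Geometry is carried lazily: known if supplied, computed once if asked for,
// and never computed if nobody needs it.
class KeyedMatrix[K](val rows: RDD[(K, Vector)]) {
  lazy val ncol: Int  = rows.first()._2.size // assumes uniform row cardinality
  lazy val nrow: Long = rows.count()
}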

A


Re: Codebase refactoring proposal

2015-02-04 Thread Andrew Palumbo
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
[INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile


On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote:

Looks like Guava is in Spark.


Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov wrote:

looks like it is also requested by mahout-math, wonder what is using it
there.

At the very least, it needs to be synchronized to the one currently used by
spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*


Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
IndexedDataset uses Guava. Can't tell for sure, but it sounds like this
would not be included since I think it was taken from the mrlegacy jar.

On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote:

-- Forwarded message --
From: "Pat Ferrel"
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it you are putting some class jars somewhere in the
> classpath. Where? How?

/bin/mahout

(Computes 2 different classpaths. See 'bin/mahout classpath' vs.
'bin/mahout -spark'.)

If i interpret the current shell code there correctly, the legacy path tries
to use examples assemblies if not packaged, or /lib if packaged. The true
motivation of that significantly predates 2010 and i suspect only Benson
knows the whole true intent there.

The spark path, which is really a quick hack of the script, tries to get
only selected mahout jars and the locally installed spark classpath, which i
guess is just the shaded spark jar in recent spark releases. It also
apparently tries to include /libs/*, which is never compiled in the
unpackaged version, and now i think it is a bug that it is included, because
/libs/* is apparently legacy packaging and shouldn't be used in spark jobs
with a wildcard. I can't believe how lazy i am, i still did not find time to
understand the mahout build in all cases.

I am not even sure if packaged mahout will work with spark, honestly,
because of the /lib. Never tried that, since i mostly use application
embedding techniques.

The same solution may apply to adding external dependencies and removing
the assembly in the Spark module. Which would leave only one major build
issue afaik.

On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote:

No, no PR. Only an experiment on private. But i believe i sufficiently
defined what i want to do in order to gauge if we may want to advance it
some time later. The goal is a much lighter dependency for spark code.
Eliminate everything that is not compile-time dependent (and a lot of it
comes through legacy MR code, which we of course don't use).

Can't say i understand the remaining issues you are talking about though.

If you are talking about compiling lib or a shaded assembly, no, this
doesn't do anything about it. Although the point is, as it stands, the
algebra and shell don't have any external dependencies but spark and these
4 (5?) mahout jars, so they technically don't even need an assembly (as
demonstrated).

As i said, it seems driver code is the only one that may need some external
dependencies, but that's a different scenario from those i am talking
about. But i am relatively happy with having the first two working nicely
at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel wrote:

+1

Is there a PR? You mention a "tiny mahout-hadoop" module. It would be nice
to see how you've structured that in case we can use the same model to
solve the two remaining refactoring issues.
1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn wrote:

Also +1

iPhone'd

On Jan 23, 2015, at 18:38, Andrew Palumbo wrote:

+1

Sent from my Verizon Wireless 4G LTE smartphone

 Original message From: Dmitriy Lyubimov
Date: 01/23/2015 6:06 PM (GMT-05:00) To: dev@mahout.apache.org
Subject: Codebase refactoring proposal

So right now mahout-spark depends on mr-legacy.
I did a quick refactoring and it turns out it only _irrevocably_ depends on
the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...

*sigh* o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop module
(to signify stuff that is directly relevant to serializing things to the
DFS API) and completely removed mrlegacy and its transients from the spark
and spark-shell dependencies.

So non-cli applications (shell scripts and embedded api use) actually only
need spark dependencies (which come from the SPARK_HOME classpath, of
course) and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop
and optionally mahout-spark-shell (for running the shell)).

This of course still doesn't address driver problems that want to throw
more stuff into the front-end classpath (such as the cli parser), but at
least it renders the transitive luggage of mr-legacy (and the size of
worker-shipped jars) much more tolerable.

How does that sound?
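
For what it's worth, a sketch of that embedded, non-cli use under this
layout -- just the Spark classpath plus those mahout jars. I'm assuming the
sparkbindings helper mahoutSparkContext here; the exact signature may vary
by version:

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object EmbeddedAlgebra extends App {
  // only spark + mahout-math(-scala), mahout-spark, mahout-hadoop on the classpath
  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "embedded-algebra")
  val drmA = drmParallelize(dense((1, 2), (3, 4)))      // in-core matrix -> DRM
  println((drmA.t %*% drmA).checkpoint().collect)       // distributed algebra, back in-core
}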











> and
>>>>> ...
>>>>>> *sigh* o.a.m.common.Pair
>>>>>> 
>>>>>> So  I just dropped those five classes into new a new tiny
>>> mahout-hadoop
>>>>>> module (to signify stuff that is directly relevant to serializing
>>> thigns
>>>>> to
>>>>>> DFS API) and completely removed mrlegacy and its transients from
> spark
>>>>> and
>>>>>> spark-shell dependencies.
>>>>>> 
>>>>>> So non-cli applications (shell scripts and embedded api use) actually
>>>>> only
>>>>>> need spark dependencies (which come from SPARK_HOME classpath, of
>>> course)
>>>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>>>>>> optionally mahout-spark-shell (for running shell)).
>>>>>> 
>>>>>> This of course still doesn't address driver problems that want to
>>> throw
>>>>>> more stuff into front-end classpath (such as cli parser) but at least
>>> it
>>>>>> renders transitive luggage of mr-legacy (and the size of
>>> worker-shipped
>>>>>> jars) much more tolerable.
>>>>>> 
>>>>>> How does that sound?
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 



Re: Codebase refactoring proposal

2015-02-02 Thread Dmitriy Lyubimov
org.glassfish:javax.servlet:jar:3.1:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- asm:asm:jar:3.1:compile
> [INFO] |  |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
> [INFO] |  |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
> [INFO] |  |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
> [INFO] |  |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
> [INFO] |  |  |  |  |  |  | \-
> javax.activation:activation:jar:1.1:compile
> [INFO] |  |  |  |  |  |  +-
> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
> [INFO] |  |  |  |  |  |  \-
> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
> [INFO] |  |  |  |  |  \-
> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
> [INFO] |  |  |  |  \-
> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
> [INFO] |  |  |  \-
> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
> [INFO] |  |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
> [INFO] |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
> [INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
> [INFO] |  |  +-
> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
> [INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
> [INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
> [INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
> [INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
> [INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
> [INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
> [INFO] |  | \- jline:jline:jar:0.9.94:compile
> [INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
> [INFO] |  |  +-
> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
> [INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
> [INFO] |  |  |  \-
> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
> [INFO] |  | \-
>
> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
> [INFO] |  |\-
> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
> [INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
> [INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
> [INFO] |  |  +-
> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
> [INFO] |  |  +-
> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
> [INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
> [INFO] |  | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
> [INFO] |  +- com.google.guava:guava:jar:16.0:compile
>
> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov 
> wrote:
>
> > looks like it is also requested by mahout-math, wonder what is using it
> > there.
> >
> > At very least, it needs to be synchronized to the one currently used by
> > spark.
> >
> > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
> > ---
> > [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> > *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> > [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> > *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> > [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> > [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> > [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> > [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> >
> >
> > On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel 
> wrote:
> >
> >> Looks like Guava is in Spark.
> >>
> >> On Jan 29, 2015, at 4:03 PM, Pat Ferrel  wrote:
> >>
> >> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this
> >> would not be included since I think it was taken from the mrlegacy jar.
> >>
> >> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov 
> wrote:
> >>
> >> -- Forwarded message ---

Re: Codebase refactoring proposal

2015-01-31 Thread Pat Ferrel
adoop-yarn-common:jar:2.2.0:compile
[INFO] |  |  +-
org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  | \- jline:jline:jar:0.9.94:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +-
org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \-
org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  | \-
org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |\-
org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +-
org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +-
org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] |  +- com.google.guava:guava:jar:16.0:compile

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov  wrote:

> looks like it is also requested by mahout-math, wonder what is using it
> there.
> 
> At very least, it needs to be synchronized to the one currently used by
> spark.
> 
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
> ---
> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
> 
> 
> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel  wrote:
> 
>> Looks like Guava is in Spark.
>> 
>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel  wrote:
>> 
>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this
>> would not be included since I think it was taken from the mrlegacy jar.
>> 
>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov  wrote:
>> 
>> -- Forwarded message --
>> From: "Pat Ferrel" 
>> Date: Jan 25, 2015 9:39 AM
>> Subject: Re: Codebase refactoring proposal
>> To: 
>> Cc:
>> 
>>> When you get a chance a PR would be good.
>> 
>> Yes, it would. And not just for that.
>> 
>>> As I understand it you are putting some class jars somewhere in the
>> classpath. Where? How?
>>> 
>> 
>> /bin/mahout
>> 
>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>> 'bin/mahout -spark'.)
>> 
>> If i interpret current shell code there correctky, legacy path tries to
>> use
>> examples assemblies if not packaged, or /lib if packaged. True motivation
>> of that significantly predates 2010 and i suspect only Benson knows whole
>> true intent there.
>> 
>> The spark path, which is really a quick hack of the script, tries to get
>> only selected mahout jars and locally instlalled spark classpath which i
>> guess is just the shaded spark jar in recent spark releases. It also
>> apparently tries to include /libs/*, which is never compiled in unpackaged
>> version, and now i think it is a bug it is included  because /libs/* is
>> apparently legacy packaging, and shouldnt be used  in spark jobs with a
>> wildcard. I cant beleive how lazy i am, i still did not find time to
>> understand mahout build in all cases.
>> 
>> I am not even sure if packaged mahout will work with spark, honestly,
>> because of the /lib. Never tried that, since i mostly use application
>> embedding techniques.

Re: Codebase refactoring proposal

2015-01-30 Thread Dmitriy Lyubimov
pile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  | \- jline:jline:jar:0.9.94:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +-
org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \-
org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  | \-
org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |\-
org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +-
org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +-
org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] |  +- com.google.guava:guava:jar:16.0:compile

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov  wrote:

> looks like it is also requested by mahout-math, wonder what is using it
> there.
>
> At very least, it needs to be synchronized to the one currently used by
> spark.
>
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
> ---
> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
> *[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
> [INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
>
>
> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel  wrote:
>
>> Looks like Guava is in Spark.
>>
>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel  wrote:
>>
>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this
>> would not be included since I think it was taken from the mrlegacy jar.
>>
>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov  wrote:
>>
>> -- Forwarded message --
>> From: "Pat Ferrel" 
>> Date: Jan 25, 2015 9:39 AM
>> Subject: Re: Codebase refactoring proposal
>> To: 
>> Cc:
>>
>> > When you get a chance a PR would be good.
>>
>> Yes, it would. And not just for that.
>>
>> > As I understand it you are putting some class jars somewhere in the
>> classpath. Where? How?
>> >
>>
>> /bin/mahout
>>
>> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
>> 'bin/mahout -spark'.)
>>
>> If i interpret current shell code there correctky, legacy path tries to
>> use
>> examples assemblies if not packaged, or /lib if packaged. True motivation
>> of that significantly predates 2010 and i suspect only Benson knows whole
>> true intent there.
>>
>> The spark path, which is really a quick hack of the script, tries to get
>> only selected mahout jars and locally instlalled spark classpath which i
>> guess is just the shaded spark jar in recent spark releases. It also
>> apparently tries to include /libs/*, which is never compiled in unpackaged
>> version, and now i think it is a bug it is included  because /libs/* is
>> apparently legacy packaging, and shouldnt be used  in spark jobs with a
>> wildcard. I cant beleive how lazy i am, i still did not find time to
>> understand mahout build in all cases.
>>
>> I am not even sure if packaged mahout will work with spark, honestly,
>> because of the /lib. Never tried that, since i mostly use application
>> embedding techniques.
>>
>> The same solution may apply to adding external dependencies and removing
>> the assembly in the Spark module. Which would leave only one major build
>> issue afaik.
>> >
>> > On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov 
>> wrote:
>> >
>> > No, no PR. Only experiment on private. But i believe i sufficiently
>> defined
>> > what i want to do in order to gauge if we may want to advance it some
>> time
>> > later. Goal is much lighter dependency for spark code. Eliminate
>> everything

Re: Codebase refactoring proposal

2015-01-30 Thread Dmitriy Lyubimov
looks like it is also requested by mahout-math, wonder what is using it
there.

At the very least, it needs to be synchronized with the one currently used by
spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop
---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] |  +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] |  +- com.google.guava:guava:jar:16.0:compile*
[INFO] |  \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
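
One way to keep the two in step is a dependencyManagement pin in the parent
pom; a sketch only, and the version to pin is whatever the target spark
release actually ships (16.0 below just mirrors the tree above):

<dependencyManagement>
  <dependencies>
    <!-- pin guava across modules; use the version the target spark ships -->
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>16.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Re-running mvn dependency:tree -Dincludes=com.google.guava after such a pin
is a quick way to confirm that a single version survives mediation.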


On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel  wrote:

> Looks like Guava is in Spark.
>
> On Jan 29, 2015, at 4:03 PM, Pat Ferrel  wrote:
>
> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this
> would not be included since I think it was taken from the mrlegacy jar.
>
> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov  wrote:
>
> -- Forwarded message --
> From: "Pat Ferrel" 
> Date: Jan 25, 2015 9:39 AM
> Subject: Re: Codebase refactoring proposal
> To: 
> Cc:
>
> > When you get a chance a PR would be good.
>
> Yes, it would. And not just for that.
>
> > As I understand it you are putting some class jars somewhere in the
> classpath. Where? How?
> >
>
> /bin/mahout
>
> (Computes 2 different classpaths. See  'bin/mahout classpath' vs.
> 'bin/mahout -spark'.)
>
> If i interpret current shell code there correctky, legacy path tries to use
> examples assemblies if not packaged, or /lib if packaged. True motivation
> of that significantly predates 2010 and i suspect only Benson knows whole
> true intent there.
>
> The spark path, which is really a quick hack of the script, tries to get
> only selected mahout jars and locally instlalled spark classpath which i
> guess is just the shaded spark jar in recent spark releases. It also
> apparently tries to include /libs/*, which is never compiled in unpackaged
> version, and now i think it is a bug it is included  because /libs/* is
> apparently legacy packaging, and shouldnt be used  in spark jobs with a
> wildcard. I cant beleive how lazy i am, i still did not find time to
> understand mahout build in all cases.
>
> I am not even sure if packaged mahout will work with spark, honestly,
> because of the /lib. Never tried that, since i mostly use application
> embedding techniques.
>
> The same solution may apply to adding external dependencies and removing
> the assembly in the Spark module. Which would leave only one major build
> issue afaik.
> >
> > On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov 
> wrote:
> >
> > No, no PR. Only experiment on private. But i believe i sufficiently
> defined
> > what i want to do in order to gauge if we may want to advance it some
> time
> > later. Goal is much lighter dependency for spark code. Eliminate
> everything
> > that is not compile-time dependent. (and a lot of it is thru legacy MR
> code
> > which we of course don't use).
> >
> > Cant say i understand the remaining issues you are talking about though.
> >
> > If you are talking about compiling lib or shaded assembly, no, this
> doesn't
> > do anything about it. Although point is, as it stands, the algebra and
> > shell don't have any external dependencies but spark and these 4 (5?)
> > mahout jars so they technically don't even need an assembly (as
> > demonstrated).
> >
> > As i said, it seems driver code is the only one that may need some
> external
> > dependencies, but that's a different scenario from those i am talking
> > about. But i am relatively happy with having the first two working nicely
> > at this point.
> >
> > On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel 
> wrote:
> >
> >> +1
> >>
> >> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
> nice
> >> to see how you’ve structured that in case we can use the same model to
> >> solve the two remaining refactoring issues.
> >> 1) external dependencies in the spark module
> >> 2) no spark or h2o in the release artifacts.
> >>
> >> On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
> >>
> >> Also +1
> >>
> >> iPhone'd
> >>
> >>> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
> >>>
> >>> +1
> >>>
> >>>
> >>> Sent from my Verizon Wireless 4G LTE smartphone

Re: Codebase refactoring proposal

2015-01-30 Thread Pat Ferrel
Looks like Guava is in Spark.

On Jan 29, 2015, at 4:03 PM, Pat Ferrel  wrote:

IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this would
not be included since I think it was taken from the mrlegacy jar.

On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov  wrote:

-- Forwarded message --
From: "Pat Ferrel" 
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: 
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it you are putting some class jars somewhere in the
classpath. Where? How?
> 

/bin/mahout

(Computes 2 different classpaths. See  'bin/mahout classpath' vs.
'bin/mahout -spark'.)

If i interpret current shell code there correctly, legacy path tries to use
examples assemblies if not packaged, or /lib if packaged. True motivation
of that significantly predates 2010 and i suspect only Benson knows whole
true intent there.

The spark path, which is really a quick hack of the script, tries to get
only selected mahout jars and locally installed spark classpath which i
guess is just the shaded spark jar in recent spark releases. It also
apparently tries to include /libs/*, which is never compiled in unpackaged
version, and now i think it is a bug it is included because /libs/* is
apparently legacy packaging, and shouldnt be used in spark jobs with a
wildcard. I cant believe how lazy i am, i still did not find time to
understand mahout build in all cases.

I am not even sure if packaged mahout will work with spark, honestly,
because of the /lib. Never tried that, since i mostly use application
embedding techniques.

The same solution may apply to adding external dependencies and removing
the assembly in the Spark module. Which would leave only one major build
issue afaik.
> 
> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov  wrote:
> 
> No, no PR. Only experiment on private. But i believe i sufficiently
defined
> what i want to do in order to gauge if we may want to advance it some time
> later. Goal is much lighter dependency for spark code. Eliminate
everything
> that is not compile-time dependent. (and a lot of it is thru legacy MR
code
> which we of course don't use).
> 
> Cant say i understand the remaining issues you are talking about though.
> 
> If you are talking about compiling lib or shaded assembly, no, this
doesn't
> do anything about it. Although point is, as it stands, the algebra and
> shell don't have any external dependencies but spark and these 4 (5?)
> mahout jars so they technically don't even need an assembly (as
> demonstrated).
> 
> As i said, it seems driver code is the only one that may need some
external
> dependencies, but that's a different scenario from those i am talking
> about. But i am relatively happy with having the first two working nicely
> at this point.
> 
> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel  wrote:
> 
>> +1
>> 
>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
nice
>> to see how you’ve structured that in case we can use the same model to
>> solve the two remaining refactoring issues.
>> 1) external dependencies in the spark module
>> 2) no spark or h2o in the release artifacts.
>> 
>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
>> 
>> Also +1
>> 
>> iPhone'd
>> 
>>> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
>>> 
>>> +1
>>> 
>>> 
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>> 
>>>  Original message From: Dmitriy
Lyubimov
>>  Date:01/23/2015  6:06 PM  (GMT-05:00)
>> To: dev@mahout.apache.org Subject: Codebase
>> refactoring proposal 
>>> 
>>> So right now mahout-spark depends on mr-legacy.
>>> I did quick refactoring and it turns out it only _irrevocably_ depends
on
>>> the following classes there:
>>> 
>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
>> ...
>>> *sigh* o.a.m.common.Pair
>>> 
>>> So  I just dropped those five classes into new a new tiny mahout-hadoop
>>> module (to signify stuff that is directly relevant to serializing
thigns
>> to
>>> DFS API) and completely removed mrlegacy and its transients from spark
>> and
>>> spark-shell dependencies.
>>> 
>>> So non-cli applications (shell scripts and embedded api use) actually
>> only
>>> need spark dependencies (which come from SPARK_HOME classpath, of
course)
>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>>> optionally mahout-spark-shell (for running shell)).
>>> 
>>> This of course still doesn't address driver problems that want to throw
>>> more stuff into front-end classpath (such as cli parser) but at least
it
>>> renders transitive luggage of mr-legacy (and the size of worker-shipped
>>> jars) much more tolerable.
>>> 
>>> How does that sound?
>> 
>> 
> 




Re: Codebase refactoring proposal

2015-01-29 Thread Pat Ferrel
IndexedDataset uses Guava. Can’t tell for sure, but it sounds like this would
not be included since I think it was taken from the mrlegacy jar.

On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov  wrote:

-- Forwarded message --
From: "Pat Ferrel" 
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: 
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it you are putting some class jars somewhere in the
classpath. Where? How?
> 

/bin/mahout

(Computes 2 different classpaths. See  'bin/mahout classpath' vs.
'bin/mahout -spark'.)

If i interpret current shell code there correctly, legacy path tries to use
examples assemblies if not packaged, or /lib if packaged. True motivation
of that significantly predates 2010 and i suspect only Benson knows whole
true intent there.

The spark path, which is really a quick hack of the script, tries to get
only selected mahout jars and locally installed spark classpath which i
guess is just the shaded spark jar in recent spark releases. It also
apparently tries to include /libs/*, which is never compiled in unpackaged
version, and now i think it is a bug it is included because /libs/* is
apparently legacy packaging, and shouldnt be used in spark jobs with a
wildcard. I cant believe how lazy i am, i still did not find time to
understand mahout build in all cases.

I am not even sure if packaged mahout will work with spark, honestly,
because of the /lib. Never tried that, since i mostly use application
embedding techniques.

The same solution may apply to adding external dependencies and removing
the assembly in the Spark module. Which would leave only one major build
issue afaik.
> 
> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov  wrote:
> 
> No, no PR. Only experiment on private. But i believe i sufficiently
defined
> what i want to do in order to gauge if we may want to advance it some time
> later. Goal is much lighter dependency for spark code. Eliminate
everything
> that is not compile-time dependent. (and a lot of it is thru legacy MR
code
> which we of course don't use).
> 
> Cant say i understand the remaining issues you are talking about though.
> 
> If you are talking about compiling lib or shaded assembly, no, this
doesn't
> do anything about it. Although point is, as it stands, the algebra and
> shell don't have any external dependencies but spark and these 4 (5?)
> mahout jars so they technically don't even need an assembly (as
> demonstrated).
> 
> As i said, it seems driver code is the only one that may need some
external
> dependencies, but that's a different scenario from those i am talking
> about. But i am relatively happy with having the first two working nicely
> at this point.
> 
> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel  wrote:
> 
>> +1
>> 
>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
nice
>> to see how you’ve structured that in case we can use the same model to
>> solve the two remaining refactoring issues.
>> 1) external dependencies in the spark module
>> 2) no spark or h2o in the release artifacts.
>> 
>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
>> 
>> Also +1
>> 
>> iPhone'd
>> 
>>> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
>>> 
>>> +1
>>> 
>>> 
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>> 
>>>  Original message From: Dmitriy
Lyubimov
>>  Date:01/23/2015  6:06 PM  (GMT-05:00)
>> To: dev@mahout.apache.org Subject: Codebase
>> refactoring proposal 
>>> 
>>> So right now mahout-spark depends on mr-legacy.
>>> I did quick refactoring and it turns out it only _irrevocably_ depends
on
>>> the following classes there:
>>> 
>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
>> ...
>>> *sigh* o.a.m.common.Pair
>>> 
>>> So  I just dropped those five classes into new a new tiny mahout-hadoop
>>> module (to signify stuff that is directly relevant to serializing
thigns
>> to
>>> DFS API) and completely removed mrlegacy and its transients from spark
>> and
>>> spark-shell dependencies.
>>> 
>>> So non-cli applications (shell scripts and embedded api use) actually
>> only
>>> need spark dependencies (which come from SPARK_HOME classpath, of
course)
>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>>> optionally mahout-spark-shell (for running shell)).
>>> 
>>> This of course still doesn't address driver problems that want to throw
>>> more stuff into front-end classpath (such as cli parser) but at least
it
>>> renders transitive luggage of mr-legacy (and the size of worker-shipped
>>> jars) much more tolerable.
>>> 
>>> How does that sound?
>> 
>> 
> 



Re: Codebase refactoring proposal

2015-01-25 Thread Dmitriy Lyubimov
-- Forwarded message --
From: "Pat Ferrel" 
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To: 
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

>As I understand it you are putting some class jars somewhere in the
classpath. Where? How?
>

/bin/mahout

(Computes 2 different classpaths. See  'bin/mahout classpath' vs.
'bin/mahout -spark'.)

If i interpret current shell code there correctly, legacy path tries to use
examples assemblies if not packaged, or /lib if packaged. True motivation
of that significantly predates 2010 and i suspect only Benson knows whole
true intent there.

The spark path, which is really a quick hack of the script, tries to get
only selected mahout jars and locally installed spark classpath which i
guess is just the shaded spark jar in recent spark releases. It also
apparently tries to include /libs/*, which is never compiled in unpackaged
version, and now i think it is a bug it is included because /libs/* is
apparently legacy packaging, and shouldnt be used in spark jobs with a
wildcard. I cant believe how lazy i am, i still did not find time to
understand mahout build in all cases.

I am not even sure if packaged mahout will work with spark, honestly,
because of the /lib. Never tried that, since i mostly use application
embedding techniques.

The same solution may apply to adding external dependencies and removing
the assembly in the Spark module. Which would leave only one major build
issue afaik.
>
> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov  wrote:
>
> No, no PR. Only experiment on private. But i believe i sufficiently
defined
> what i want to do in order to gauge if we may want to advance it some time
> later. Goal is much lighter dependency for spark code. Eliminate
everything
> that is not compile-time dependent. (and a lot of it is thru legacy MR
code
> which we of course don't use).
>
> Cant say i understand the remaining issues you are talking about though.
>
> If you are talking about compiling lib or shaded assembly, no, this
doesn't
> do anything about it. Although point is, as it stands, the algebra and
> shell don't have any external dependencies but spark and these 4 (5?)
> mahout jars so they technically don't even need an assembly (as
> demonstrated).
>
> As i said, it seems driver code is the only one that may need some
external
> dependencies, but that's a different scenario from those i am talking
> about. But i am relatively happy with having the first two working nicely
> at this point.
>
> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel  wrote:
>
> > +1
> >
> > Is there a PR? You mention a "tiny mahout-hadoop” module. It would be
nice
> > to see how you’ve structured that in case we can use the same model to
> > solve the two remaining refactoring issues.
> > 1) external dependencies in the spark module
> > 2) no spark or h2o in the release artifacts.
> >
> > On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
> >
> > Also +1
> >
> > iPhone'd
> >
> >> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
> >>
> >> +1
> >>
> >>
> >> Sent from my Verizon Wireless 4G LTE smartphone
> >>
> >>  Original message From: Dmitriy
Lyubimov
> >  Date:01/23/2015  6:06 PM  (GMT-05:00)
> > To: dev@mahout.apache.org Subject: Codebase
> > refactoring proposal 
> >> 
> >> So right now mahout-spark depends on mr-legacy.
> >> I did quick refactoring and it turns out it only _irrevocably_ depends
on
> >> the following classes there:
> >>
> >> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
> > ...
> >> *sigh* o.a.m.common.Pair
> >>
> >> So  I just dropped those five classes into new a new tiny mahout-hadoop
> >> module (to signify stuff that is directly relevant to serializing
thigns
> > to
> >> DFS API) and completely removed mrlegacy and its transients from spark
> > and
> >> spark-shell dependencies.
> >>
> >> So non-cli applications (shell scripts and embedded api use) actually
> > only
> >> need spark dependencies (which come from SPARK_HOME classpath, of
course)
> >> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> >> optionally mahout-spark-shell (for running shell)).
> >>
> >> This of course still doesn't address driver problems that want to throw
> >> more stuff into front-end classpath (such as cli parser) but at least
it
> >> renders transitive luggage of mr-legacy (and the size of worker-shipped
> >> jars) much more tolerable.
> >>
> >> How does that sound?
> >
> >
>
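
To make the classpath split above concrete, the '-spark' path amounts to
roughly the following; a sketch only, since the real logic lives in
bin/mahout, and the jar globs and the spark helper script name here are
assumptions:

# pick only the mahout jars the spark-side code needs ...
MAHOUT_JARS=$(find "$MAHOUT_HOME" \( -name 'mahout-math*.jar' \
    -o -name 'mahout-hadoop*.jar' -o -name 'mahout-spark*.jar' \) \
    | paste -s -d: -)
# ... and take everything else from the locally installed spark distribution
CLASSPATH="$MAHOUT_JARS:$("$SPARK_HOME/bin/compute-classpath.sh")"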


Re: Codebase refactoring proposal

2015-01-25 Thread Pat Ferrel
When you get a chance a PR would be good. As I understand it you are putting 
some class jars somewhere in the classpath. Where? How? The same solution may 
apply to adding external dependencies and removing the assembly in the Spark 
module. Which would leave only one major build issue afaik.

On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov  wrote:

No, no PR. Only experiment on private. But i believe i sufficiently defined
what i want to do in order to gauge if we may want to advance it some time
later. Goal is much lighter dependency for spark code. Eliminate everything
that is not compile-time dependent. (and a lot of it is thru legacy MR code
which we of course don't use).

Cant say i understand the remaining issues you are talking about though.

If you are talking about compiling lib or shaded assembly, no, this doesn't
do anything about it. Although point is, as it stands, the algebra and
shell don't have any external dependencies but spark and these 4 (5?)
mahout jars so they technically don't even need an assembly (as
demonstrated).

As i said, it seems driver code is the only one that may need some external
dependencies, but that's a different scenario from those i am talking
about. But i am relatively happy with having the first two working nicely
at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel  wrote:

> +1
> 
> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice
> to see how you’ve structured that in case we can use the same model to
> solve the two remaining refactoring issues.
> 1) external dependencies in the spark module
> 2) no spark or h2o in the release artifacts.
> 
> On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
> 
> Also +1
> 
> iPhone'd
> 
>> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
>> 
>> +1
>> 
>> 
>> Sent from my Verizon Wireless 4G LTE smartphone
>> 
>>  Original message ----From: Dmitriy Lyubimov
>  Date:01/23/2015  6:06 PM  (GMT-05:00)
> To: dev@mahout.apache.org Subject: Codebase
> refactoring proposal 
>> 
>> So right now mahout-spark depends on mr-legacy.
>> I did quick refactoring and it turns out it only _irrevocably_ depends on
>> the following classes there:
>> 
>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
> ...
>> *sigh* o.a.m.common.Pair
>> 
>> So  I just dropped those five classes into new a new tiny mahout-hadoop
>> module (to signify stuff that is directly relevant to serializing thigns
> to
>> DFS API) and completely removed mrlegacy and its transients from spark
> and
>> spark-shell dependencies.
>> 
>> So non-cli applications (shell scripts and embedded api use) actually
> only
>> need spark dependencies (which come from SPARK_HOME classpath, of course)
>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>> optionally mahout-spark-shell (for running shell)).
>> 
>> This of course still doesn't address driver problems that want to throw
>> more stuff into front-end classpath (such as cli parser) but at least it
>> renders transitive luggage of mr-legacy (and the size of worker-shipped
>> jars) much more tolerable.
>> 
>> How does that sound?
> 
> 



Re: Codebase refactoring proposal

2015-01-24 Thread Dmitriy Lyubimov
No, no PR. Only experiment on private. But i believe i sufficiently defined
what i want to do in order to gauge if we may want to advance it some time
later. Goal is much lighter dependency for spark code. Eliminate everything
that is not compile-time dependent. (and a lot of it is thru legacy MR code
which we of course don't use).

Cant say i understand the remaining issues you are talking about though.

If you are talking about compiling lib or shaded assembly, no, this doesn't
do anything about it. Although point is, as it stands, the algebra and
shell don't have any external dependencies but spark and these 4 (5?)
mahout jars so they technically don't even need an assembly (as
demonstrated).

As i said, it seems driver code is the only one that may need some external
dependencies, but that's a different scenario from those i am talking
about. But i am relatively happy with having the first two working nicely
at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel  wrote:

> +1
>
> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice
> to see how you’ve structured that in case we can use the same model to
> solve the two remaining refactoring issues.
> 1) external dependencies in the spark module
> 2) no spark or h2o in the release artifacts.
>
> On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:
>
> Also +1
>
> iPhone'd
>
> > On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
> >
> > +1
> >
> >
> > Sent from my Verizon Wireless 4G LTE smartphone
> >
> >  Original message ----From: Dmitriy Lyubimov
>  Date:01/23/2015  6:06 PM  (GMT-05:00)
> To: dev@mahout.apache.org Subject: Codebase
> refactoring proposal 
> > 
> > So right now mahout-spark depends on mr-legacy.
> > I did quick refactoring and it turns out it only _irrevocably_ depends on
> > the following classes there:
> >
> > MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
> ...
> > *sigh* o.a.m.common.Pair
> >
> > So  I just dropped those five classes into new a new tiny mahout-hadoop
> > module (to signify stuff that is directly relevant to serializing thigns
> to
> > DFS API) and completely removed mrlegacy and its transients from spark
> and
> > spark-shell dependencies.
> >
> > So non-cli applications (shell scripts and embedded api use) actually
> only
> > need spark dependencies (which come from SPARK_HOME classpath, of course)
> > and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> > optionally mahout-spark-shell (for running shell)).
> >
> > This of course still doesn't address driver problems that want to throw
> > more stuff into front-end classpath (such as cli parser) but at least it
> > renders transitive luggage of mr-legacy (and the size of worker-shipped
> > jars) much more tolerable.
> >
> > How does that sound?
>
>


Re: Codebase refactoring proposal

2015-01-24 Thread Pat Ferrel
+1

Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice to 
see how you’ve structured that in case we can use the same model to solve the 
two remaining refactoring issues.
1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn  wrote:

Also +1

iPhone'd

> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
> 
> +1
> 
> 
> Sent from my Verizon Wireless 4G LTE smartphone
> 
>  Original message From: Dmitriy Lyubimov 
>  Date:01/23/2015  6:06 PM  (GMT-05:00) 
> To: dev@mahout.apache.org Subject: Codebase refactoring 
> proposal 
> 
> So right now mahout-spark depends on mr-legacy.
> I did quick refactoring and it turns out it only _irrevocably_ depends on
> the following classes there:
> 
> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
> *sigh* o.a.m.common.Pair
> 
> So  I just dropped those five classes into new a new tiny mahout-hadoop
> module (to signify stuff that is directly relevant to serializing thigns to
> DFS API) and completely removed mrlegacy and its transients from spark and
> spark-shell dependencies.
> 
> So non-cli applications (shell scripts and embedded api use) actually only
> need spark dependencies (which come from SPARK_HOME classpath, of course)
> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> optionally mahout-spark-shell (for running shell)).
> 
> This of course still doesn't address driver problems that want to throw
> more stuff into front-end classpath (such as cli parser) but at least it
> renders transitive luggage of mr-legacy (and the size of worker-shipped
> jars) much more tolerable.
> 
> How does that sound?



Re: Codebase refactoring proposal

2015-01-23 Thread Shannon Quinn
Also +1

iPhone'd

> On Jan 23, 2015, at 18:38, Andrew Palumbo  wrote:
> 
> +1
> 
> 
> Sent from my Verizon Wireless 4G LTE smartphone
> 
>  Original message From: Dmitriy Lyubimov 
>  Date:01/23/2015  6:06 PM  (GMT-05:00) 
> To: dev@mahout.apache.org Subject: Codebase refactoring 
> proposal 
> 
> So right now mahout-spark depends on mr-legacy.
> I did quick refactoring and it turns out it only _irrevocably_ depends on
> the following classes there:
> 
> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
> *sigh* o.a.m.common.Pair
> 
> So  I just dropped those five classes into new a new tiny mahout-hadoop
> module (to signify stuff that is directly relevant to serializing thigns to
> DFS API) and completely removed mrlegacy and its transients from spark and
> spark-shell dependencies.
> 
> So non-cli applications (shell scripts and embedded api use) actually only
> need spark dependencies (which come from SPARK_HOME classpath, of course)
> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> optionally mahout-spark-shell (for running shell)).
> 
> This of course still doesn't address driver problems that want to throw
> more stuff into front-end classpath (such as cli parser) but at least it
> renders transitive luggage of mr-legacy (and the size of worker-shipped
> jars) much more tolerable.
> 
> How does that sound?


RE: Codebase refactoring proposal

2015-01-23 Thread Andrew Palumbo
+1


Sent from my Verizon Wireless 4G LTE smartphone

 Original message From: Dmitriy Lyubimov 
 Date:01/23/2015  6:06 PM  (GMT-05:00) 
To: dev@mahout.apache.org Subject: Codebase refactoring 
proposal 

So right now mahout-spark depends on mr-legacy.
I did quick refactoring and it turns out it only _irrevocably_ depends on
the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
*sigh* o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop
module (to signify stuff that is directly relevant to serializing things to
DFS API) and completely removed mrlegacy and its transients from spark and
spark-shell dependencies.

So non-cli applications (shell scripts and embedded api use) actually only
need spark dependencies (which come from SPARK_HOME classpath, of course)
and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
optionally mahout-spark-shell (for running shell)).

This of course still doesn't address driver problems that want to throw
more stuff into front-end classpath (such as cli parser) but at least it
renders transitive luggage of mr-legacy (and the size of worker-shipped
jars) much more tolerable.

How does that sound?


Re: Codebase refactoring proposal

2015-01-23 Thread Dmitriy Lyubimov
sorry i meant _without_ mrlegacy on classpath.

On Fri, Jan 23, 2015 at 3:31 PM, Dmitriy Lyubimov  wrote:

> And in case anyone wonders yes shell starts and runs test script totally
> fine with mrlegacy dependency on classpath (startup script modified to use
> mahout-hadoop instead)  -- both in local and distributed (standalone) mode:
>
> 
>
> $ MASTER=spark://localhost:7077 bin/mahout spark-shell
>
>  _ _
>  _ __ ___   __ _| |__   ___  _   _| |_
> | '_ ` _ \ / _` | '_ \ / _ \| | | | __|
> | | | | | | (_| | | | | (_) | |_| | |_
> |_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 1.0
>
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_71)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/01/23 15:28:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
> another address
> 15/01/23 15:28:26 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> Created spark context..
> Mahout distributed context is available as "implicit val sdc".
>
>
> mahout> :load spark-shell/src/test/mahout/simple.mscala
> Loading spark-shell/src/test/mahout/simple.mscala...
> a: org.apache.mahout.math.DenseMatrix =
> {
>   0  => {0:1.0,1:2.0,2:3.0}
>   1  => {0:3.0,1:4.0,2:5.0}
> }
> drmA: org.apache.mahout.math.drm.CheckpointedDrm[Int] =
> org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5
> drmAtA: org.apache.mahout.math.drm.DrmLike[Int] =
> OpAB(OpAt(org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5
> ),org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5)
> r: org.apache.mahout.math.drm.CheckpointedDrm[Int] =
> org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@3c46dadf
> res4: org.apache.mahout.math.Matrix =
> {
>   0  => {0:11.0,1:15.0,2:19.0}
>   1  => {0:15.0,1:21.0,2:27.0}
>   2  => {0:19.0,1:27.0,2:35.0}
> }
> mahout>
>
>
> On Fri, Jan 23, 2015 at 3:07 PM, Suneel Marthi 
> wrote:
>
>> +1
>>
>> On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov 
>> wrote:
>>
>> > So right now mahout-spark depends on mr-legacy.
>> > I did quick refactoring and it turns out it only _irrevocably_ depends
>> on
>> > the following classes there:
>> >
>> > MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
>> ...
>> > *sigh* o.a.m.common.Pair
>> >
>> > So  I just dropped those five classes into new a new tiny mahout-hadoop
>> > module (to signify stuff that is directly relevant to serializing
>> thigns to
>> > DFS API) and completely removed mrlegacy and its transients from spark
>> and
>> > spark-shell dependencies.
>> >
>> > So non-cli applications (shell scripts and embedded api use) actually
>> only
>> > need spark dependencies (which come from SPARK_HOME classpath, of
>> course)
>> > and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
>> > optionally mahout-spark-shell (for running shell)).
>> >
>> > This of course still doesn't address driver problems that want to throw
>> > more stuff into front-end classpath (such as cli parser) but at least it
>> > renders transitive luggage of mr-legacy (and the size of worker-shipped
>> > jars) much more tolerable.
>> >
>> > How does that sound?
>> >
>>
>
>


Re: Codebase refactoring proposal

2015-01-23 Thread Dmitriy Lyubimov
And in case anyone wonders yes shell starts and runs test script totally
fine with mrlegacy dependency on classpath (startup script modified to use
mahout-hadoop instead)  -- both in local and distributed (standalone) mode:



$ MASTER=spark://localhost:7077 bin/mahout spark-shell

 _ _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 1.0


Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
15/01/23 15:28:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
15/01/23 15:28:26 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Created spark context..
Mahout distributed context is available as "implicit val sdc".


mahout> :load spark-shell/src/test/mahout/simple.mscala
Loading spark-shell/src/test/mahout/simple.mscala...
a: org.apache.mahout.math.DenseMatrix =
{
  0  => {0:1.0,1:2.0,2:3.0}
  1  => {0:3.0,1:4.0,2:5.0}
}
drmA: org.apache.mahout.math.drm.CheckpointedDrm[Int] =
org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5
drmAtA: org.apache.mahout.math.drm.DrmLike[Int] =
OpAB(OpAt(org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5
),org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5)
r: org.apache.mahout.math.drm.CheckpointedDrm[Int] =
org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@3c46dadf
res4: org.apache.mahout.math.Matrix =
{
  0  => {0:11.0,1:15.0,2:19.0}
  1  => {0:15.0,1:21.0,2:27.0}
  2  => {0:19.0,1:27.0,2:35.0}
}
mahout>
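
The same algebra runs embedded, with no shell involved, which is what the
"application embedding" mentioned earlier refers to. A minimal sketch
(package names per the sparkbindings of this era; master URL and app name
are illustrative):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._
import RLikeOps._
import RLikeDrmOps._

// bring up a mahout distributed context on a local spark master
implicit val sdc = mahoutSparkContext(masterUrl = "local[2]",
  appName = "embedded-algebra")

val a = dense((1, 2, 3), (3, 4, 5))   // in-core matrix
val drmA = drmParallelize(a)          // distribute it
val ata = (drmA.t %*% drmA).collect   // A'A, collected back in-core
println(ata)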


On Fri, Jan 23, 2015 at 3:07 PM, Suneel Marthi 
wrote:

> +1
>
> On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov 
> wrote:
>
> > So right now mahout-spark depends on mr-legacy.
> > I did quick refactoring and it turns out it only _irrevocably_ depends on
> > the following classes there:
> >
> > MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and
> ...
> > *sigh* o.a.m.common.Pair
> >
> > So  I just dropped those five classes into new a new tiny mahout-hadoop
> > module (to signify stuff that is directly relevant to serializing thigns
> to
> > DFS API) and completely removed mrlegacy and its transients from spark
> and
> > spark-shell dependencies.
> >
> > So non-cli applications (shell scripts and embedded api use) actually
> only
> > need spark dependencies (which come from SPARK_HOME classpath, of course)
> > and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> > optionally mahout-spark-shell (for running shell)).
> >
> > This of course still doesn't address driver problems that want to throw
> > more stuff into front-end classpath (such as cli parser) but at least it
> > renders transitive luggage of mr-legacy (and the size of worker-shipped
> > jars) much more tolerable.
> >
> > How does that sound?
> >
>


Re: Codebase refactoring proposal

2015-01-23 Thread Suneel Marthi
+1

On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov  wrote:

> So right now mahout-spark depends on mr-legacy.
> I did quick refactoring and it turns out it only _irrevocably_ depends on
> the following classes there:
>
> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
> *sigh* o.a.m.common.Pair
>
> So  I just dropped those five classes into new a new tiny mahout-hadoop
> module (to signify stuff that is directly relevant to serializing thigns to
> DFS API) and completely removed mrlegacy and its transients from spark and
> spark-shell dependencies.
>
> So non-cli applications (shell scripts and embedded api use) actually only
> need spark dependencies (which come from SPARK_HOME classpath, of course)
> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
> optionally mahout-spark-shell (for running shell)).
>
> This of course still doesn't address driver problems that want to throw
> more stuff into front-end classpath (such as cli parser) but at least it
> renders transitive luggage of mr-legacy (and the size of worker-shipped
> jars) much more tolerable.
>
> How does that sound?
>


Codebase refactoring proposal

2015-01-23 Thread Dmitriy Lyubimov
So right now mahout-spark depends on mr-legacy.
I did quick refactoring and it turns out it only _irrevocably_ depends on
the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ...
*sigh* o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop
module (to signify stuff that is directly relevant to serializing things to
DFS API) and completely removed mrlegacy and its transients from spark and
spark-shell dependencies.

So non-cli applications (shell scripts and embedded api use) actually only
need spark dependencies (which come from SPARK_HOME classpath, of course)
and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and
optionally mahout-spark-shell (for running shell)).

This of course still doesn't address driver problems that want to throw
more stuff into front-end classpath (such as cli parser) but at least it
renders transitive luggage of mr-legacy (and the size of worker-shipped
jars) much more tolerable.

How does that sound?
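
As a closing sketch, a minimal pom for the proposed mahout-hadoop module,
with dependencies taken from the tree posted earlier in the thread (parent
coordinates and layout are assumed, not from the thread):

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>1.0-SNAPSHOT</version>
  </parent>
  <artifactId>mahout-hadoop</artifactId>
  <dependencies>
    <!-- the five moved Writables build against mahout-math types ... -->
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- ... and against the hadoop Writable / DFS API -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.2.0</version>
    </dependency>
  </dependencies>
</project>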