Re: Codebase refactoring proposal
I don't know why. I said I didn't see either one as a problem; as far as I am concerned, I had encountered both needs in the past and did not even notice they were a problem. Both are not relevant to this thread, I think; I'd suggest starting a separate one. Speaking of my priorities, the two biggest problems I see are in-core performance and tons of archaic dependencies, but only one of those belongs here. The third biggest problem is general bugs and code tidiness.

On Feb 8, 2015 8:22 PM, "Pat Ferrel" wrote: [...]
Re: Codebase refactoring proposal
OK, well perhaps those two lines of code (actually I agree, there's not much more) can also be applied to TF-IDF and several other algorithms, to get a much higher level of interoperability and keep us from reinventing things when not necessary. Funny we have type conversions for so many things *but* MLlib. I've been arguing about what an uneven state MLlib is in, but it does solve problems we don't need to reinvent. Frankly, adopting the best of MLlib makes Mahout a superset, along with all its other virtues.

And yes, I forgot to also praise the DSL's optimizer—now rectified.

Why do we spend more time on engine-agnostic decisions than on these more pragmatic ones?

On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov wrote: [...]
Re: Codebase refactoring proposal
The conversion from DRM to an RDD of vectors for kmeans is one line. The kmeans application and the conversion back is another line. I actually did that some time ago; I am sure you can figure out the details.

As to whether it is worth retaining some commonality: no, it is not worth it until there's commonality across mllib itself. At that point we may just include conversions for those who are interested. Until then, all we can do is maintain commonality with mllib kmeans specifically, but not with mllib as a whole.

On Feb 8, 2015 7:45 PM, "Pat Ferrel" wrote: [...]
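For the archive, here is roughly what those two lines expand to; a minimal sketch, assuming the checkpointed DRM exposes its underlying row RDD of (key, o.a.m.Vector) pairs, and using a wasteful dense copy (the helper names and the iteration count are assumptions, not a tested recipe):

    import org.apache.mahout.math.{Vector => MahoutVector}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector => MllibVector, Vectors => MllibVectors}
    import org.apache.spark.rdd.RDD

    // "One line": DRM rows -> RDD of mllib vectors (dense copy for simplicity;
    // real data would want a sparse branch).
    def drmRowsToMllib(drmRows: RDD[(Int, MahoutVector)]): RDD[MllibVector] =
      drmRows.map { case (_, v) =>
        MllibVectors.dense((0 until v.size).map(i => v.get(i)).toArray)
      }

    // "Another line": run mllib kmeans and map the cluster assignments back.
    def clusterWithMllib(drmRows: RDD[(Int, MahoutVector)], k: Int): RDD[Int] = {
      val data = drmRowsToMllib(drmRows)
      KMeans.train(data, k, 10).predict(data) // 10 iterations, arbitrary
    }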
Re: Codebase refactoring proposal
I completely understand that MLlib lacks anything like the completeness of Mahout's DSL; I know of no other scalable solution to match. I don't know how many times this has to be said. This is something we can all get behind as *unique* to Mahout.

But I stand by the statement that there should also be some lower-level data commonality. There is too much similarity to dismiss it and go completely non-overlapping ways. Even if you can argue for maintaining separate parallel ways, let's have some type conversions (I hesitate to say easy to use). They shouldn't be all that hard.

A conversion of a DRM of o.a.m.Vector to an RDD of MLlib Vector and back would solve my Kmeans use case. You know MLlib better than I, so choose the best level to perform type conversions or inheritance splicing. The point is to make the two as seamless as possible. Doesn't this seem a worthy goal?

On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov wrote: [...]
Re: Codebase refactoring proposal
Pat,

I *just* made a case in this thread explaining that mllib does not have a single distributed matrix type, and that its own methodologies do not interoperate within itself for that reason. Therefore, it is fundamentally impossible to be interoperable with mllib, since nobody can really define what that means in terms of distributed types.

You are in fact referring to their in-core type, not a distributed type. But there's no linear algebra operation support to speak of there either. It is, simply, not algebra at the moment. The types in this hierarchy are just memory storage models, plus private-scope converters to Breeze storage models, but they are not true linalg APIs nor providers of such.

One might conceivably want to standardize on the Breeze APIs, since those are both a linalg API and providers, but not the type you've been mentioning.

However, that is not a very happy path either. Breeze is a somewhat more interesting substrate to build in-core operations on, but if you read the Spark forum of late, even Spark developers express a whiff of dissatisfaction with it in favor of BIDMat (me too, btw). But while they say BIDMat would be a better choice for in-core operators, they also recognize that they are too invested in the Breeze API by now, and such a move would not be cheap across the board.

And that demonstrates another problem in the in-core mllib architecture: on one side, they don't have a sufficient public in-core DSL or API to speak of; but they also do not have a sufficiently abstract API for in-core BLAS plugins, so they cannot be truly agnostic of the available in-core methodologies.

So what you are talking about is simply not possible with the current state of things there. But if it were, I'd just suggest you try to port the algebraic things you like in Mahout to mllib.

My guess, however, is that you'd find that porting the algebraic optimizer with a proper level of consistency with in-core operations will not be easy, for reasons including, but not limited to, the ones I just mentioned; although individual BLAS pieces like the matrix square you've mentioned would be fairly easy to do for one of the distributed matrix types in mllib. But that of course would be neither an R-like environment nor an optimizer.

I like BIDMat a lot, though; but it is not a truly hybrid and self-adjusting environment for in-core operations either (and its DSL is neither R-like nor Matlab-like, so it takes a bit of adjusting to). For that reason, even the BIDMat linalg types and DSL are not versatile enough for our (well, my anyway) purposes, which are to find the best hardware or software subroutine automatically given the current hardware and software platform architecture and the parameters of the requested operation.

On Feb 8, 2015 9:05 AM, "Pat Ferrel" wrote: [...]
Re: Codebase refactoring proposal
Why aren’t we using linalg.Vector and its siblings? The same could be asked for linalg.Matrix. If we want to prune dependencies this would help, and it would also significantly increase interoperability.

Case-now: I have a real need to cluster items in a CF-type input matrix. The input matrix A’ has rows of items. I need to drop this into a sequence file and use Mahout’s hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of linalg.Vectors and use MLlib Kmeans. The conversion is not too bad, and maybe it could be helped with some implicit conversions mahout.Vector <-> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for Kmeans).

Case-possible: If we adopted linalg.Vector as the native format, and perhaps even linalg.Matrix, this would give immediate interoperability in some areas, including my specific need. It would significantly pare down dependencies not provided by the environment (Mahout-math). It would also support creating distributed computation methods that would work on both MLlib and Mahout datasets, addressing Gokhan’s question.

I looked at another “Case-now” possibility, which was to go all-MLlib for item similarity. I found that MLlib doesn’t have a transpose—“transpose, why would you want to do that?” Not even in the multiply forms A’A, A’B, AA’, all used in item and row similarity. That stopped me from looking deeper.

The strength and unique value of Mahout is the completeness of its generalized linear algebra DSL. But insistence on using Mahout-specific data types is also a barrier for Spark people adopting the DSL. Not having lower-level interoperability is a barrier both ways to mixing Mahout and MLlib—creating unnecessary either/or choices for devs.

On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov wrote: [...]
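The implicit shims Pat is hinting at might look something like this; a sketch only (VectorConversions is a hypothetical helper, the copies are dense and wasteful, and sparse vectors would need their own branch):

    import scala.language.implicitConversions
    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{Vector => MllibVector, Vectors => MllibVectors}

    object VectorConversions {
      // mahout.Vector -> linalg.Vector: copy elements into an mllib dense vector
      implicit def mahoutToMllib(v: MahoutVector): MllibVector =
        MllibVectors.dense((0 until v.size).map(i => v.get(i)).toArray)

      // linalg.Vector -> mahout.Vector: wrap the values in a Mahout dense vector
      implicit def mllibToMahout(v: MllibVector): MahoutVector =
        new DenseVector(v.toArray)
    }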
Re: Codebase refactoring proposal
On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan wrote:
> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to the Mahout codebase as it is
> proposed in #62?

IMO this can't go very well and very far (because of the engine specifics), but I'd be willing to see an experiment with simple things like map and reduce. The bigger questions are where exactly we'll have to stop (we can't abstract all capabilities out there because of "common denominator" issues), and what percentage of methods this will truly allow to migrate to full backend portability. And if after doing all this we still find ourselves writing engine-specific mixes, why bother? Wouldn't it be better to find a good, easy-to-replicate, incrementally developed pattern to register and apply engine-specific strategies for every method?

> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?

This is not quite what I am proposing. Rather, engine-ml modules holding engine-specific _parts_ of an algorithm. However, this really needs a POC over a guinea pig (similarly to how we POC'd algebra in the first place, with ssvd and spca).
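A rough sketch of the strategy-registration pattern floated above, with all names hypothetical (this is not an existing Mahout API): quasi-algebraic methods code against a small trait, and each backend module registers its own implementation.

    import scala.collection.mutable
    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical contract for the non-algebraic pieces a method needs.
    trait EngineStrategy {
      def allReduceBlock[K](drm: DrmLike[K])(f: Matrix => Matrix): Matrix
    }

    object EngineStrategies {
      private val registry = mutable.Map.empty[String, EngineStrategy]

      // Each backend module (spark, h2o, flink, ...) registers itself once at load time.
      def register(engine: String, impl: EngineStrategy): Unit =
        registry(engine) = impl

      def forEngine(engine: String): EngineStrategy =
        registry.getOrElse(engine, sys.error(s"no strategy registered for '$engine'"))
    }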
Re: Codebase refactoring proposal
From my own perspective: I'm not aware of any rule that all operations must be agnostic. In fact, several engine-specific exceptions are discussed in this long email. We've talked about reduce or join operations that would be difficult to make agnostic without a lot of knowledge of ALL the other engines. Unless or until we get contributors from those engines reviewing commits, why put this burden on all of us? The agnostic DSL was for linear algebra ops, not all distributed computation methods; we aren't building a generic engine, only engine-agnostic algebra.

You have added stubs in H2O for the distributed aggregations. This seems fine, but I wouldn't vote to require that. If GSGD requires further use of Spark-specific operations, so be it. This means that GSGD may live in the Spark module, with any algebra bits required added to math-scala. Does anyone have a problem with that?

My vote on #62—ship it.

On the point of interoperability with MLlib, we still need to talk about that, but in another email.

On Feb 5, 2015, at 1:14 AM, Gokhan Capan wrote: [...]
Re: Codebase refactoring proposal
What I am saying is that for certain algorithms including both engine-specific (such as aggregation) and DSL stuff, what is the best way of handling them?

i) should we add the distributed operations to the Mahout codebase, as proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i has the advantage that an ML algorithm is written once and can then be run on alternative engines, but it requires wrapping/duplicating existing distributed operations. Picking ii has the advantage of avoiding writing distributed operations, but since we're mixing the DSL and the engine-specific stuff, an ML algorithm written for one engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov wrote: [...]
Re: Codebase refactoring proposal
I took it Gokhan had objections himself, based on his comments, if we are talking about #62.

He also expressed concerns about computing GSGD, but I suspect it can still be algebraically computed.

On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel wrote: [...]
Re: Codebase refactoring proposal
BTW Ted and Andrew have both expressed interest in the distributed aggregation stuff. It sounds like we are agreeing that non-algebra, computation-method-type things can be engine specific.

So does anyone have an objection to Gokhan pushing his PR?

On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov wrote: [...]
Re: Codebase refactoring proposal
On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo wrote:
> My thought was not to bring primitive engine-specific aggregators,
> combiners, etc. into math-scala.

Yeah. +1. I would like to support that as an experiment, see where it goes. Clearly some distributed use cases are simple enough while also pervasive enough.
Re: Codebase refactoring proposal
But also keep in mind that the Flink folks are eager to allocate resources for ML work. So maybe that's the way to work it -- create a DataFrame-based seq2sparse port and then just hand it off to them, to add either to Flink directly (but with DRM output) or as a part of Mahout.

On Wed, Feb 4, 2015 at 2:07 PM, Dmitriy Lyubimov wrote: [...]
Re: Codebase refactoring proposal
Spark's DataFrame is obviously not agnostic.

I don't believe there's a good way to abstract it, unfortunately. I think getting too deep into distributed-operation abstraction is a bit dangerous.

I think MLI was one project that attempted to do that -- but it did not take off, I guess. Or at least there were 0 commits in like 18 months there, if I am not mistaken, and it never made it into the Spark tree.

So it is a good question: if we need a dataframe in Flink, what do we do? I am open to suggestions. I very much don't want to do a "yet another abstract language-integrated Spark SQL" feature.

Given resources, IMO it'd be better to take on fewer goals but make them shine. So I'd do the Spark-based seq2sparse version first, and that'd give some ideas on how to create ports/abstractions of that work for Flink.

On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo wrote: [...]
Re: Codebase refactoring proposal
On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
> That's why I am a bit skeptical about attempts to abstract non-algebraic
> primitives such as row-wise aggregators in one of the pull requests.
> Engine-specific primitives and algebra can perfectly co-exist in the guts. [...]

I am **definitely** not advocating messing with the algebraic optimizer. That was what I saw as the plus side of Gokhan's PR: a separate engine abstraction for quasi/non-algebraic distributed methods. I didn't comment on the PR either, because admittedly I did not have a chance to spend a lot of time on it. But my quick takeaway was that we could take some very useful and hopefully (close to) ubiquitous distributed operators and pass them through to the engine "guts".

I briefly looked through some of the Flink and h2o code and noticed Flink's aggregateOperator [1] and h2o's MapReduce API [2]; my thought was that we could write pass-through operators for some of the more useful operations in math-scala and then implement them fully in their respective packages. Though I am not sure how this would work in either case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or Flink for that matter. Again, I haven't had a lot of time to look at these and see if this would work at all.

My thought was not to bring primitive engine-specific aggregators, combiners, etc. into math-scala.

I had thought, though, that we were trying to develop a fully engine-agnostic algorithm library on top of the R-like distributed BLAS.

So would the idea be to implement e.g. seq2sparse fully in the spark module? It would seem to fracture the project a bit.

Or to implement algorithms sequentially if mapBlock() will not suffice, and then optimize them in their respective modules?
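For comparison, the engine-agnostic route Andrew mentions already looks like this in math-scala; a minimal sketch using the existing mapBlock() operator (the row normalization is just an illustrative payload, and the in-place vector ops reflect my reading of the R-like bindings):

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Per-block custom code that stays engine agnostic: scale each row to unit length.
    def rowNormalize(drmA: DrmLike[Int]): DrmLike[Int] =
      drmA.mapBlock() { case (keys, block) =>
        for (r <- 0 until block.nrow) {
          val norm = block(r, ::).norm(2)
          if (norm > 0) block(r, ::) /= norm
        }
        keys -> block
      }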
Re: Codebase refactoring proposal
Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there, since they go beyond the scope of that PR's work to chase the root of the issue.

On quasi-algebraic methods:

What is the dilemma here? I don't see any.

I already explained that no more than 25% of algorithms are truly 100% algebraic. But about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff).

So we are building a system that allows us to cut a developer's work by at least 60% and make that work also more readable by 3000%. As far as I am concerned, that fulfills the goal. And I am perfectly happy writing a mix of engine-specific primitives and algebra.

That's why I am a bit skeptical about attempts to abstract non-algebraic primitives such as row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly co-exist in the guts. And that's how I am doing my stuff in practice, except I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.

None of that means that R-like algebra cannot be engine agnostic. So people are unhappy about not being able to write the whole thing in a totally agnostic way? And so they (falsely) infer that the pieces of their work cannot be helped by agnosticism individually, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there.

We proved algebra can be agnostic. I don't think this notion should be disputed.

And even if there were a shred of real benefit in making the algebra tools un-agnostic, it would never outweigh the tons of good we could get for the project by integrating with e.g. the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, rather than just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.

Well yes, methodology implementations will still have native distributed calls. Just not nearly as many as they otherwise would, and they will be much easier to support on another back-end using Strategy patterns. E.g. the implicit-feedback problem that I originally wrote as a quasi-method for Spark only would've taken just an hour or so to add a strategy for Flink, since it retains all the in-core and distributed algebra work as is. Not to mention the benefit of single-type pipelining. And once we add hardware-accelerated bindings for in-core stuff, all these methods would immediately benefit from them.

On the MLLib interoperability issues -- well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself? E.g. I remember there was one most frequent request on the list here: how can we cluster dimensionally-reduced data? Let's look at what it takes to do this in MLLib. First, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure); then we'd have to run svd or pca, both of which accept a RowMatrix (bummer! but we have a collection of vectors); which would produce a RowMatrix as well, but kmeans training takes an RDD of vectors (bummer again!). Not directly pluggable, although semi-trivially or trivially convertible. Plus it strips off information that we potentially have already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.

Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should be taking input in the form of matrices (which my feature-extraction algebraic pipeline perhaps has just prepared), but it really takes POJOs. Bummer again.

So what exactly should we be interoperable with in this picture, if MLLib itself is not consistent? Let's look at the type system in flux there: we have (1) a collection of vectors, (2) a matrix of known dimensions over a collection of vectors (RowMatrix), (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that can be _only_ long; and (4) an unknown but not infinitesimal number of POJO-oriented approaches. But ok, let's constrain ourselves to matrix types only. The multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD or PCA or tf-idf, well demonstrated in the case of mllib). In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in none at all. As in a hard, no-workaround-exists stop.

So: there's truly no need for multiple incompatible matrix types. There has to be just a single matrix type. Just a flexible one. And everything algebraic needs to use it. And if geometry is needed, then it can be either already known or lazily computed, but if it is not needed, nobody bothers to compute it (truly no need). And this knowledge should not be lost just because we have to convert between types.
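To make the tf-idf -> pca -> kmeans friction concrete, here is a sketch against the MLlib APIs of that era (signatures from memory, so treat them as approximate): every stage wants a different type, and any row keys are gone after the first conversion.

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def reduceThenCluster(tfidf: RDD[Vector], k: Int): KMeansModel = {
      // RDD[Vector] -> RowMatrix for PCA (document ids, if we had them, are lost here)
      val mat = new RowMatrix(tfidf)
      // PCA projection yields another RowMatrix...
      val reduced = mat.multiply(mat.computePrincipalComponents(10))
      // ...but kmeans wants an RDD[Vector] again, so unwrap once more
      KMeans.train(reduced.rows, k, 10)
    }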
Re: Codebase refactoring proposal
[INFO] | | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO] | | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] | | | | | | +- asm:asm:jar:3.1:compile
[INFO] | | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] | | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] | | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] | | | | | | | \- stax:stax-api:jar:1.0.1:compile
[INFO] | | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] | | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] | | | | | | |    \- javax.activation:activation:jar:1.1:compile
[INFO] | | | | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] | | | | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] | | | | | \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] | | +- commons-codec:commons-codec:jar:1.3:compile
[INFO] | | \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] | | +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] | |    \- jline:jline:jar:0.9.94:compile
[INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] | | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] | | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] | |    \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] | |       \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] | |    \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] | +- com.google.guava:guava:jar:16.0:compile
d

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov wrote:

Looks like it is also requested by mahout-math; wonder what is using it there.

At the very least, it needs to be synchronized to the version currently used by Spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] | +- com.google.guava:guava:jar:16.0:compile*
[INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile

On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote:

Looks like Guava is in Spark.

On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote:

IndexedDataset uses Guava. Can't tell for sure, but it sounds like it would not be included, since I think it was taken from the mrlegacy jar.

On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote:

-- Forwarded message --
From: "Pat Ferrel"
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To:
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it you are putting some class jars somewhere in the classpath. Where? How?

/bin/mahout

(It computes 2 different classpaths; see 'bin/mahout classpath' vs. 'bin/mahout -spark'.)

If I interpret the current shell code there correctly, the legacy path tries to use the examples assemblies if not packaged, or /lib if packaged. The true motivation of that significantly predates 2010, and I suspect only Benson knows the whole intent there.

The spark path, which is really a quick hack of the script, tries to pick up only selected Mahout jars plus the locally installed Spark classpath, which I guess is just the shaded Spark jar in recent Spark releases. It also apparently tries to include /libs/*, which is never compiled in the unpackaged version; I now think it is a bug that it is included, because /libs/* is apparently legacy packaging and shouldn't be used in Spark jobs with a wildcard. I can't believe how lazy I am; I still have not found time to understand the Mahout build in all cases.

I am not even sure that packaged Mahout will work with Spark, honestly, because of the /lib. I have never tried that, since I mostly use application embedding techniques.

The same solution may apply to adding external dependencies and removing the assembly in the Spark module. Which would leave only one major build issue afaik.

On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote:

No, no PR. Only an experiment on a private branch. But I believe I have defined what I want to do in sufficient detail to gauge whether we may want to advance it some time later. The goal is a much lighter dependency footprint for the Spark code: eliminate everything that is not a compile-time dependency (and a lot of it comes through legacy MR code, which we of course don't use).

I can't say I understand the remaining issues you are talking about, though. If you are talking about compiling lib or a shaded assembly, no, this doesn't do anything about that. Although the point is that, as it stands, the algebra and shell have no external dependencies but Spark and these 4 (5?) Mahout jars, so they technically don't even need an assembly (as demonstrated). As I said, it seems driver code is the only one that may need some external dependencies, but that's a different scenario from those I am talking about. But I am relatively happy with having the first two working nicely at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

+1

Is there a PR? You mention a "tiny mahout-hadoop" module. It would be nice to see how you've structured that, in case we can use the same model to solve the two remaining refactoring issues:

1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn wrote:

Also +1

iPhone'd

On Jan 23, 2015, at 18:38, Andrew Palumbo wrote:

+1

Sent from my Verizon Wireless 4G LTE smartphone

Original message
From: Dmitriy Lyubimov
Date: 01/23/2015 6:06 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Codebase refactoring proposal

So right now mahout-spark depends on mr-legacy. I did a quick refactoring, and it turns out it only _irrevocably_ depends on the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ... *sigh* ... o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop module (to signify stuff that is directly relevant to serializing things to the DFS API) and completely removed mr-legacy and its transitive dependencies from the spark and spark-shell module dependencies.

So non-CLI applications (shell scripts and embedded API use) actually need only the Spark dependencies (which come from the SPARK_HOME classpath, of course) and the Mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop, and optionally mahout-spark-shell for running the shell).

This of course still doesn't address driver problems that want to throw more stuff onto the front-end classpath (such as the CLI parser), but at least it renders the transitive luggage of mr-legacy (and the size of the worker-shipped jars) much more tolerable.

How does that sound?
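For the embedding technique mentioned above, the minimal setup presumably looks something like this sketch; the jar paths, the master URL, and the SparkDistributedContext wrapper name are assumptions, not a verified recipe:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings.SparkDistributedContext

// Hypothetical install paths; only these four jars get shipped to workers.
val mahoutJars = Seq(
  "/opt/mahout/mahout-math-1.0-SNAPSHOT.jar",
  "/opt/mahout/mahout-math-scala-1.0-SNAPSHOT.jar",
  "/opt/mahout/mahout-hadoop-1.0-SNAPSHOT.jar",
  "/opt/mahout/mahout-spark-1.0-SNAPSHOT.jar")

val conf = new SparkConf()
  .setMaster("spark://host:7077") // hypothetical master URL
  .setAppName("embedded-mahout")
  .setJars(mahoutJars) // everything else comes from the SPARK_HOME classpath

val sc = new SparkContext(conf)
implicit val ctx = new SparkDistributedContext(sc) // context the algebra DSL runs against

No assembly and no mr-legacy transitives are involved; that is the whole point of trimming the dependency down to the four (five) jars.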
Re: Codebase refactoring proposal
| | | | | | \- > >>>>> javax.activation:activation:jar:1.1:compile > >>>>> [INFO] | | | | | | +- > >>>>> org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile > >>>>> [INFO] | | | | | | \- > >>>>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile > >>>>> [INFO] | | | | | \- > >>>>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile > >>>>> [INFO] | | | | \- > >>>>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile > >>>>> [INFO] | | | \- > >>>>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile > >>>>> [INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile > >>>>> [INFO] | | +- > >>>>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile > >>>>> [INFO] | | | \- > >>> org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile > >>>>> [INFO] | | +- > >>>>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile > >>>>> [INFO] | | \- > org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile > >>>>> [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile > >>>>> [INFO] | | +- commons-codec:commons-codec:jar:1.3:compile > >>>>> [INFO] | | \- commons-httpclient:commons-httpclient:jar:3.1:compile > >>>>> [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile > >>>>> [INFO] | | +- > org.apache.curator:curator-framework:jar:2.4.0:compile > >>>>> [INFO] | | | \- > org.apache.curator:curator-client:jar:2.4.0:compile > >>>>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile > >>>>> [INFO] | | \- jline:jline:jar:0.9.94:compile > >>>>> [INFO] | +- > org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | +- > >>>>> > >>> > org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile > >>>>> [INFO] | | +- > >>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | | +- > >>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | | \- > >>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | \- > >>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | \- > >>>>> > >>>>> > >>> > org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile > >>>>> [INFO] | |\- > >>>>> > >>> > org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile > >>>>> [INFO] | +- > >>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile > >>>>> [INFO] | +- > org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile > >>>>> [INFO] | +- > >>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | +- > >>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile > >>>>> [INFO] | | +- > >>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | \- > >>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile > >>>>> [INFO] | | \- > >>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile > >>>>> [INFO] | +- com.google.guava:guava:jar:16.0:compile > >>>>> d > >>>>> > >>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov > > >>>>> wrote: > >>>>> > >>>>>> looks like it is also requested by mahout-math, wonder what is using > >>> it > >>>>>> there. > >>>>>> > >>>>>> At very least, it needs to be synchronized to the one currently used > >>> by > >>>>>> spark. > >>>>>> > >>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ > >>> mahout-hadoop > >>>>>> --- > >>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT > >>>>>> *[INFO] +- org.apache.mahout:mahout-m
Re: Codebase refactoring proposal
t;> [INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile >>>>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile >>>>> [INFO] | | \- jline:jline:jar:0.9.94:compile >>>>> [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile >>>>> [INFO] | | +- >>>>> >>> org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile >>>>> [INFO] | | +- >>> org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile >>>>> [INFO] | | | +- >>> org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile >>>>> [INFO] | | | \- >>>>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile >>>>> [INFO] | | \- >>> org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile >>>>> [INFO] | | \- >>>>> >>>>> >>> org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile >>> >>>>> [INFO] | |\- >>>>> >>> org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile >>>>> [INFO] | +- >>> org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile >>>>> [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile >>>>> [INFO] | +- >>> org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile >>>>> [INFO] | | +- >>>>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile >>>>> [INFO] | | +- >>>>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile >>>>> [INFO] | | \- >>> org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile >>>>> [INFO] | | \- >>> org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile >>>>> [INFO] | +- com.google.guava:guava:jar:16.0:compile >>>>> d >>>>> >>>>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov >>>>> wrote: >>>>> >>>>>> looks like it is also requested by mahout-math, wonder what is using >>> it >>>>>> there. >>>>>> >>>>>> At very least, it needs to be synchronized to the one currently used >>> by >>>>>> spark. >>>>>> >>>>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ >>> mahout-hadoop >>>>>> --- >>>>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT >>>>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* >>>>>> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile >>>>>> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* >>>>>> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile >>>>>> [INFO] +- >>> org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test >>>>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >>>>>> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >>>>>> >>>>>> >>>>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel >>>>> wrote: >>>>>>> Looks like Guava is in Spark. >>>>>>> >>>>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel >>> wrote: >>>>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like >>> this >>>>>>> would not be included since I think it was taken from the mrlegacy >>> jar. >>>>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov >>>>> wrote: >>>>>>> -- Forwarded message -- >>>>>>> From: "Pat Ferrel" >>>>>>> Date: Jan 25, 2015 9:39 AM >>>>>>> Subject: Re: Codebase refactoring proposal >>>>>>> To: >>>>>>> Cc: >>>>>>> >>>>>>>> When you get a chance a PR would be good. >>>>>>> Yes, it would. And not just for that. >>>>>>> >>>>>>>> As I understand it you are putting some class jars somewhere in the >>>>>>> classpath. Where? How? >>>>>>> /bin/mahout >>>>>>> >>>>>>> (Computes 2 different classpaths. See 'bin/mahout classpath' vs. >>>>>
Re: Codebase refactoring proposal
c:jar:1.3:compile [INFO] | | \- commons-httpclient:commons-httpclient:jar:3.1:compile [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile [INFO] | | +- org.apache.curator:curator-framework:jar:2.4.0:compile [INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile [INFO] | | \- jline:jline:jar:0.9.94:compile [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile [INFO] | | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile [INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile [INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile [INFO] | |\- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile [INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- com.google.guava:guava:jar:16.0:compile d On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov wrote: looks like it is also requested by mahout-math, wonder what is using it there. At very least, it needs to be synchronized to the one currently used by spark. [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop --- [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile *[INFO] | +- com.google.guava:guava:jar:16.0:compile* [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote: Looks like Guava is in Spark. On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote: IndexedDataset uses Guava. Can’t tell from sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: ------ Forwarded message ------ From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: When you get a chance a PR would be good. Yes, it would. And not just for that. As I understand it you are putting some class jars somewhere in the classpath. Where? How? /bin/mahout (Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.) If i interpret current shell code there correctky, legacy path tries to use examples assemblies if not packaged, or /lib if packaged. True motivation of that significantly predates 2010 and i suspect only Benson knows whole true intent there. 
The spark path, which is really a quick hack of the script, tries to get only selected mahout jars and locally instlalled spark classpath which i guess is just the shaded spark jar in recent spark releases. It also apparently tries to include /libs/*, which is never compiled in unpackaged version, and now i think it is a bug it is included because /libs/* is apparently legacy packaging, and shouldnt be used in spark jobs with a wildcard. I cant beleive how lazy i am, i still did not find time to understand mahout build in all cases. I am not even sure if packaged mahout will work with spark, honestly, because of the /lib. Never tried that, since i mostly use application embedding techniques. The same solution may apply to adding external dependencies and removing the assembly in the Spark module. Which would leave only one major build issue afaik. On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote: No, no PR. Only experiment on private. But i believe i sufficiently defined what i want to do in order to gauge if we may want to advance it some time later. Goal is much lighter dependency for spark code. Eliminate everything that is not compile-time dependent. (and a lot of it is thru legacy MR code which we of course don't use). Cant say i understand the remaining issues you are talking about though. If you are talking about compiling lib or shaded assembly, no, this doesn't do anything about it. Although point is, as
Re: Codebase refactoring proposal
etty:jetty-webapp:jar:8.1.14.v20131031:compile [INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile [INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile [INFO] | |\- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile [INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- com.google.guava:guava:jar:16.0:compile d On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov wrote: looks like it is also requested by mahout-math, wonder what is using it there. At very least, it needs to be synchronized to the one currently used by spark. [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop --- [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile *[INFO] | +- com.google.guava:guava:jar:16.0:compile* [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote: Looks like Guava is in Spark. On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote: IndexedDataset uses Guava. Can’t tell from sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: -- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: When you get a chance a PR would be good. Yes, it would. And not just for that. As I understand it you are putting some class jars somewhere in the classpath. Where? How? /bin/mahout (Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.) If i interpret current shell code there correctky, legacy path tries to use examples assemblies if not packaged, or /lib if packaged. True motivation of that significantly predates 2010 and i suspect only Benson knows whole true intent there. The spark path, which is really a quick hack of the script, tries to get only selected mahout jars and locally instlalled spark classpath which i guess is just the shaded spark jar in recent spark releases. It also apparently tries to include /libs/*, which is never compiled in unpackaged version, and now i think it is a bug it is included because /libs/* is apparently legacy packaging, and shouldnt be used in spark jobs with a wildcard. I cant beleive how lazy i am, i still did not find time to understand mahout build in all cases. I am not even sure if packaged mahout will work with spark, honestly, because of the /lib. Never tried that, since i mostly use application embedding techniques. 
The same solution may apply to adding external dependencies and removing the assembly in the Spark module. Which would leave only one major build issue afaik. On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote: No, no PR. Only experiment on private. But i believe i sufficiently defined what i want to do in order to gauge if we may want to advance it some time later. Goal is much lighter dependency for spark code. Eliminate everything that is not compile-time dependent. (and a lot of it is thru legacy MR code which we of course don't use). Cant say i understand the remaining issues you are talking about though. If you are talking about compiling lib or shaded assembly, no, this doesn't do anything about it. Although point is, as it stands, the algebra and shell don't have any external dependencies but spark and these 4 (5?) mahout jars so they technically don't even need an assembly (as demonstrated). As i said, it seems driver code is the only one that may need some external dependencies, but that's a different scenario from those i am talking about. But i am relatively happy with having the first two working nicely at this point. On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel wrote: +1 Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice to see how you’ve structured that i
Re: Codebase refactoring proposal
haus.jackson:jackson-jaxrs:jar:1.8.3:compile >>> [INFO] | | | | | | \- >>> org.codehaus.jackson:jackson-xc:jar:1.8.3:compile >>> [INFO] | | | | | \- >>> com.sun.jersey.contribs:jersey-guice:jar:1.9:compile >>> [INFO] | | | | \- >>> org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile >>> [INFO] | | | \- >>> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile >>> [INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile >>> [INFO] | | +- >>> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile >>> [INFO] | | | \- > org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile >>> [INFO] | | +- >>> org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile >>> [INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile >>> [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile >>> [INFO] | | +- commons-codec:commons-codec:jar:1.3:compile >>> [INFO] | | \- commons-httpclient:commons-httpclient:jar:3.1:compile >>> [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile >>> [INFO] | | +- org.apache.curator:curator-framework:jar:2.4.0:compile >>> [INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile >>> [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile >>> [INFO] | | \- jline:jline:jar:0.9.94:compile >>> [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile >>> [INFO] | | +- >>> > org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile >>> [INFO] | | +- > org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile >>> [INFO] | | | +- > org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile >>> [INFO] | | | \- >>> org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile >>> [INFO] | | \- > org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile >>> [INFO] | | \- >>> >>> > org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile >>> [INFO] | |\- >>> > org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile >>> [INFO] | +- > org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile >>> [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile >>> [INFO] | +- > org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile >>> [INFO] | | +- >>> org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile >>> [INFO] | | +- >>> org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile >>> [INFO] | | \- > org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile >>> [INFO] | | \- > org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile >>> [INFO] | +- com.google.guava:guava:jar:16.0:compile >>> d >>> >>> On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov >>> wrote: >>> >>>> looks like it is also requested by mahout-math, wonder what is using > it >>>> there. >>>> >>>> At very least, it needs to be synchronized to the one currently used > by >>>> spark. 
>>>> >>>> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ > mahout-hadoop >>>> --- >>>> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT >>>> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* >>>> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile >>>> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* >>>> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile >>>> [INFO] +- > org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test >>>> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile >>>> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >>>> >>>> >>>> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel >>> wrote: >>>> >>>>> Looks like Guava is in Spark. >>>>> >>>>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel > wrote: >>>>> >>>>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like > this >>>>> would not be included since I think it was taken from the mrlegacy > jar. >>>>> >>>>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov >>> wrote: >>>>> >>>>> -- Forwarded message
Re: Codebase refactoring proposal
pile > > [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile > > [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile > > [INFO] | | +- > > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile > > [INFO] | | +- > > org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile > > [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile > > [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile > > [INFO] | +- com.google.guava:guava:jar:16.0:compile > > d > > > > On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov > > wrote: > > > >> looks like it is also requested by mahout-math, wonder what is using it > >> there. > >> > >> At very least, it needs to be synchronized to the one currently used by > >> spark. > >> > >> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop > >> --- > >> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT > >> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* > >> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile > >> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* > >> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile > >> [INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test > >> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile > >> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile > >> > >> > >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel > > wrote: > >> > >>> Looks like Guava is in Spark. > >>> > >>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote: > >>> > >>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this > >>> would not be included since I think it was taken from the mrlegacy jar. > >>> > >>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov > > wrote: > >>> > >>> -- Forwarded message -- > >>> From: "Pat Ferrel" > >>> Date: Jan 25, 2015 9:39 AM > >>> Subject: Re: Codebase refactoring proposal > >>> To: > >>> Cc: > >>> > >>>> When you get a chance a PR would be good. > >>> > >>> Yes, it would. And not just for that. > >>> > >>>> As I understand it you are putting some class jars somewhere in the > >>> classpath. Where? How? > >>>> > >>> > >>> /bin/mahout > >>> > >>> (Computes 2 different classpaths. See 'bin/mahout classpath' vs. > >>> 'bin/mahout -spark'.) > >>> > >>> If i interpret current shell code there correctky, legacy path tries to > >>> use > >>> examples assemblies if not packaged, or /lib if packaged. True > > motivation > >>> of that significantly predates 2010 and i suspect only Benson knows > > whole > >>> true intent there. > >>> > >>> The spark path, which is really a quick hack of the script, tries to get > >>> only selected mahout jars and locally instlalled spark classpath which i > >>> guess is just the shaded spark jar in recent spark releases. It also > >>> apparently tries to include /libs/*, which is never compiled in > > unpackaged > >>> version, and now i think it is a bug it is included because /libs/* is > >>> apparently legacy packaging, and shouldnt be used in spark jobs with a > >>> wildcard. I cant beleive how lazy i am, i still did not find time to > >>> understand mahout build in all cases. > >>> > >>> I am not even sure if packaged mahout will work with spark, honestly, > >>> because of the /lib. Never tried that, since i mostly use application > >>> embedding techniques. > >>> > >>> The same solution may apply to adding external dependencies and removing > >>> the assembly in the Spark module. 
Which would leave only one major build > >>> issue afaik. > >>>> > >>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov > >>> wrote: > >>>> > >>>> No, no PR. Only experiment on private. But i believe i sufficiently > >>> defined > >>>> what i want to do in order to gauge if we may want to advance it some > >>> time > >>>> later. Goal is much light
Re: Codebase refactoring proposal
.0:compile > > > [INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile > > > [INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile > > > [INFO] | | \- jline:jline:jar:0.9.94:compile > > > [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile > > > [INFO] | | +- > > > > org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile > > > [INFO] | | +- > org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile > > > [INFO] | | | +- > org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile > > > [INFO] | | | \- > > > org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile > > > [INFO] | | \- > org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile > > > [INFO] | | \- > > > > > > > org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile > > > [INFO] | |\- > > > > org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile > > > [INFO] | +- > org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile > > > [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile > > > [INFO] | +- > org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile > > > [INFO] | | +- > > > org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile > > > [INFO] | | +- > > > org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile > > > [INFO] | | \- > org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile > > > [INFO] | | \- > org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile > > > [INFO] | +- com.google.guava:guava:jar:16.0:compile > > > d > > > > > > On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov > > > wrote: > > > > > >> looks like it is also requested by mahout-math, wonder what is using > it > > >> there. > > >> > > >> At very least, it needs to be synchronized to the one currently used > by > > >> spark. > > >> > > >> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ > mahout-hadoop > > >> --- > > >> [INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT > > >> *[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile* > > >> [INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile > > >> *[INFO] | +- com.google.guava:guava:jar:16.0:compile* > > >> [INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile > > >> [INFO] +- > org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test > > >> [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile > > >> [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile > > >> > > >> > > >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel > > > wrote: > > >> > > >>> Looks like Guava is in Spark. > > >>> > > >>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel > wrote: > > >>> > > >>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like > this > > >>> would not be included since I think it was taken from the mrlegacy > jar. > > >>> > > >>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov > > > wrote: > > >>> > > >>> -- Forwarded message -- > > >>> From: "Pat Ferrel" > > >>> Date: Jan 25, 2015 9:39 AM > > >>> Subject: Re: Codebase refactoring proposal > > >>> To: > > >>> Cc: > > >>> > > >>>> When you get a chance a PR would be good. > > >>> > > >>> Yes, it would. And not just for that. > > >>> > > >>>> As I understand it you are putting some class jars somewhere in the > > >>> classpath. Where? How? > > >>>> > > >>> > > >>> /bin/mahout > > >>> > > >>> (Computes 2 different classpaths. See 'bin/mahout classpath' vs. > > >>> 'bin/mahout -spark'.) 
> > >>> > > >>> If i interpret current shell code there correctky, legacy path tries > to > > >>> use > > >>> examples assemblies if not packaged, or /lib if packaged. True > > > motivation > > >>> of that significantly predates 2010 and i suspect only Benson knows > > > whole > > >>> true intent
Re: Codebase refactoring proposal
ell from sure but it sounds like this would not be included since I think it was taken from the mrlegacy jar. On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote: -- Forwarded message -- From: "Pat Ferrel" Date: Jan 25, 2015 9:39 AM Subject: Re: Codebase refactoring proposal To: Cc: When you get a chance a PR would be good. Yes, it would. And not just for that. As I understand it you are putting some class jars somewhere in the classpath. Where? How? /bin/mahout (Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.) If i interpret current shell code there correctky, legacy path tries to use examples assemblies if not packaged, or /lib if packaged. True motivation of that significantly predates 2010 and i suspect only Benson knows whole true intent there. The spark path, which is really a quick hack of the script, tries to get only selected mahout jars and locally instlalled spark classpath which i guess is just the shaded spark jar in recent spark releases. It also apparently tries to include /libs/*, which is never compiled in unpackaged version, and now i think it is a bug it is included because /libs/* is apparently legacy packaging, and shouldnt be used in spark jobs with a wildcard. I cant beleive how lazy i am, i still did not find time to understand mahout build in all cases. I am not even sure if packaged mahout will work with spark, honestly, because of the /lib. Never tried that, since i mostly use application embedding techniques. The same solution may apply to adding external dependencies and removing the assembly in the Spark module. Which would leave only one major build issue afaik. On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote: No, no PR. Only experiment on private. But i believe i sufficiently defined what i want to do in order to gauge if we may want to advance it some time later. Goal is much lighter dependency for spark code. Eliminate everything that is not compile-time dependent. (and a lot of it is thru legacy MR code which we of course don't use). Cant say i understand the remaining issues you are talking about though. If you are talking about compiling lib or shaded assembly, no, this doesn't do anything about it. Although point is, as it stands, the algebra and shell don't have any external dependencies but spark and these 4 (5?) mahout jars so they technically don't even need an assembly (as demonstrated). As i said, it seems driver code is the only one that may need some external dependencies, but that's a different scenario from those i am talking about. But i am relatively happy with having the first two working nicely at this point. On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel wrote: +1 Is there a PR? You mention a "tiny mahout-hadoop” module. It would be nice to see how you’ve structured that in case we can use the same model to solve the two remaining refactoring issues. 1) external dependencies in the spark module 2) no spark or h2o in the release artifacts. On Jan 23, 2015, at 6:45 PM, Shannon Quinn wrote: Also +1 iPhone'd On Jan 23, 2015, at 18:38, Andrew Palumbo wrote: +1 Sent from my Verizon Wireless 4G LTE smartphone Original message ----From: Dmitriy Lyubimov Date:01/23/2015 6:06 PM (GMT-05:00) To: dev@mahout.apache.org Subject: Codebase refactoring proposal So right now mahout-spark depends on mr-legacy. I did quick refactoring and it turns out it only _irrevocably_ depends on the following classes there: MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ... 
*sigh* o.a.m.common.Pair So I just dropped those five classes into new a new tiny mahout-hadoop module (to signify stuff that is directly relevant to serializing thigns to DFS API) and completely removed mrlegacy and its transients from spark and spark-shell dependencies. So non-cli applications (shell scripts and embedded api use) actually only need spark dependencies (which come from SPARK_HOME classpath, of course) and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and optionally mahout-spark-shell (for running shell)). This of course still doesn't address driver problems that want to throw more stuff into front-end classpath (such as cli parser) but at least it renders transitive luggage of mr-legacy (and the size of worker-shipped jars) much more tolerable. How does that sound?
Re: Codebase refactoring proposal
- org.apache.hadoop:hadoop-common:jar:2.2.0:compile >> >> >> On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel > wrote: >> >>> Looks like Guava is in Spark. >>> >>> On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote: >>> >>> IndexedDataset uses Guava. Can’t tell from sure but it sounds like this >>> would not be included since I think it was taken from the mrlegacy jar. >>> >>> On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov > wrote: >>> >>> -- Forwarded message -- >>> From: "Pat Ferrel" >>> Date: Jan 25, 2015 9:39 AM >>> Subject: Re: Codebase refactoring proposal >>> To: >>> Cc: >>> >>>> When you get a chance a PR would be good. >>> >>> Yes, it would. And not just for that. >>> >>>> As I understand it you are putting some class jars somewhere in the >>> classpath. Where? How? >>>> >>> >>> /bin/mahout >>> >>> (Computes 2 different classpaths. See 'bin/mahout classpath' vs. >>> 'bin/mahout -spark'.) >>> >>> If i interpret current shell code there correctky, legacy path tries to >>> use >>> examples assemblies if not packaged, or /lib if packaged. True > motivation >>> of that significantly predates 2010 and i suspect only Benson knows > whole >>> true intent there. >>> >>> The spark path, which is really a quick hack of the script, tries to get >>> only selected mahout jars and locally instlalled spark classpath which i >>> guess is just the shaded spark jar in recent spark releases. It also >>> apparently tries to include /libs/*, which is never compiled in > unpackaged >>> version, and now i think it is a bug it is included because /libs/* is >>> apparently legacy packaging, and shouldnt be used in spark jobs with a >>> wildcard. I cant beleive how lazy i am, i still did not find time to >>> understand mahout build in all cases. >>> >>> I am not even sure if packaged mahout will work with spark, honestly, >>> because of the /lib. Never tried that, since i mostly use application >>> embedding techniques. >>> >>> The same solution may apply to adding external dependencies and removing >>> the assembly in the Spark module. Which would leave only one major build >>> issue afaik. >>>> >>>> On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov >>> wrote: >>>> >>>> No, no PR. Only experiment on private. But i believe i sufficiently >>> defined >>>> what i want to do in order to gauge if we may want to advance it some >>> time >>>> later. Goal is much lighter dependency for spark code. Eliminate >>> everything >>>> that is not compile-time dependent. (and a lot of it is thru legacy MR >>> code >>>> which we of course don't use). >>>> >>>> Cant say i understand the remaining issues you are talking about > though. >>>> >>>> If you are talking about compiling lib or shaded assembly, no, this >>> doesn't >>>> do anything about it. Although point is, as it stands, the algebra and >>>> shell don't have any external dependencies but spark and these 4 (5?) >>>> mahout jars so they technically don't even need an assembly (as >>>> demonstrated). >>>> >>>> As i said, it seems driver code is the only one that may need some >>> external >>>> dependencies, but that's a different scenario from those i am talking >>>> about. But i am relatively happy with having the first two working >>> nicely >>>> at this point. >>>> >>>> On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel >>> wrote: >>>> >>>>> +1 >>>>> >>>>> Is there a PR? You mention a "tiny mahout-hadoop” module. It would be >>> nice >>>>> to see how you’ve structured that in case we can use the same model to >>>>> solve the two remaining refactoring issues. 
>>>>> 1) external dependencies in the spark module >>>>> 2) no spark or h2o in the release artifacts. >>>>> >>>>> On Jan 23, 2015, at 6:45 PM, Shannon Quinn wrote: >>>>> >>>>> Also +1 >>>>> >>>>> iPhone'd >>>>> >>>>>> On Jan 23, 2015, at 18:38, Andrew Palumbo > wrote: >>>>>> >>>>>> +1 >>>>>> >>>>>> >>>>>> Sent from my Verizon Wireless 4G LTE smartphone >>>>>> >>>>>> Original message From: Dmitriy >>> Lyubimov >>>>> Date:01/23/2015 6:06 PM (GMT-05:00) >>>>> To: dev@mahout.apache.org Subject: Codebase >>>>> refactoring proposal >>>>>> >>>>>> So right now mahout-spark depends on mr-legacy. >>>>>> I did quick refactoring and it turns out it only _irrevocably_ > depends >>> on >>>>>> the following classes there: >>>>>> >>>>>> MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, > and >>>>> ... >>>>>> *sigh* o.a.m.common.Pair >>>>>> >>>>>> So I just dropped those five classes into new a new tiny >>> mahout-hadoop >>>>>> module (to signify stuff that is directly relevant to serializing >>> thigns >>>>> to >>>>>> DFS API) and completely removed mrlegacy and its transients from > spark >>>>> and >>>>>> spark-shell dependencies. >>>>>> >>>>>> So non-cli applications (shell scripts and embedded api use) actually >>>>> only >>>>>> need spark dependencies (which come from SPARK_HOME classpath, of >>> course) >>>>>> and mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop and >>>>>> optionally mahout-spark-shell (for running shell)). >>>>>> >>>>>> This of course still doesn't address driver problems that want to >>> throw >>>>>> more stuff into front-end classpath (such as cli parser) but at least >>> it >>>>>> renders transitive luggage of mr-legacy (and the size of >>> worker-shipped >>>>>> jars) much more tolerable. >>>>>> >>>>>> How does that sound? >>>>> >>>>> >>>> >>> >>> >>> >> > >
Re: Codebase refactoring proposal
org.glassfish:javax.servlet:jar:3.1:compile
[INFO] | | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] | | | | | | +- asm:asm:jar:3.1:compile
[INFO] | | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] | | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] | | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] | | | | | | | \- stax:stax-api:jar:1.0.1:compile
[INFO] | | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] | | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] | | | | | | |    \- javax.activation:activation:jar:1.1:compile
[INFO] | | | | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] | | | | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] | | | | | \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO] | +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO] | | +- commons-codec:commons-codec:jar:1.3:compile
[INFO] | | \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] | | +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] | | | \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] | | \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] | |    \- jline:jline:jar:0.9.94:compile
[INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] | | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] | | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] | |    \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] | |       \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] | |    \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO] | +- com.google.guava:guava:jar:16.0:compile

d

On Fri, Jan 30, 2015 at 4:03 PM, Dmitriy Lyubimov wrote:

> looks like it is also requested by mahout-math, wonder what is using it
> there.
>
> At very least, it needs to be synchronized to the one currently used by
> spark.
Re: Codebase refactoring proposal
Looks like it is also requested by mahout-math; wonder what is using it there.

At the very least, it needs to be synchronized to the one currently used by Spark.

[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ mahout-hadoop ---
[INFO] org.apache.mahout:mahout-hadoop:jar:1.0-SNAPSHOT
*[INFO] +- org.apache.mahout:mahout-math:jar:1.0-SNAPSHOT:compile*
[INFO] | +- org.apache.commons:commons-math3:jar:3.2:compile
*[INFO] | +- com.google.guava:guava:jar:16.0:compile*
[INFO] | \- com.tdunning:t-digest:jar:2.0.2:compile
[INFO] +- org.apache.mahout:mahout-math:test-jar:tests:1.0-SNAPSHOT:test
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile

On Fri, Jan 30, 2015 at 7:52 AM, Pat Ferrel wrote:

> Looks like Guava is in Spark.
Re: Codebase refactoring proposal
Looks like Guava is in Spark.

On Jan 29, 2015, at 4:03 PM, Pat Ferrel wrote:

> IndexedDataset uses Guava. Can't tell for sure, but it sounds like this
> would not be included, since I think it was taken from the mrlegacy jar.
Re: Codebase refactoring proposal
IndexedDataset uses Guava. Can't tell for sure, but it sounds like this would not be included, since I think it was taken from the mrlegacy jar.

On Jan 25, 2015, at 10:52 AM, Dmitriy Lyubimov wrote:

> -- Forwarded message --
> From: "Pat Ferrel"
> Date: Jan 25, 2015 9:39 AM
> Subject: Re: Codebase refactoring proposal
Re: Codebase refactoring proposal
-- Forwarded message --
From: "Pat Ferrel"
Date: Jan 25, 2015 9:39 AM
Subject: Re: Codebase refactoring proposal
To:
Cc:

> When you get a chance a PR would be good.

Yes, it would. And not just for that.

> As I understand it you are putting some class jars somewhere in the
> classpath. Where? How?

/bin/mahout

(Computes 2 different classpaths. See 'bin/mahout classpath' vs. 'bin/mahout -spark'.)

If I interpret the current shell code there correctly, the legacy path tries to use the examples assemblies if not packaged, or /lib if packaged. The true motivation of that significantly predates 2010, and I suspect only Benson knows the whole intent there.

The spark path, which is really a quick hack of the script, tries to pick up only selected mahout jars plus the locally installed Spark classpath, which I guess is just the shaded Spark jar in recent Spark releases. It also apparently tries to include /libs/*, which is never compiled in the unpackaged version; I now think including it is a bug, because /libs/* is apparently legacy packaging and shouldn't be pulled into Spark jobs with a wildcard. I can't believe how lazy I am; I still have not found time to understand the Mahout build in all cases.

I am not even sure packaged Mahout will work with Spark, honestly, because of the /lib. I have never tried that, since I mostly use application embedding techniques.

> The same solution may apply to adding external dependencies and removing
> the assembly in the Spark module, which would leave only one major build
> issue afaik.

On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote:

> No, no PR. Only an experiment in private.
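[The "application embedding" mentioned above bypasses bin/mahout and its classpath computation entirely. A minimal sketch of embedded use, assuming a built Mahout tree with MAHOUT_HOME set (so mahoutSparkContext can locate the mahout jars); the object name and master URL are illustrative, while mahoutSparkContext, drmParallelize and dense are the actual sparkbindings/scalabindings API:]

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object EmbeddedAlgebra extends App {
  // Spark-backed distributed context; by default this also ships the
  // mahout-spark, mahout-math(-scala) and mahout-hadoop jars it finds under
  // MAHOUT_HOME to the workers, so no assembly jar is involved.
  implicit val ctx: DistributedContext =
    mahoutSparkContext("spark://localhost:7077", "embedded-algebra")

  val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)))  // tiny test matrix
  println((drmA.t %*% drmA).collect)                      // A' * A, collected in-core
}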
Re: Codebase refactoring proposal
When you get a chance a PR would be good.

As I understand it you are putting some class jars somewhere in the classpath. Where? How?

The same solution may apply to adding external dependencies and removing the assembly in the Spark module, which would leave only one major build issue afaik.

On Jan 24, 2015, at 11:53 PM, Dmitriy Lyubimov wrote:

> No, no PR. Only an experiment in private.
Re: Codebase refactoring proposal
No, no PR. Only an experiment in private. But I believe I have sufficiently defined what I want to do in order to gauge whether we may want to advance it some time later. The goal is a much lighter dependency footprint for the Spark code: eliminate everything that is not a compile-time dependency (and a lot of it comes in through legacy MR code, which we of course don't use).

Can't say I understand the remaining issues you are talking about, though.

If you are talking about compiling lib or a shaded assembly, no, this doesn't do anything about that. Although the point is that, as it stands, the algebra and shell have no external dependencies but Spark and these 4 (5?) mahout jars, so they technically don't even need an assembly (as demonstrated).

As I said, it seems driver code is the only code that may need some external dependencies, but that's a different scenario from the ones I am talking about. I am relatively happy with having the first two working nicely at this point.

On Sat, Jan 24, 2015 at 9:06 AM, Pat Ferrel wrote:

> +1
>
> Is there a PR? You mention a "tiny mahout-hadoop" module.
Re: Codebase refactoring proposal
+1

Is there a PR? You mention a "tiny mahout-hadoop" module. It would be nice to see how you've structured that in case we can use the same model to solve the two remaining refactoring issues:

1) external dependencies in the spark module
2) no spark or h2o in the release artifacts.

On Jan 23, 2015, at 6:45 PM, Shannon Quinn wrote:

> Also +1
Re: Codebase refactoring proposal
Also +1

iPhone'd

On Jan 23, 2015, at 18:38, Andrew Palumbo wrote:

> +1
RE: Codebase refactoring proposal
+1

Sent from my Verizon Wireless 4G LTE smartphone

Original message From: Dmitriy Lyubimov Date: 01/23/2015 6:06 PM (GMT-05:00) To: dev@mahout.apache.org Subject: Codebase refactoring proposal
Re: Codebase refactoring proposal
Sorry, I meant _without_ mr-legacy on the classpath.

On Fri, Jan 23, 2015 at 3:31 PM, Dmitriy Lyubimov wrote:

> And in case anyone wonders yes shell starts and runs test script totally
> fine with mrlegacy dependency on classpath (startup script modified to use
> mahout-hadoop instead) -- both in local and distributed (standalone) mode:
Re: Codebase refactoring proposal
And in case anyone wonders yes shell starts and runs test script totally fine with mrlegacy dependency on classpath (startup script modified to use mahout-hadoop instead) -- both in local and distributed (standalone) mode:

$ MASTER=spark://localhost:7077 bin/mahout spark-shell

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 1.0

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
15/01/23 15:28:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/01/23 15:28:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Created spark context..
Mahout distributed context is available as "implicit val sdc".

mahout> :load spark-shell/src/test/mahout/simple.mscala
Loading spark-shell/src/test/mahout/simple.mscala...
a: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:1.0,1:2.0,2:3.0}
 1 => {0:3.0,1:4.0,2:5.0}
}
drmA: org.apache.mahout.math.drm.CheckpointedDrm[Int] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5
drmAtA: org.apache.mahout.math.drm.DrmLike[Int] = OpAB(OpAt(org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5),org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@7940bbc5)
r: org.apache.mahout.math.drm.CheckpointedDrm[Int] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@3c46dadf
res4: org.apache.mahout.math.Matrix =
{
 0 => {0:11.0,1:15.0,2:19.0}
 1 => {0:15.0,1:21.0,2:27.0}
 2 => {0:19.0,1:27.0,2:35.0}
}
mahout>

On Fri, Jan 23, 2015 at 3:07 PM, Suneel Marthi wrote:

> +1
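[The archive does not reproduce the test script itself. A minimal .mscala in the same spirit, matching the variable names in the session above, might read as follows; this is an assumption, not the actual spark-shell/src/test/mahout/simple.mscala, and the scalabindings/drm imports are preloaded by the Mahout shell:]

// Hypothetical reconstruction, not the actual simple.mscala:
// parallelize a small in-core matrix, express A' * A lazily through the
// DSL optimizer, then checkpoint it and collect the product back in-core.
val a = dense((1, 2, 3), (3, 4, 5))
val drmA = drmParallelize(a)
val drmAtA = drmA.t %*% drmA  // stays lazy: OpAB(OpAt(A), A), as printed above
val r = drmAtA.checkpoint()
r.collect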
Re: Codebase refactoring proposal
+1

On Fri, Jan 23, 2015 at 6:04 PM, Dmitriy Lyubimov wrote:

> So right now mahout-spark depends on mr-legacy.
Codebase refactoring proposal
So right now mahout-spark depends on mr-legacy. I did a quick refactoring, and it turns out it only _irrevocably_ depends on the following classes there:

MatrixWritable, VectorWritable, Varint/Varlong and VarintWritable, and ... *sigh* o.a.m.common.Pair

So I just dropped those five classes into a new tiny mahout-hadoop module (to signify stuff that is directly relevant to serializing things to the DFS API) and completely removed mr-legacy and its transients from the spark and spark-shell dependencies.

So non-cli applications (shell scripts and embedded api use) actually only need the Spark dependencies (which come from the SPARK_HOME classpath, of course) and the mahout jars (mahout-spark, mahout-math(-scala), mahout-hadoop, and optionally mahout-spark-shell for running the shell).

This of course still doesn't address the driver problems that want to throw more stuff onto the front-end classpath (such as a CLI parser), but at least it renders the transitive luggage of mr-legacy (and the size of the worker-shipped jars) much more tolerable.

How does that sound?
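[As an aside on what those retained classes are for: they exist so Mahout vectors and matrices can round-trip through Hadoop's Writable machinery. A minimal sketch of that serialization path, assuming only mahout-math, the new mahout-hadoop module and hadoop-client on the classpath; the object name and /tmp path are illustrative, not from the patch:]

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile}
import org.apache.mahout.math.{DenseVector, VectorWritable}

object VectorWritableRoundTrip extends App {
  val conf = new Configuration()
  val path = new Path("/tmp/vectors.seq")  // illustrative location

  // Write one (row index, vector) pair; VectorWritable is one of the five
  // classes the new mahout-hadoop module retains for exactly this purpose.
  val writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(path),
    SequenceFile.Writer.keyClass(classOf[IntWritable]),
    SequenceFile.Writer.valueClass(classOf[VectorWritable]))
  writer.append(new IntWritable(0), new VectorWritable(new DenseVector(Array(1.0, 2.0, 3.0))))
  writer.close()

  // Read it back through the same DFS API.
  val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))
  val key = new IntWritable()
  val value = new VectorWritable()
  while (reader.next(key, value)) println(s"row ${key.get}: ${value.get}")
  reader.close()
}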