I don't know why. I said I didn't see either as a problem; as far as I
am concerned, I had encountered both needs in the past and did not even
notice they were a problem. Both are not relevant to this thread. Not
sure. I'd suggest starting a separate thread.
Speaking of my priorities, the two biggest problems I see are in-core
performance and tons of archaic dependencies, but only one of those
belongs in this thread. The third biggest problem is general bugs and
code tidiness.

On Feb 8, 2015 8:22 PM, "Pat Ferrel" <[email protected]> wrote:

> OK, well perhaps those two lines of code (actually I agree, there's
> not much more) can also be applied to TF-IDF and several other
> algorithms to get a much higher level of interoperability and keep us
> from reinventing things when not necessary. Funny we have type
> conversions for so many things *but* MLlib. I've been arguing about
> what an uneven state MLlib is in, but it does solve problems we don't
> need to reinvent. Frankly, adopting the best of MLlib makes Mahout a
> superset along with all its other virtues.
>
> And yes, I forgot to also praise the DSL's optimizer—now rectified.
>
> Why do we spend more time on engine-agnostic decisions than on these
> more pragmatic ones?
>
> On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> The conversion from a DRM to an RDD of vectors for KMeans is one
> line. KMeans application and conversion back is another line. I
> actually did that some time ago. I am sure you can figure out the
> details.
>
> Whether it is worth retaining some commonality: no, it is not worth
> it until there is commonality across MLlib itself.
>
> At which point we may just include conversions for those who are
> interested. Until then, all we can do is maintain commonality with
> MLlib KMeans specifically, but not with MLlib as a whole.
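>
> For concreteness, those two lines in sketch form. Untested and from
> memory: it assumes a checkpointed DRM drmA whose underlying
> RDD[(Int, o.a.m.math.Vector)] is reachable as .rdd through the Spark
> bindings, plus MLlib's stock KMeans.train; the converters handle the
> dense case only.
>
>   import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
>   import org.apache.mahout.sparkbindings._
>   import org.apache.spark.mllib.clustering.KMeans
>   import org.apache.spark.mllib.linalg.{Vectors, Vector => MllibVector}
>
>   // converters between the two in-core vector types (dense only)
>   def toMllib(v: MahoutVector): MllibVector =
>     Vectors.dense(Array.tabulate(v.size)(v.get))
>   def toMahout(v: MllibVector): MahoutVector = new DenseVector(v.toArray)
>
>   // line 1: view the DRM rows as an RDD of mllib vectors
>   val data = drmA.rdd.map { case (_, v) => toMllib(v) }.cache()
>   val model = KMeans.train(data, 20, 10) // k = 20, 10 iterations
>
>   // line 2: wrap results back into a DRM, e.g. rows re-keyed by
>   // their assigned cluster id
>   val drmClustered = drmWrap(
>     drmA.rdd.map { case (_, v) => (model.predict(toMllib(v)), v) })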
> On Feb 8, 2015 7:45 PM, "Pat Ferrel" <[email protected]> wrote:
>
> > I completely understand that MLlib lacks anything like the
> > completeness of Mahout's DSL; I know of no other scalable solution
> > to match it. I don't know how many times this has to be said. This
> > is something we can all get behind as *unique* to Mahout.
> >
> > But I stand by the statement that there should also be some
> > lower-level data commonality. There is too much similarity to
> > dismiss it and go completely non-overlapping ways. Even if you can
> > argue for maintaining separate parallel ways, let's have some type
> > conversions (I hesitate to say easy to use). They shouldn't be all
> > that hard.
> >
> > A conversion of a DRM of o.a.m.Vector to an RDD of MLlib Vector and
> > back would solve my KMeans use case. You know MLlib better than I
> > do, so choose the best level to perform type conversions or
> > inheritance splicing. The point is to make the two as seamless as
> > possible. Doesn't this seem a worthy goal?
> >
> > On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Pat,
> >
> > I *just* made a case in this thread explaining that MLlib does not
> > have a single distributed matrix type and that its own
> > methodologies do not interoperate within itself for that reason.
> > Therefore, it is fundamentally impossible to be interoperable with
> > MLlib, since nobody can really define what that means in terms of
> > distributed types.
> >
> > You are in fact referring to their in-core type, not a distributed
> > type. But there's no linear algebra operation support to speak of
> > there either. It is, simply, not algebra at the moment. The types
> > in this hierarchy are just memory storage models, plus
> > private-scope converters to Breeze storage models, but they are not
> > true linalg APIs nor providers of such.
> >
> > One might conceivably want to standardize on the Breeze APIs, since
> > those are both linalg APIs and providers, but not on the type
> > you've been mentioning.
> >
> > However, it is not a very happy path either. Breeze is a somewhat
> > more interesting substrate to build in-core operations on, but if
> > you read the Spark forum of late, even Spark developers express a
> > whiff of dissatisfaction with it in favor of BIDMat (me too, btw).
> > But while they say BIDMat would be a better choice for in-core
> > operators, they also recognize the fact that they are too invested
> > in the Breeze API by now, and such a move would not be cheap across
> > the board.
> >
> > And that demonstrates another problem in the in-core MLlib
> > architecture: on one side, they don't have a sufficient public
> > in-core DSL or API to speak of; but they also do not have a
> > sufficiently abstract API for in-core BLAS plugins either, to be
> > truly agnostic of the available in-core methodologies.
> >
> > So what you are talking about is simply not possible with the
> > current state of things there. But if it were, I'd just suggest you
> > try to port the algebraic things you like in Mahout to MLlib.
> >
> > My guess, however, is that you'd find that porting the algebraic
> > optimizer with a proper level of consistency with in-core
> > operations will not be easy, for reasons including, but not limited
> > to, the ones I just mentioned; although individual BLAS-like things
> > such as the matrix square you've mentioned would be fairly easy to
> > do for one of the distributed matrix types in MLlib. But that of
> > course would not be an R-like environment and not an optimizer.
> >
> > I like BIDMat a lot, though; but it is not a truly hybrid and
> > self-adjusting environment for in-core operations either (and its
> > DSL is neither R-like nor Matlab-like, so it takes a bit of
> > adjusting to). For that reason even the BIDMat linalg types and DSL
> > are not truly versatile enough for our (well, my, anyway) purposes
> > (which are to find the best hardware or software subroutine
> > automatically, given the current hardware and software platform
> > architecture and the parameters of the requested operation).
> >
> > On Feb 8, 2015 9:05 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> >> Why aren't we using linalg.Vector and its siblings? The same could
> >> be asked for linalg.Matrix. If we want to prune dependencies this
> >> would help, and it would also significantly increase
> >> interoperability.
> >>
> >> Case-now: I have a real need to cluster items in a CF-type input
> >> matrix. The input matrix A' has rows of items. I need to drop this
> >> into a sequence file and use Mahout's Hadoop KMeans. Ugh. Or I
> >> need to convert A' into an RDD of linalg.Vectors and use MLlib
> >> KMeans. The conversion is not too bad and maybe could be helped
> >> with some implicit conversions mahout.Vector <-> linalg.Vector
> >> (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> >> KMeans).
> >>
> >> Case-possible: If we adopted linalg.Vector as the native format,
> >> and perhaps even linalg.Matrix, this would give immediate
> >> interoperability in some areas, including my specific need. It
> >> would significantly pare down dependencies not provided by the
> >> environment (Mahout-math). It would also support creating
> >> distributed computation methods that would work on MLlib and
> >> Mahout datasets, addressing Gokhan's question.
> >>
> >> I looked at another "Case-now" possibility, which was to go all
> >> MLlib with item similarity. I found that MLlib doesn't have a
> >> transpose—"transpose, why would you want to do that?" Not even in
> >> the multiply forms A'A, A'B, AA', all used in item and row
> >> similarity. That stopped me from looking deeper.
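> >>
> >> For contrast, each of those forms is a one-liner in the Mahout
> >> DSL, and as I understand it the optimizer rewrites them into fused
> >> physical operators rather than materializing the transpose. A
> >> quick sketch with the R-like DSL (A and B standing for any
> >> DrmLike[Int]):
> >>
> >>   import org.apache.mahout.math.drm._
> >>   import org.apache.mahout.math.drm.RLikeDrmOps._
> >>
> >>   val AtA = A.t %*% A // the A'A used in item similarity
> >>   val AtB = A.t %*% B // A'B, the cross-cooccurrence flavor
> >>   val AAt = A %*% A.t // AA', used in row similarity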
> >> The strength and unique value of Mahout is the completeness of
> >> its generalized linear algebra DSL. But insistence on using
> >> Mahout-specific data types is also a barrier to Spark people
> >> adopting the DSL. Not having lower-level interoperability is a
> >> barrier both ways to mixing Mahout and MLlib, creating unnecessary
> >> either/or choices for devs.
> >>
> >> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <[email protected]> wrote:
> >>
> >>> What I am saying is that for certain algorithms, including both
> >>> engine-specific (such as aggregation) and DSL stuff, what is the
> >>> best way of handling them?
> >>>
> >>> i) should we add the distributed operations to the Mahout
> >>> codebase as it is proposed in #62?
> >>
> >> Imo this can't go very well or very far (because of the engine
> >> specifics), but I'd be willing to see an experiment with simple
> >> things like map and reduce.
> >>
> >> The bigger questions are where exactly we will have to stop (we
> >> can't abstract all capabilities out there because of "common
> >> denominator" issues), and what percentage of methods this will
> >> truly allow to migrate to full backend portability.
> >>
> >> And if, after doing all this, we still find ourselves writing
> >> engine-specific mixes, why bother? Wouldn't it be better to find a
> >> good, easy-to-replicate, incrementally-developed pattern to
> >> register and apply engine-specific strategies for every method? (A
> >> rough sketch of what I mean is in the P.S. at the bottom.)
> >>
> >>> ii) should we have [engine]-ml modules (like spark-bindings and
> >>> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> >>
> >> This is not quite what I am proposing. Rather, engine-ml modules
> >> holding engine-specific _parts_ of algorithms.
> >>
> >> However, this really needs a POC over a guinea pig (similarly to
> >> how we POC'd algebra in the first place, with ssvd and spca).
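> >>
> >> P.S. The registration pattern I have in mind might look roughly
> >> like the following. This is a sketch only; every name in it is
> >> invented for illustration and none of it is an existing Mahout
> >> API:
> >>
> >>   import org.apache.mahout.math.drm.DrmLike
> >>
> >>   // one strategy per (logical method, engine) pair
> >>   trait EngineStrategy[A] {
> >>     def engine: String // e.g. "spark", "h2o"
> >>     def run(input: DrmLike[Int]): A
> >>   }
> >>
> >>   object StrategyRegistry {
> >>     private val registry =
> >>       scala.collection.mutable.Map.empty[(String, String), EngineStrategy[_]]
> >>
> >>     def register(method: String, s: EngineStrategy[_]): Unit =
> >>       registry((method, s.engine)) = s
> >>
> >>     // an algorithm written against the DSL asks for its
> >>     // engine-specific part here, at the points where the
> >>     // common-denominator DSL runs out
> >>     def apply[A](method: String, engine: String): EngineStrategy[A] =
> >>       registry((method, engine)).asInstanceOf[EngineStrategy[A]]
> >>   }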
