I don't know why. I said I didn't see either as a problem; as far as I
am concerned, I had encountered both needs in the past and did not even
notice they were a problem. Both are not relevant to this thread. Not
sure. I'd suggest starting a separate thread.
Speaking of my priorities, the two biggest problems I see are in-core
performance and tons of archaic dependencies, but only one of those
belongs in this thread. The third biggest problem is general bugs and
code tidiness.

On Feb 8, 2015 8:22 PM, "Pat Ferrel" <[email protected]> wrote:

> OK, well perhaps those two lines of code (actually I agree, there's
> not much more) can also be applied to TF-IDF and several other
> algorithms to get a much higher level of interoperability and keep us
> from reinventing things when not necessary. Funny we have type
> conversions for so many things *but* MLlib. I've been arguing about
> what an uneven state MLlib is in, but it does solve problems we don't
> need to reinvent. Frankly, adopting the best of MLlib makes Mahout a
> superset along with all its other virtues.
>
> And yes, I forgot to also praise the DSL's optimizer—now rectified.
>
> Why do we spend more time on engine-agnostic decisions than on these
> more pragmatic ones?
>
> On Feb 8, 2015, at 7:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> The conversion from a DRM to an RDD of vectors for KMeans is one
> line. KMeans application and conversion back is another line. I
> actually did that some time ago. I am sure you can figure out the
> details.
>
> Whether it is worth retaining some commonality: no, it is not worth
> it until there is commonality across MLlib itself.
>
> At which point we may just include conversions for those who are
> interested. Until then, all we can do is maintain commonality with
> MLlib KMeans specifically, but not with MLlib as a whole.
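>
> For concreteness, those two lines in sketch form. Untested and from
> memory: it assumes a checkpointed DRM drmA whose underlying
> RDD[(Int, o.a.m.math.Vector)] is reachable as .rdd through the Spark
> bindings, plus MLlib's stock KMeans.train; the converters handle the
> dense case only.
>
>   import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
>   import org.apache.mahout.sparkbindings._
>   import org.apache.spark.mllib.clustering.KMeans
>   import org.apache.spark.mllib.linalg.{Vectors, Vector => MllibVector}
>
>   // converters between the two in-core vector types (dense only)
>   def toMllib(v: MahoutVector): MllibVector =
>     Vectors.dense(Array.tabulate(v.size)(v.get))
>   def toMahout(v: MllibVector): MahoutVector = new DenseVector(v.toArray)
>
>   // line 1: view the DRM rows as an RDD of mllib vectors
>   val data = drmA.rdd.map { case (_, v) => toMllib(v) }.cache()
>   val model = KMeans.train(data, 20, 10) // k = 20, 10 iterations
>
>   // line 2: wrap results back into a DRM, e.g. rows re-keyed by
>   // their assigned cluster id
>   val drmClustered = drmWrap(
>     drmA.rdd.map { case (_, v) => (model.predict(toMllib(v)), v) })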
> On Feb 8, 2015 7:45 PM, "Pat Ferrel" <[email protected]> wrote:
>
> > I completely understand that MLlib lacks anything like the
> > completeness of Mahout's DSL; I know of no other scalable solution
> > to match it. I don't know how many times this has to be said. This
> > is something we can all get behind as *unique* to Mahout.
> >
> > But I stand by the statement that there should also be some
> > lower-level data commonality. There is too much similarity to
> > dismiss it and go completely non-overlapping ways. Even if you can
> > argue for maintaining separate parallel ways, let's have some type
> > conversions (I hesitate to say easy to use). They shouldn't be all
> > that hard.
> >
> > A conversion of a DRM of o.a.m.Vector to an RDD of MLlib Vector and
> > back would solve my KMeans use case. You know MLlib better than I
> > do, so choose the best level to perform type conversions or
> > inheritance splicing. The point is to make the two as seamless as
> > possible. Doesn't this seem a worthy goal?
> >
> > On Feb 8, 2015, at 4:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Pat,
> >
> > I *just* made a case in this thread explaining that MLlib does not
> > have a single distributed matrix type and that its own
> > methodologies do not interoperate within itself for that reason.
> > Therefore, it is fundamentally impossible to be interoperable with
> > MLlib, since nobody can really define what that means in terms of
> > distributed types.
> >
> > You are in fact referring to their in-core type, not a distributed
> > type. But there's no linear algebra operation support to speak of
> > there either. It is, simply, not algebra at the moment. The types
> > in this hierarchy are just memory storage models, plus
> > private-scope converters to Breeze storage models, but they are not
> > true linalg APIs nor providers of such.
> >
> > One might conceivably want to standardize on the Breeze APIs, since
> > those are both linalg APIs and providers, but not on the type
> > you've been mentioning.
> >
> > However, it is not a very happy path either. Breeze is a somewhat
> > more interesting substrate to build in-core operations on, but if
> > you read the Spark forum of late, even Spark developers express a
> > whiff of dissatisfaction with it in favor of BIDMat (me too, btw).
> > But while they say BIDMat would be a better choice for in-core
> > operators, they also recognize the fact that they are too invested
> > in the Breeze API by now, and such a move would not be cheap across
> > the board.
> >
> > And that demonstrates another problem in the in-core MLlib
> > architecture: on one side, they don't have a sufficient public
> > in-core DSL or API to speak of; but they also do not have a
> > sufficiently abstract API for in-core BLAS plugins either, to be
> > truly agnostic of the available in-core methodologies.
> >
> > So what you are talking about is simply not possible with the
> > current state of things there. But if it were, I'd just suggest you
> > try to port the algebraic things you like in Mahout to MLlib.
> >
> > My guess, however, is that you'd find that porting the algebraic
> > optimizer with a proper level of consistency with in-core
> > operations will not be easy, for reasons including, but not limited
> > to, the ones I just mentioned; although individual BLAS-like things
> > such as the matrix square you've mentioned would be fairly easy to
> > do for one of the distributed matrix types in MLlib. But that of
> > course would not be an R-like environment and not an optimizer.
> >
> > I like BIDMat a lot, though; but it is not a truly hybrid and
> > self-adjusting environment for in-core operations either (and its
> > DSL is neither R-like nor Matlab-like, so it takes a bit of
> > adjusting to). For that reason even the BIDMat linalg types and DSL
> > are not truly versatile enough for our (well, my, anyway) purposes
> > (which are to find the best hardware or software subroutine
> > automatically, given the current hardware and software platform
> > architecture and the parameters of the requested operation).
> >
> > On Feb 8, 2015 9:05 AM, "Pat Ferrel" <[email protected]> wrote:
> >
> >> Why aren't we using linalg.Vector and its siblings? The same could
> >> be asked for linalg.Matrix. If we want to prune dependencies this
> >> would help, and it would also significantly increase
> >> interoperability.
> >>
> >> Case-now: I have a real need to cluster items in a CF-type input
> >> matrix. The input matrix A' has rows of items. I need to drop this
> >> into a sequence file and use Mahout's Hadoop KMeans. Ugh. Or I
> >> need to convert A' into an RDD of linalg.Vectors and use MLlib
> >> KMeans. The conversion is not too bad and maybe could be helped
> >> with some implicit conversions mahout.Vector <-> linalg.Vector
> >> (maybe mahout.DRM <-> linalg.Matrix, though not needed for
> >> KMeans).
> >>
> >> Case-possible: If we adopted linalg.Vector as the native format,
> >> and perhaps even linalg.Matrix, this would give immediate
> >> interoperability in some areas, including my specific need. It
> >> would significantly pare down dependencies not provided by the
> >> environment (Mahout-math). It would also support creating
> >> distributed computation methods that would work on MLlib and
> >> Mahout datasets, addressing Gokhan's question.
> >>
> >> I looked at another "Case-now" possibility, which was to go all
> >> MLlib with item similarity. I found that MLlib doesn't have a
> >> transpose—"transpose, why would you want to do that?" Not even in
> >> the multiply forms A'A, A'B, AA', all used in item and row
> >> similarity. That stopped me from looking deeper.
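> >>
> >> For contrast, each of those forms is a one-liner in the Mahout
> >> DSL, and as I understand it the optimizer rewrites them into fused
> >> physical operators rather than materializing the transpose. A
> >> quick sketch with the R-like DSL (A and B standing for any
> >> DrmLike[Int]):
> >>
> >>   import org.apache.mahout.math.drm._
> >>   import org.apache.mahout.math.drm.RLikeDrmOps._
> >>
> >>   val AtA = A.t %*% A // the A'A used in item similarity
> >>   val AtB = A.t %*% B // A'B, the cross-cooccurrence flavor
> >>   val AAt = A %*% A.t // AA', used in row similarity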
> >> The strength and unique value of Mahout is the completeness of
> >> its generalized linear algebra DSL. But insistence on using
> >> Mahout-specific data types is also a barrier to Spark people
> >> adopting the DSL. Not having lower-level interoperability is a
> >> barrier both ways to mixing Mahout and MLlib, creating unnecessary
> >> either/or choices for devs.
> >>
> >> On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <[email protected]> wrote:
> >>
> >>> What I am saying is that for certain algorithms, including both
> >>> engine-specific (such as aggregation) and DSL stuff, what is the
> >>> best way of handling them?
> >>>
> >>> i) should we add the distributed operations to the Mahout
> >>> codebase as it is proposed in #62?
> >>
> >> Imo this can't go very well or very far (because of the engine
> >> specifics), but I'd be willing to see an experiment with simple
> >> things like map and reduce.
> >>
> >> The bigger questions are where exactly we will have to stop (we
> >> can't abstract all capabilities out there because of "common
> >> denominator" issues), and what percentage of methods this will
> >> truly allow to migrate to full backend portability.
> >>
> >> And if, after doing all this, we still find ourselves writing
> >> engine-specific mixes, why bother? Wouldn't it be better to find a
> >> good, easy-to-replicate, incrementally-developed pattern to
> >> register and apply engine-specific strategies for every method? (A
> >> rough sketch of what I mean is in the P.S. at the bottom.)
> >>
> >>> ii) should we have [engine]-ml modules (like spark-bindings and
> >>> h2o-bindings) where we can mix the DSL and engine-specific stuff?
> >>
> >> This is not quite what I am proposing. Rather, engine-ml modules
> >> holding engine-specific _parts_ of algorithms.
> >>
> >> However, this really needs a POC over a guinea pig (similarly to
> >> how we POC'd algebra in the first place, with ssvd and spca).
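> >>
> >> P.S. The registration pattern I have in mind might look roughly
> >> like the following. This is a sketch only; every name in it is
> >> invented for illustration and none of it is an existing Mahout
> >> API:
> >>
> >>   import org.apache.mahout.math.drm.DrmLike
> >>
> >>   // one strategy per (logical method, engine) pair
> >>   trait EngineStrategy[A] {
> >>     def engine: String // e.g. "spark", "h2o"
> >>     def run(input: DrmLike[Int]): A
> >>   }
> >>
> >>   object StrategyRegistry {
> >>     private val registry =
> >>       scala.collection.mutable.Map.empty[(String, String), EngineStrategy[_]]
> >>
> >>     def register(method: String, s: EngineStrategy[_]): Unit =
> >>       registry((method, s.engine)) = s
> >>
> >>     // an algorithm written against the DSL asks for its
> >>     // engine-specific part here, at the points where the
> >>     // common-denominator DSL runs out
> >>     def apply[A](method: String, engine: String): EngineStrategy[A] =
> >>       registry((method, engine)).asInstanceOf[EngineStrategy[A]]
> >>   }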
