Re: The case for a Commons component

Gilles Sadowski Mon, 26 Apr 2021 10:14:16 -0700

On Sun, 25 Apr 2021 at 16:27, sebb <seb...@gmail.com> wrote:
>
> I assume this thread is about the possible ML component.

I hesitated with Subject: "The case for *any* Commons component".

> If the code was developed by Commons, I assume it could be used as
> part of Spark.
> However Commons does not currently have many developers who are
> familiar with the field.
> So it would seem to me better to have development done by a project
> which does have relevant experience.

I expressed the same concern/opinion; in fact, if I were tempted
to implement something of the kind now, I would probably
start experimenting with Spark. [CM's implementation of SOFM
dates from early 2014.]

On the other hand, several people (at different times) expressed
an interest in having such code free of the "high-level" features
that come with the "platforms".
My own current usage of the "neuralnet" package does not
warrant a move to Spark.
I'm also interested in refactoring the "clustering" package (but will
not pursue it alone).

> You say that Spark etc have lots of jars.
> Surely that allows for it to be implemented as a separate jar which
> can either be used as part of the Spark platform, or used
> independently?

https://spark.apache.org/docs/latest/spark-standalone.html

TL;DR, but there are many references to a "cluster", so that seems
to be the common use-case, whereas code here could, for example, focus on
multi-thread-ready components (primarily targeting applications that
run on a single multi-core machine).

> The only other option I see is for Commons to persuade some developers
> who are familiar with the field to join Commons to assist with the
> algorithms.
> Existing Commons developers can help manage the logistics of packaging
> and releasing the code, as this does not require in depth knowledge of
> the design.
> However this only makes sense if the developers skilled in the area are
> prepared to assist long-term.

I try to make that crystal-clear to every new contributor (cf. proposal to
revive "Commons Graph", the exchange on refactoring the "clustering"
package, the necessary features for a GA implementation that purports
to be more than a toy example, ...).

However, it is obviously impossible to enforce something like "prepared
to assist long-term"; it is rightfully a necessary condition for being
granted commit access, but it's up to the project to create a "place"
where people want to stay (and know what to expect).
For people interested in "ML" (not necessarily experts: they could be
developers willing to implement standard algorithms, as we did in CM),
it means that there should be global guidelines (like there were for CM),
e.g. "multi-thread-ready" (in addition to the usual "full doc",
"full coverage", etc.), and a repository for that code.

We don't have much grasp on the arrival rate of contributors but I
contend that a component with a specific scope is much more
appealing (especially to newcomers) than a mixed bag à la CM
which nobody here is able (or willing) to maintain (and the reason
why I'll only merge bug-fixes).

Not creating the "place" will of course pave the way to a self-fulfilling
prophecy.

Gilles

> On Sat, 24 Apr 2021 at 23:32, Paul King <paul.king.as...@gmail.com> wrote:
> >
> > Thanks Gilles,
> >
> > I can provide the same sort of stats for a clustering example
> > across commons-math (KMeans) vs Apache Ignite, Apache Spark and
> > Rheem/Apache Wayang (incubating) if anyone would find that useful. It
> > would no doubt lead to similar conclusions.
> >
> > Cheers, Paul.
> >
> > > On Sun, Apr 25, 2021 at 8:15 AM Gilles Sadowski <gillese...@gmail.com> wrote:
> > >
> > > Hello Paul.
> > >
> > > > On Sat, 24 Apr 2021 at 04:42, Paul King <paul.king.as...@gmail.com> wrote:
> > > >
> > > > > I added some more comments relevant to whether the proposed algorithm
> > > > belongs somewhere in the commons "math" area back in the Jira:
> > > >
> > > > https://issues.apache.org/jira/browse/MATH-1563
> > >
> > > Thanks for a "real" user's testimony.
> > >
> > > As the ML is still the official forum for such a discussion, I'm quoting
> > > part of your post on JIRA:
> > > ---CUT---
> > > For linear regression, taking just one example dataset, commons-math
> > > is a couple of library calls for a single 2M library and solves the
> > > problem in 240ms. Both Ignite and Spark involve "firing up the
> > > platform" and the code is more complex for simple scenarios. Spark has
> > > a 181M footprint across 210 jars and solves the problem in about 20s.
> > > Ignite has a 87M footprint across 85 jars and solves the problem in >
> > > 40s. But I can also find more complex scenarios which need to scale
> > > where Ignite and Spark really come into their own.
> > > ---CUT---
> > >
> > > A similar rationale was behind my developing/using the SOFM
> > > functionality in the "o.a.c.m.ml.neuralnet" package: I needed a
> > > proof of concept, and taking the "lightweight" path seemed more
> > > effective than experimenting with those platforms.
> > > > Admittedly, at that time, there were people around who were
> > > maintaining the clustering and GA codes; hence, the prototyping
> > > of a machine-learning library didn't look strange to anyone.
> > >
> > > Regards,
> > > Gilles
> > >
> > > >>> [...]
