Re: [MATH][GENETICS][PR-199] Decision on use and customization of RNG functionality for randomization

Gilles Sadowski Wed, 22 Dec 2021 07:02:33 -0800

Hello.

On Wed, 22 Dec 2021 at 14:25, Avijit Basak <[email protected]> wrote:
>
> Hi All
>
>         Please see my comments below.
>
> >> >Several problems with this approach (raised in previous messages IIRC):
> >> >1. Potential performance loss in sharing the same RNG instance.
> >> -- As per my understanding ThreadLocalRandomSource creates separate
> >> instances of UniformRandomProvider for each thread. So I am not sure how
> >> a UniformRandomProvider instance is being shared. Please correct me if I
> >> am wrong.
>
> >Within a given thread there will be *one* RNG instance; that's what I meant
> >by "shared".
> >Of course you are right that that instance is not shared by multiple
> >threads (which would be a bug).
> >The performance loss is because it will be necessary to call
> >  ThreadLocalRandomSource.current(RandomSource source)
> >for each access to the RNG (since it would be a bug to store the returned
> >value in e.g. an operator instance that would be shared among threads, as
> >you suggest below).
>
> -- I tried to do a small test on it and here are the results. Output times
> are in milliseconds. According to my understanding the performance loss is
> mostly during creation of per thread instance of UniformRandomProvider.
> --*CUT*--
>     @Test
>     void test() {
>         int limit = 1;
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 1000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 10000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 100000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 1000000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 10000000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 100000000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>
>         limit = 1000000000;
>         start = System.currentTimeMillis();
>         for (int i = 0; i < limit; i++) {
>             ThreadLocalRandomSource.current(RandomSource.JDK);
>         }
>         System.out.println(System.currentTimeMillis() - start);
>     }
> --*CUT*--
> --*output*--
> 363
> 1
> 2
> 4
> 6
> 28
> 244
> 2423
> --*output*--


As I've already indicated, "ThreadLocalRandomSource" is, IMHO, a
sort of workaround for a multi-thread application that does not want
to bother managing per-thread RNG instance(s).
The library should not make that decision for the application since we
can cater for both usages: Every piece of the GA that needs an RNG can
provide factory methods that either take a "RandomSource" argument
or create a default one.
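
A minimal sketch of what such factory methods could look like.
Everything in it is hypothetical: the class name "BitFlipMutation" is
made up for illustration, and "java.util.SplittableRandom" stands in
for a Commons RNG "UniformRandomProvider" so that the snippet compiles
with the JDK alone.

```java
import java.util.SplittableRandom;

/**
 * Hypothetical mutation operator (not part of the PR).
 * SplittableRandom is a stand-in for UniformRandomProvider.
 */
final class BitFlipMutation {
    private final SplittableRandom rng;
    private final double rate;

    // Private constructor: instances are only reachable through the factories.
    private BitFlipMutation(SplittableRandom rng, double rate) {
        this.rng = rng;
        this.rate = rate;
    }

    /** The caller chooses the generator (and thus the seed and the algorithm). */
    static BitFlipMutation of(SplittableRandom rng, double rate) {
        return new BitFlipMutation(rng, rate);
    }

    /** Convenience factory: the operator creates its own default generator. */
    static BitFlipMutation of(double rate) {
        return new BitFlipMutation(new SplittableRandom(), rate);
    }

    /** Returns a mutated copy; the input array is left untouched. */
    boolean[] mutate(boolean[] genes) {
        boolean[] out = genes.clone();
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < rate) {
                out[i] = !out[i];
            }
        }
        return out;
    }
}
```

With this shape, an application that wants reproducible runs passes a
seeded generator, while one that does not care calls the one-argument
factory; neither has to go through "ThreadLocalRandomSource".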

Note that your above custom benchmark is likely to mean nothing
(please see e.g. "Commons RNG" on how to create JMH based
benchmarks).

>
> >> >2. Less/no flexibility (no user's choice of random source).
> >> -- Agreed.
> -- Do we really need this much flexibility here?

My main concern is that IMO the RNG is a prominent part of a GA
and it is not a good design to use "ThreadLocalRandomSource".

> >> >3. Error-prone (user can access/reuse the "UniformRandomProvider"
> >> >instances).
> >>
> >> >Again: "ThreadLocalRandomSource" is an ad-hoc workaround for correct but
> >> >"light" usage of random number generation in a multi-threaded
> >> >application; GAs make "heavy" use of RNG, thus it does not seem
> >> >outlandish that all the RNG "clients" (e.g. every "operator") create
> >> >their own instances.
> >
> >
> >> >IMHO, a more important discussion would be about the expectations in a
> >> >multithreaded context: E.g. should an operator be shareable by different
> >> >threads?  And if not, how does the API help application developers to
> >> >avoid such pitfalls?
> >> -- Once we implement multi-threading in GA, same crossover and mutation
> >> operators will be re-used across multiple threads.
>
> >I would be wary to go on that path; better consider making (deep) copies.
> >We can have multiple instances of an operator, all being configured in the
> >same way but being different instances with no risk of a multithreading
> >bug.
>
> -- I don't think this would be a good design choice just to support
> customization of RNG functionality. This will lead to too many instances of
> the same operators resulting in lots of unnecessary memory consumption. I
> think we might face memory issues for higher dimensional problems. As
> population size requirement also increases with increase of dimension this
> might lead to a major issue and need a thought.

How many is "too many instances"?
The memory used by an operator is tiny compared to a chromosome,
even more so compared to a population of chromosomes, or two
populations of them (parents and offspring).

>     So I think we have a design tradeoff here performance vs memory
> consumption. I am more worried about memory as that might restrict use of
> this library beyond a certain number of dimensions in some areas.

I'm referring to separate copies for each thread.
How many threads/virtual CPUs are common nowadays?

> However,
> creating deep copy would only be possible when we strictly restrict
> extension of operators which I want to avoid.

How to avoid deep copies in a multi-thread library?
Through synchronization?

>
> >> So even if we provide
> >> the customization at the operator level we cannot avoid sharing.
>
> >We can, and we should.
> >What we probably can't avoid sharing is the instance that represents the
> >population of chromosomes.
> *--* In a multi-threaded optimization the chromosome instances are shared
> in case the same chromosome is chosen for crossover by the selection
> process. I missed this point earlier.
> ...

Chromosomes can be shared (if they are read-only).
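
To illustrate what "read-only" could mean here, a minimal sketch of an
immutable chromosome. The class name and its methods are hypothetical
(they are not the PR's API): the array is copied on the way in, never
exposed, and every operation returns a new instance.

```java
import java.util.Arrays;

/** Hypothetical read-only chromosome: no setters, defensive copy on creation. */
final class BinaryChromosome {
    private final boolean[] genes;

    private BinaryChromosome(boolean[] genes) {
        this.genes = genes;
    }

    /** Copies the argument so that later changes by the caller have no effect. */
    static BinaryChromosome of(boolean... genes) {
        return new BinaryChromosome(genes.clone());
    }

    int length() {
        return genes.length;
    }

    boolean gene(int i) {
        return genes[i];
    }

    /** Single-point crossover producing a new instance; both parents stay unchanged. */
    BinaryChromosome crossover(BinaryChromosome other, int point) {
        if (other.genes.length != genes.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        boolean[] child = Arrays.copyOf(genes, genes.length);
        System.arraycopy(other.genes, point, child, point, genes.length - point);
        return new BinaryChromosome(child);
    }
}
```

Since no method modifies the internal array after construction, the
same parent can be selected for crossover by several threads at once
without synchronization.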

>
> >> >  Mine is against using "ThreadLocalRandomSource"...
> >> -- What is the wayout other than that. Please suggest.
>
> >I think I did.
> *--* The factory based approach would be useful only when we can have
> separate copies of operators for each set of operations.

If we don't have separate copies in each thread, then the operator
will not be multithreaded...
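
A minimal sketch of "separate copies in each thread", under the
following assumptions: "JitterMutation" and "PerThreadOperators" are
hypothetical names, "SplittableRandom" stands in for a Commons RNG
generator, and the work is split by striding over an array of genes.
Each task receives its own operator instance from a factory, so no
generator is ever touched by two threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/** Hypothetical stateful operator: it owns a generator, so it must not be shared. */
final class JitterMutation {
    private final SplittableRandom rng;

    JitterMutation(SplittableRandom rng) {
        this.rng = rng;
    }

    /** Shifts the gene by a uniform offset in [-0.5, 0.5). */
    double mutate(double gene) {
        return gene + rng.nextDouble() - 0.5;
    }
}

final class PerThreadOperators {
    private PerThreadOperators() {}

    /** Mutates all genes with one operator copy per worker task. */
    static double[] mutateAll(int workers, double[] genes, Supplier<JitterMutation> factory) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            double[] out = new double[genes.length];
            List<Future<?>> pending = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int first = w;
                // Built on the calling thread, then handed to exactly one task.
                final JitterMutation op = factory.get();
                pending.add(pool.submit(() -> {
                    for (int i = first; i < genes.length; i += workers) {
                        out[i] = op.mutate(genes[i]);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion; also makes the writes to "out" visible
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller could pass "() -> new JitterMutation(root.split())" as the
factory; the number of operator instances then equals the number of
workers, not the population size.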

> >Maybe it's time to create a dedicated branch for the GA functionality
> >so that we can try out the different approaches.
>
>
> >
> > >> I think first we need to decide on whether we really need this
> > >> customization and if yes then why. Then we can decide on alternate
> > >> implementation options.
> > >
> > >> >As per the recent updates of the math-related code bases, the
> > >> >public API should provide factory methods (constructors should
> > >> >be private).
> > >> -- private constructors will make public API classes non-extensible.
> > >> This will severely restrict the extensibility of this framework which I
> > >> want to avoid. I am not sure why we need to remove public constructors.
> > >> It would be helpful if you could refer me to any relevant discussion
> > >> thread.
> >
> > >  Allowing extensibility is a huge burden on library maintainers.  The
> > >  library must have been designed to support it; hence, you should
> > >  first describe what kind(s) of extensions (with usage examples) you
> > >  have in mind.
> > > --The library should be extensible to support customization. Users should
> > > be able to customise or provide their own implementation of genetic
> > > operators for crossover and mutation. The chromosome classes should also
> > > be open for extension.
>
> >I don't get why we should support extensions outside this library.
> *--* I think we should not block the extension.

This would run counter to many things that have been done
to improve the robustness and reduce the bug count of the Commons
Math code.

>
> >Initially we discussed about having a light-weight library, for easier
> >usage than alternative existing framework(s).
> *--* We can always think of making the framework lightweight but it should
> not cost extensibility.

There is no cost: We'll gladly merge every worthy extension into
the Commons component.

>
> >> E.g. any developer should be able to extend the
> >> IntegralChromosome class and define a child class which explicitly
> >> specifies the range of integers to be used.
>
> >It does not look like this would need an extension, only configuration
> >of the range.
> *-- *I agree. But the question is should we block the extension.

Please find a valid use case. ;-)
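
For the record, a minimal sketch of "configuration of the range"
without any subclass. The class "RangedIntChromosome" is hypothetical
(it is not the PR's "IntegralChromosome"), the upper bound is assumed
to be exclusive, and "SplittableRandom" again stands in for a Commons
RNG generator.

```java
import java.util.SplittableRandom;

/** Hypothetical integer chromosome whose allowed range is a constructor-time setting. */
final class RangedIntChromosome {
    private final int[] genes;
    private final int min;
    private final int max;

    private RangedIntChromosome(int[] genes, int min, int max) {
        this.genes = genes;
        this.min = min;
        this.max = max;
    }

    /** Creates a chromosome whose genes are drawn uniformly from [min, max). */
    static RangedIntChromosome random(int length, int min, int max, SplittableRandom rng) {
        if (min >= max) {
            throw new IllegalArgumentException("empty range: [" + min + ", " + max + ")");
        }
        int[] genes = new int[length];
        for (int i = 0; i < length; i++) {
            genes[i] = rng.nextInt(min, max);
        }
        return new RangedIntChromosome(genes, min, max);
    }

    int length() {
        return genes.length;
    }

    int gene(int i) {
        return genes[i];
    }

    int min() {
        return min;
    }

    int max() {
        return max;
    }
}
```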

>
> >> I have initially implemented
> >> the Binary chromosome and the corresponding binary mutation following the
> >> same pattern. However, restricting extension of concrete classes by
> >> private constructor does not prevent users from extending the abstract
> >> parent classes.
>
> >We should aim at coding the GA logic through (Java) interfaces, and not
> >expose the "abstract" classes.
> *-- *One of the primary reasons for me to contribute to Apache's GA library
> is its simplicity and extensibility.

"Extensibility" does not necessarily imply "inheritance"-based.
In fact, we do want to *avoid* it in order to more easily and more robustly
provide other advantages such as multi-threading.

> I would like to have a framework
> which should be always extensible for any problem domain with minor
> changes.

Any problem domain should indeed be amenable to be solved
by the library; I don't see how that should imply a design based
on inheritance.
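
A minimal sketch of extensibility through an interface rather than
through subclassing. The names "MutationPolicy" and "Mutations" are
used here for illustration only and are not claimed to match the PR's
types; the point is that user code plugs in by implementing (or
composing) a small contract, while the library's own classes stay
final.

```java
/** Hypothetical operator contract: users extend the library by implementing it. */
@FunctionalInterface
interface MutationPolicy<T> {
    /** Returns a mutated version of the argument; must not modify the argument. */
    T mutate(T chromosome);

    /** Composition instead of subclassing: apply this policy, then the next one. */
    default MutationPolicy<T> andThen(MutationPolicy<T> next) {
        return c -> next.mutate(mutate(c));
    }
}

/** Library-provided policies are reachable only through factory methods. */
final class Mutations {
    private Mutations() {}

    /** Returns a policy that swaps positions i and j in a copy of the input. */
    static MutationPolicy<int[]> swap(int i, int j) {
        return genes -> {
            int[] out = genes.clone();
            out[i] = genes[j];
            out[j] = genes[i];
            return out;
        };
    }
}
```

A domain-specific operator is then just another implementation, e.g.
a lambda, and it can be chained with a library one via
"Mutations.swap(0, 2).andThen(custom)"; no class of the library needs
a public constructor for that.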

> The primary reason behind this is that application domains of GA
> are too diverse. It is not possible to implement everything in a library.
> We don't know all possible domain areas too. If we remove the extensibility
> from the framework it would be useless in lots of areas.

When that occurs, people are welcome to contribute back if
something they need is missing.
Your argument of "too much diversity" can be reversed, in that
it is unlikely that one library would attract everyone that needs a
genetic algorithm.
Better make a design that can handle a fraction of use cases,
and grow as needed.

>
> >Extending the functionality, if necessary, should be contributed back here
> *-- *Sometimes the GA operators are very much specific to the domain and
> it's hard to generalise. In those scenarios contributing back to the
> library might not be possible.

In such a case, how likely is it that whatever general framework
this library has put in place will not be amenable to that domain's
specifics either?
There is always a scope within which design decisions must be taken.

If "multi-threading" is in the scope, then the design must avoid
inheritance (in public classes) in order to much more easily
ensure the correctness of applications.

> However, if a library cannot be extended for
> a new domain by users it becomes underutilised over time if not useless.

Sure, but that is a hypothetical for the long term.
However, if the library is buggy or slow, it will not be used at all.

Regards,
Gilles

>>> [...]

