Re: [Math] Review of "genetic algorithm" module

Gilles Sadowski Sun, 01 May 2022 15:19:31 -0700

Hello.

Le dim. 1 mai 2022 à 10:07, Avijit Basak <[email protected]> a écrit :
>
> Hi All
>
>          Please find my responses.
>
> >Currently; it prints everything on "stderr" (so that simple usage of
> >the UNIX shell's "pipe" syntax is not possible).
> -- This is the default behaviour of slf4j-simple.
> It is possible to provide specific system properties as VM args to redirect
> the log to desired location like stdout or file.
> Kindly look into the documentation provided below:
> https://www.slf4j.org/api/org/slf4j/impl/SimpleLogger.html


If we need to document software (e.g. the "examples") produced
by a "Commons" component, it is preferable that we use and refer
to "Log4j2".
[The logger API is another matter since it does not impact how the
software is used (from a user's POV).]

> >
> >> There was a discussion
> >> earlier on this and the decision was to use console for example modules
> >> log messages.
> >
> >I recall that we want a separation between the logger API (library
> >dependency) and logger implementation ("examples" dependency).
> >
> >Is it a feature of "slf4j-simple" to only print to "stderr" or can it
> >be configured?
> >There is no issue on the "examples" module to depend on a
> >full-featured logger (primary candidate would be "Log4j2").
> --Already mentioned earlier.

We should change SimpleLogger -> Log4j2 equivalent.

> >
> >> [...]
> >
> >Also, IMO "examples" codes should make a difference
> >between "logging" (optional information about internals)
> >and "output" (mandatory result).
> >For example, the TSP application result (itinerary and travel
> >distance) could be saved in a file whose name is given as
> >argument to a mandatory "--output" command-line option, as
> >in module "commons-math-examples/examples-sofm/tsp".
> --I have made the changes to accommodate the --output as input argument.

Thanks.

> >
> >> >
> >> >There is still a mix between library code and application code (but
> >> >this is to be discussed in MATH-1643.
> >> -- Kindly update the JIRA with your suggestions.
> >
> >Sure, we'll come to that.  But I think we should first converge to
> >a codebase that can be merged to "master" as a replacement of
> >the "o.a.c.math4.genetics" package (to be removed).
> >Do you confirm that all the unit tests in that package have an
> >equivalent in the new "commons-math-ga" module?
> -- I have accommodated mostly. But some are altered due to changes in
> design. But it is good to have a review of the same.

Doing manually will be too tedious.
If you are reasonably confident that you've ported everything
valuable, I guess that we'll rely on coverage tools...

> >
> >> >From browsing the library code, I'm tempted to believe that the
> >> >dependency towards a logging framework is not necessary (or
> >> >underused).
> >> -- Do you mean dependency on the slf4j?
> >
> >Yes.
> >
> >>   I think that such a feature could be left to the application
> >> >layer (per the "ConvergenceListener" registry).
> >> -- ConvergenceListener is only to log the nature of convergence,
> >> not internal details of all operators. Currently we have a very low
> >> number of log messages which can be increased later.
> >
> >"Increased later" for what purpose?
> >Logging can be very useful (e.g. for debugging); however once
> >this new implementation is trusted to behave correctly, I don't see
> >that additional logging statements inside the library will be very
> >useful; they would rather be needed by the application in order to
> >e.g. display the content of the current "Population".  The "data" (to
> >be eventually displayed) would be provided by the library through
> >regular accessors.
> >Another case in point is that the library cannot know what display
> >format is suitable for the application (e.g. a GUI).
> >
> >If we really need (not clear as of the current codebase) to provide
> >runtime access to other parts of the library (operators, etc.), why
> >not generalise the "observer" pattern?
> -- IMHO fine grain log statements are a better alternative than operator
> level observer to convey internal behavior of operators for the specific
> application.
> Use of an observer for each operator is a bit unnecessary.

The current log statements could also be construed as (totally)
unnecessary, as they merely document statements which I would
consider part of an "application", not the GA "library".
I believe that we do not agree yet on where to draw the line (my
proposal in MATH-1618 is related to that difference of opinion).

> >> >Likewise, the "PopulationStatisticsLogger" is not general enough to
> >> >be worth being part of the library.
> >> -- I would love to keep this simple implementation of convergence
> >> listener. This can provide a very basic overview on the nature of
> >> convergence.
> >
> >As a user, you'll be able to provide the feature to your application
> >with less than 10 lines of code (by registering a "listener").
> >Everyone else might want a slightly different output (contents and/or
> >format) than what this class generates.
> -- Users would always have the freedom to register their own listener.
> PopulationStatisticsLogger provides only a very basic option to track the
> population statistics during the convergence process.
> This is never a very generic solution to meet the needs of all users.

AFAICT, the "PopulationStatisticsLogger" class satisfies *your* needs
(as a user), but as developers we should only include functionality that
is broadly useful.  I think I provided arguments that it is not the case of
that class.

> >Yes.
> >Not everybody follows strict rules, and certainly not all "Commons"
> >components have the same rules (unfortunately), but since it was
> >decided that this module belongs in "Commons Math", I'd like to
> >minimize inconsistent coding styles (converging on what is used in
> >non-"legacy" source files).
> >
> >> This would
> >> initiate changes in almost all classes.
> >
> >Do you mean "all classes" in the GA module?
> >Then so be it. ;-)
> -- I shall start to work on this.

Great!

> >> [...]
> >> >In "AdaptiveGeneticAlgorithm":
> >> >* There should be a single constructor (same remark as above).
> >> -- Removed the constructor with default argument.
> >>
> >> >* Why the use of reflection ("isAssignableFrom")?
> >> -- Replaced it by instanceof.
> >
> >Marginally better ;-) it still does not say why the statistics is disabled
> >depending on the operator type...
> --Statistics is required only for the adaptive probability generation.
> In case application users choose to use a constant probability generation
> scheme there is no use of this computation.
> I shall document the same.

Maybe I'm still missing something, but IMHO the distinction between
adaptive vs not should not impact code at that level.

> >
> >(1)
> >o.a.c.m.ga.GeneticAlgorithmTestPermutations
> >(under "src/test")
> >
> >As per your comment in that class, it is a usage example.
> >Given that its name does not end with "Test", it is not run by the
> >test suite.  Please move it to the "examples" module.
> -- This was taken from legacy GA. There is another example of binary
> genotype called onemax. We already have examples in the "examples-ga"
> module.
> If we don't need these in test packages we can delete the same. But in that
> case there won't be any end-to-end test case execution scenario during
> build. Are we fine with that?

I don't know.  The underlying question is whether this class is executed
during the execution of the unit tests; if not, it doesn't make any sense
there.  If yes, it should be documented.
We should strive for maximum coverage (through unit tests) but "dummy"
unit tests that don't actually test anything (and whose only purpose is to
"mechanically" increase coverage) should be avoided IMO.

> >
> >(2)
> >I'm missing a high-level doc that would enable a newbie to figure
> >out what to implement in order to get going.
> >E.g. what is the interplay between
> > * genotype
> > * allele
> > * phenotype
> > * decoder
> > * fitness function
> >?
> >Several classes do not provide explanations (or links) about the
> >concept which they represent.  For example, there is no doc about
> >what a "RandomKeyDecoder" is, and the reason for using it (or not).
> -- I need to work on that.

That would be very useful; thanks.

> >
> >(3)
> >o.a.c.m.ga.utils.ChromosomeRepresentationUtils
> >
> >It seems to be a "mixed-bag" kind of class (that is being frowned
> >upon nowadays).
> >Its comment refers to "random" but some methods are not using
> >any randomization.  Most methods are only used in unit tests.
> >
> -- Actually most methods are taken from the legacy GA application. In the
> legacy library representation there was no separate concept for the decoder
> and all representation utility methods were kept in the corresponding
> Chromosome classes. ChromosomeRepresentationUtils is created as a
> placeholder for those utility methods.

I'm not sure I follow what you mean.
If these methods are only needed by the (legacy?) examples, they
should be moved to the "ga-examples" module (where they can be
removed later on without regards to BC).

> >(4)
> >o.a.c.m.ga.RandomProviderManager
> >
> >As already discussed, this class should not be part of the public
> >API, namely because the "getRandomProvider()" method returns
> >an object that is not thread-safe.
> >If used internally as "syntactic sugar", it should be located in a
> >package named "internal"; however I'd tend to remove it
> >altogether, and call "ThreadLocalRandomSource.current(...)"
> >explicitly.
> >
> -- The class o.a.c.m.ga.RandomProviderManager has been removed.
> As per our previous discussion I have implemented the customization of
> RandomSource at each operator level.

OK.  [Is there a new PR?]

> >(5)
> >Why does a "Chromosome" need an "identifier"?
> >Method "getId()" is only used in "PopulationStatisticalSummaryImpl"
> >that is an internal class, where it seems that the chromosome itself
> >(rather than its "id") could serve as the map's key.
> -- How can we generate a map's key from the chromosome itself? Please
> suggest.

A class to be used as a key only needs to implement "equals" and
"hashCode".

> >
> >(6)
> >o.a.c.m.ga.chromsome.AbstractChromosome
> >
> >Field "fitness" is not "final", yet it could be: a "FitnessFunction"
> >object (used in "evaluate() to compute that field) is passed to the
> >constructor.  Is there a reason for the "lazy" evaluation?
> >Dropping it would make the instance immutable (and "evaluate()"
> >should be renamed to "getFitness()").
> >
> >Why should the "FitnessFunction" be stored in every chromosome?
> >
> -- I have modified the fitness as final and initialized the same in the
> constructor.

Better, but did you check my proposal in MATH-1618, where
Chromosome and fitness are decoupled, and their relationship
is held within a "Population" instance?

> >(7)
> >Spurious "@since" tags: In the new code (in "commons-math-ga"
> >module), none should refer to a version < 4.0.
> >
> -- Some files are taken unchanged from the previous release. I have kept
> the same @since tag for those files.
> Do you need any change here?

The old and new files are in different packages; "@since" tags
thus make no sense IMO.

>
> >(8)
> >@SuppressWarnings("unchecked")
> >
> >By default, I'm a bit suspicious about having to resort to these
> annotations,
> >especially for the kind of algorithms we are trying to implement.
> >What do you think of the alternative approach outlined in the ZIP file
> >attached in MATH-1618:
> >    https://issues.apache.org/jira/browse/MATH-1618
> >?
> -- This annotation is required because we have kept an option to use
> different types of genotypes including primitive.
> Because of that our base interfaces only declares phenotype not genotype.
> This introduced a kind of hierarchy in all operators and chromosome classes
> which required us to use the mentioned annotation.

I may again be missing something.
Could you please explain the case that makes these annotations
necessary.

> Even with the proposed new architecture we may not be able to avoid the
> same.

The classes which I've added do not use the annotation...

> -- It will be good if you can share some more information about the newly
> proposed architecture. The areas of current design which it can improve as
> well as the underlying intention.

As noted in the comment on the JIRA page, the main intention is
maximal decoupling of functionalities that make up a GA (population,
fitness, selection, operator) and that seems achieved with the provided
classes.

> >
> >(9)
> >Naming of factory methods should be harmonized to match the convention
> >adopted in components like [RNG] and [Numbers].
> >E.g. instead of "newChromosome(...)", please use "of(...)" or "from(...)"
> >for "value object", and "create(...)" otherwise.
> >
> -- I have renamed the same for Chromosome classes.
> What about the nextGeneration() method of ListPopulation class. Renaming
> this to create() or from() won't convey the purpose of it.

I agree, and that's why the new "Population" class (in MATH-1618) does
not provide a factory method (see also the "GeneticAlgorithmFactory"
class).

> >(10)
> >o.a.c.m.ga.chromosome.AbstractListChromosome
> >
> >Constructor is called with an argument that is a previously instantiated
> >"representation".  If the latter is mutable, the caller will be able to
> modify
> >the underlying data structure of the newly created chromosome.  [The
> >doc assumes immutability of the representation but this cannot be
> >enforced, and mixed ownership can entail subtle bugs.]
> -- I think this applies to both representation as well as generic parameter
> type T. But I don't see any other option but to rely on the user.

The Javadoc (at line 84) is misleading in its mention of "immutable".

> If you have any suggestions kindly share.

I may not understand all the implications, but I'd suggest that the
"representation" be instantiated within the control of the library (e.g.
through a "builder"/"factory").

> >
> >(11)
> >Do we agree that, in a GA, the most time-consuming task is the fitness
> >computation?  Hence IMO, it should be the focus of the multithreading
> >tools (i.e. "ExecutorService"), probably keeping the other parts (namely
> >the genetic operators) within a simple sequential loop (as in class
> >"GeneticAlgorithmFactory" in MATH-1618).
> -- Current implementation uses separate threads for applying crossover and
> mutation operators for each pair of selected chromosomes.
> I think this ensures better utilization of multi-core processors compared
> to use of multi-threading only for the fitness calculation.

I have the opposite intuition: Parallel application of the genetic
operators would only provide marginal gains wrt the fitness
computation.
In any case, I think that it will be fairly easy to modify my proposed
"OffspringGenerator" class to use an "ExecutorService" (if benchmarks
show that a substantial gain could indeed be achieved).

> -- Some codes are checked in. But there is a conflict in the pull request.
> So I shall create a new one and delete the old branch itself.

IMHO, we could make more substantial progress if you could
first point to issues with my proposal in MATH-1618.

Thanks,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [Math] Review of "genetic algorithm" module

Reply via email to