Hello.

On Tue, 16 Sep 2014 18:34:52 -0400, Evan and Maureen Ward wrote:
Hi Gilles, Luc,

Thanks for all the comments. I'll try to respond to the more fundamental concerns in this email, and the more practical ones in another email, if we
decide that we want to include data editing in [math].


[...]


I don't see that the editing has to occur during the optimization.
It could be independent:
 1. Let the optimization run to completion with all the data
 2. Compute the residuals
 3a. If there are no outliers, stop
 3b. If there are outliers, remove them (or modify their weight)
 4. Run the optimization with the remaining data
 5. Goto 2

The advantage I see here is that the weight modification is a user
decision (and implementation).
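The steps above can be sketched in a dependency-free Java example. The straight-line model and the 3-sigma cutoff are illustrative choices of mine, not a proposal for the [math] API:

```java
import java.util.ArrayList;
import java.util.List;

public class EditAndRefit {

    // Ordinary least-squares fit of y = a*x + b; returns {a, b}.
    static double[] fit(List<double[]> pts) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        int n = pts.size();
        for (double[] p : pts) {
            sx += p[0]; sy += p[1]; sxx += p[0] * p[0]; sxy += p[0] * p[1];
        }
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - a * sx) / n;
        return new double[] { a, b };
    }

    // Steps 1-5: optimize to completion, compute residuals,
    // drop outliers, refit on the remaining data.
    static double[] editAndRefit(List<double[]> data, double cutoff) {
        while (true) {
            double[] ab = fit(data);              // 1./4. run on current data
            double ss = 0;
            for (double[] p : data) {             // 2. compute residuals
                double r = p[1] - (ab[0] * p[0] + ab[1]);
                ss += r * r;
            }
            double sigma = Math.sqrt(ss / data.size());
            if (sigma < 1e-9) return ab;          // perfect fit: nothing to edit
            List<double[]> kept = new ArrayList<>();
            for (double[] p : data) {             // 3. outlier test
                if (Math.abs(p[1] - (ab[0] * p[0] + ab[1])) <= cutoff * sigma) {
                    kept.add(p);
                }
            }
            if (kept.size() == data.size()) return ab; // 3a. no outliers: stop
            data = kept;                               // 3b./5. remove and redo
        }
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            data.add(new double[] { i, 2.0 * i + 1.0 });
        }
        data.add(new double[] { 10, 100.0 }); // one gross outlier
        double[] ab = editAndRefit(data, 3.0);
        System.out.println("a=" + ab[0] + " b=" + ab[1]);
    }
}
```

Here the first pass fits all 21 points, the outlier's residual exceeds the cutoff and is discarded, and the second pass recovers the underlying line.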

However, IIUC the examples from the link provided by Evan, "robust"
optimization (i.e. handling outliers during optimization) could lead
to a "better" solution.

Would it always be "better"? Not sure: significant points could be
mistaken as outliers and be discarded before the optimizer could
figure out the correct solution...
To avoid a "good but wrong" fit, I'm afraid that we'd have to
introduce several parameters (like "do not discard outliers before
iteration n", "do not discard more than m points", etc.)
The tuning of those parameters will probably not be obvious, and
they will surely complicate the code (e.g. input data like the
"target" and "weight" array won't be "final").


The advantage I see is not correctness (since the algorithm you outline will converge correctly), but reducing function evaluations. (I don't have
data to back up this assertion.) Without inline data editing the
optimization algorithm would "waste" the evaluations between when the
outliers became obvious, and when the optimization converges. With the "inline" scheme, the outliers are deleted as soon as possible, and the remaining evaluations are used to converge towards the correct solution.

Converging to a "good but wrong" fit will always be a risk of any algorithm that automatically throws away data. As with our other algorithms, I'm
expecting the user to know when the algorithm is a good fit for their
problem. The use case I see is when the observations contain mostly real
data, and a few random numbers. The bad observations can be hard to
identify a priori, but become obvious during the fitting process.


[...]

What works in one domain might not in another.
Thus, the feature should not alter the "standardness", nor decrease
the robustness of the implementation. Caring for special cases
(for which the feature is useful) may be achieved by e.g. using the
standard algorithm as a basic building block that is called repeatedly,
as hinted above (tuning the standard parameters and input data
appropriately for each call).


I was not expecting the response that [math] may not want this feature. I'm o.k. with this result since I can implement it as an addition to [math],
though the API won't be as clean.


IMHO, we cannot assume that fiddling with some of the data points while
the optimization progresses won't alter the correctness of the solution.

I think that when points are deemed "outliers", e.g. using external
knowledge not available to the optimizer, they should be removed, and
the optimization redone on the "real" data.
As I understand it, "robust" optimization is not really a good name (I'd
suggest "fuzzy", or something) because it will indeed assign less weight
to data points solely on the basis that they are less represented among
the currently available data, irrespective of whether they actually
pertain to the phenomenon being measured.
At first sight, this could let the optimizer drift farther and farther
away from the correct solution (if those points were not outliers).
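For concreteness, the residual-based down-weighting used by MATLAB's robust curve fitting (the second link in [2]) is, IIUC, Tukey's bisquare function: points are down-weighted smoothly as their residual grows and zeroed beyond a cutoff. A minimal sketch (the method and class names are mine; k = 4.685 is the conventional tuning constant, and the scale estimate is left to the caller):

```java
public class BisquareWeights {

    // Tukey's bisquare: w = (1 - u^2)^2 with u = r / (k * s) when |u| < 1,
    // and w = 0 otherwise.  s is a robust scale estimate (e.g. MAD-based)
    // and k = 4.685 is the usual tuning constant.
    static double[] bisquare(double[] residuals, double scale, double k) {
        double[] w = new double[residuals.length];
        for (int i = 0; i < residuals.length; i++) {
            double u = residuals[i] / (k * scale);
            double t = 1.0 - u * u;
            w[i] = Math.abs(u) < 1.0 ? t * t : 0.0;
        }
        return w;
    }

    public static void main(String[] args) {
        // Zero residual keeps full weight; a huge residual is zeroed out.
        double[] w = bisquare(new double[] { 0.0, 2.0, 100.0 }, 1.0, 4.685);
        System.out.println(w[0] + " " + w[1] + " " + w[2]);
    }
}
```

Note that the weights depend only on the residuals of the current iterate, which is exactly the behavior being questioned above: a point far from the current fit is down-weighted whether or not it is a genuine outlier.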



At first sight, I'd avoid modification of the sizes of input data
(option 2); from an API usage viewpoint, I imagine that user code will
require additional "length" tests.

Couldn't the problem you mention in option 1 disappear by having
different methods that return the a priori weights and the modified
weights?


As we are already much too stringent with our compatibility policy, I
would allow the case where only *very* advanced users would have
problems. So it would seem fair to me if we can make some changes where
the users of the factory are unaffected, and expert users who decided the
factory was not good for them have to put in new effort.


Before embarking on this, I would like to see examples where the "inline"
outlier rejection leads to a (correct) solution impossible to achieve
with the approach I've suggested.


As discussed above, I don't see any cases where "inline" outlier rejection will result in a less correct solution than the algorithm you outline.

I do see a potential case, in my current work. ;-)

I do
think we can save a significant number of function evaluations by using
"inline" outlier rejection.

We can of course talk about a performance-correctness trade-off, perhaps useful for cases where the risk is low (lots of data, known expected rate of
outliers).

IIUC the reference you provided, it seems that we only need a hook to
allow "outside" modification of the weights (?).
Could it be provided with an interface like the following:

public interface WeightValidator {
    /**
     * @param weights Current weights.
     * @param residuals Current residuals.
     * @return the adjusted weights.
     */
    RealVector validateWeights(RealVector weights,
                               RealVector residuals);
}

(similar to the suggestion for MATH-1144)?
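To make the hook concrete, here is a sketch of one possible implementation of such an interface. It is dependency-free (double[] stands in for RealVector so it runs without [math] on the classpath), the class name is mine, and the 3-sigma cutoff is an arbitrary illustrative choice:

```java
// Sketch of a weight-validating hook: zero the weight of any point whose
// residual exceeds three times the RMS residual of the current iterate.
public class ThresholdWeightValidator {

    static double[] validateWeights(double[] weights, double[] residuals) {
        double ss = 0;
        for (double r : residuals) {
            ss += r * r;
        }
        double sigma = Math.sqrt(ss / residuals.length);
        double[] adjusted = weights.clone();
        for (int i = 0; i < residuals.length; i++) {
            if (Math.abs(residuals[i]) > 3.0 * sigma) {
                adjusted[i] = 0.0; // "edit out" the suspected outlier
            }
        }
        return adjusted;
    }

    public static void main(String[] args) {
        double[] weights = { 1.0, 1.0, 1.0 };
        double[] residuals = { 0.1, 0.1, 0.1 };
        System.out.println(validateWeights(weights, residuals).length);
    }
}
```

The optimizer would call the hook once per iteration, after computing residuals and before the next step, which is the "inline" scheme Evan describes.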


Best regards,
Gilles


Best Regards,
Evan


[...]


[1] http://markmail.org/message/e53nago3swvu3t52
    https://issues.apache.org/jira/browse/MATH-1105
[2] http://www.mathworks.com/help/curvefit/removing-outliers.html
    http://www.mathworks.com/help/curvefit/least-squares-fitting.html


