Re: [math] speeding up percentile based statistics

Luc Maisonobe Sun, 26 Sep 2010 11:37:41 -0700

Le 26/09/2010 20:24, Phil Steitz a écrit :
> On 9/26/10 12:51 PM, Luc Maisonobe wrote:
>> Le 26/09/2010 18:21, Ted Dunning a écrit :
>>> I was about to write this suggestion down.
>>>
>>> Instead, I will simply concur.
>>>
>>> +1
>>>
>>> The safe version can do the checksum or make a copy.  Either is safe
>>> and the
>>> costs are similar (O(n) on every call and no memory or O(1) on every
>>> call
>>> after the first, O(n) memory).
>>
>> OK. Then I will have a copy of the data in the class and for each
>> evaluate method with a double[] values parameter there will also be a
>> method without this parameter that will run on the already provided
>> array. The implementations with the value parameter will have an
>> implementation roughly similar to:
>>
>> public double evaluate(final double[] values, final double p) {
>>    checkCache(values);
>>    return evaluate(p);
>> }
>>
>> private void checkCache(final double[] values) {
>>    final int hash = Arrays.hashcode(values);
>>    if ((values != originalValues) || (hash != originalHash)) {
>>       clearCache();
>>       originalValues = values;
>>       originalHash   = hash;
>>       copiedValues   = values.clone();
>>    }
>> }
>>
>> Starting to work on it ...
> 
> -0 on the approach.  Be careful with
> 
> 1) array offset arguments.  The contract of evaluate(p) will have to be
> evaluation on the same segment of the same array used in the last call.


I didn't see this, thanks for the hint.

> 2) For DescriptiveStatistics, the cache will have to be cleared when
> values are added between calls, since getPercentile(p) is already
> defined for that class and needs to return the value based on the
> current dataset.

Yes.

> 
> The -0 as opposed to +1 is because
> 
> a) handling offsets is going to make checkCache more complicated

Just two more values to store. What would be copied is the array slice.
This mean keeping the same array but changing the offsets would
invalidate cache.

> b) while the probability of collision may be very low in practice, there
> is no guarantee that hash collisions won't occur (in fact, it is certain
> that they *can* occur) and the result could be *nasty*, especially given
> that we are not telling the user we are doing it.

Thinking about it again, we could replace hash by a complete equality
test, but this would need to keep two copies of the array (one in
original order and another one in reorganized order). It would even be
faster as not equal arrays are detected on the first different element.
The cost is memory consumption.

> c) users that pass copies of arrays on subsequent calls containing the
> same values will not get the benefit of the caching.

With an equality test, this problem would be solved too.

> 
> It may be better to just add setData(double[]), setData(double[], start,
> end) and evaluate(p) (and evaluate()) methods and define the contract of
> evaluate(p) to refer to the data added in the setData methods. We could
> talk about extending AbstractUnivariateStatistic similarly, adding
> setData and evaluate methods similarly - possibly evaluate(start, end)
> as well.

I think it would be worth having these methods pushed up in the class
hierarchy. However, the top level interface could be changed only in 3.0.

Luc

> 
> Phil
>>
>> Luc
>>
>>>
>>> I would tend toward the copy option.
>>>
>>> On Sun, Sep 26, 2010 at 8:39 AM, Mikkel Meyer Andersen<m...@mikl.dk> 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Why not simply make 3) the default, but also supply a
>>>> *IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we
>>>> might as well do it instead of not doing any check at all). This means
>>>> that users who read the documentation can gain something when they
>>>> explicitly ask for it, while users who don't read the docs are still
>>>> safe.
>>>>
>>>> Cheers, Mikkel.
>>>>
>>>> 2010/9/26 Luc Maisonobe<luc.maison...@free.fr>:
>>>>> Le 23/09/2010 20:57, Luc Maisonobe a écrit :
>>>>>> Hi,
>>>>>>
>>>>>> Resuming this thread, as I will be soon able to work on this (I
>>>>>> hope).
>>>>>> I propose to start by solving
>>>>>> <https://issues.apache.org/jira/browse/MATH-417>  using a simple
>>>>>> partition-based selection algorithm and preserving pivot information.
>>>>>> This would help both the single call use-case (we don't sort
>>>>>> everything)
>>>>>> and the multiple call case (we don't redo on subsequent calls work
>>>>>> already done in the first calls.
>>>>>
>>>>> The current API of the Percentile method is derived from
>>>>> AbstractUnivariateStatistic and provides several evaluate methods that
>>>>> have the data array as a parameter.
>>>>>
>>>>> This is compatible with the one-call use case without any semantic
>>>>> change, but does not fit the multiple-calls use case seamlessly.
>>>>>
>>>>> If we want to cache some work already done on the array between calls
>>>>> with different values of p (the percentile value), we have to check if
>>>>> the array is still the same as the cached one on which we have done
>>>>> some
>>>>> partitioning and saved some pivots. This can be done by storing a
>>>>> reference to the original array in addition to a copy of its content
>>>>> (which we reorganize) and with an "if (values == originalValues)"
>>>>> statement. This would however not protect against changes in the array
>>>>> elements themselves done by the user between calls. A better
>>>>> protection
>>>>> could be to compute a hash code of the array content and compare it to
>>>>> the original one at each call. However, this would add a linear cost.
>>>>> The user who encountered the performances problems added an addValues
>>>>> method to set up the array once and simply ignored the values
>>>>> passed as
>>>>> an argument to the evaluate methods. This breaks the semantics of the
>>>> API.
>>>>>
>>>>> So I see four choices about partitioning caching and would like to
>>>>> have
>>>>> people express which one they prefer. A clearCache or similar public
>>>>> method would be made available to users to ensure clean computation if
>>>>> the caching mechanism gets on their way.
>>>>>
>>>>>   1) do nothing to check the array is the same between calls and
>>>>> blindly
>>>>>     assumes it IS the same. Users would really need to call clearCache
>>>>>     when they provide a new array
>>>>>     pros: very simple
>>>>>     cons: error-prone for the user as it relies only on reading and
>>>>>     understanding a documentation that would change with new version
>>>>>
>>>>>   2) check only array reference equality (with ==) to check the array
>>>>>     is the same as the previous call or not. Users would need to call
>>>>>     clearCache only if they use the same array but have changed its
>>>>>     content.
>>>>>     pros: trade-off between efficiency and reliability,
>>>>>           handles most cases properly
>>>>>     cons: may be wrong in corner cases
>>>>>
>>>>>   3) check array content using an hash code. Users would generally
>>>>> don't
>>>>>     need to call clearCache at all so it would not be provided
>>>>>     pros: works in all cases
>>>>>     cons: add linear cost for array checking
>>>>>
>>>>>   4) remove the double[] values parameter from the API and use a
>>>>> separate
>>>>>     addValues method, hence the class will reset its cache only when
>>>>>     addValues is called
>>>>>     pros: works in all cases
>>>>>     cons: Percentile could not implement UnivariateStatistic anymore
>>>>>
>>>>> My preference is choice 2.
>>>>>
>>>>> What do you think ?
>>>>>
>>>>> Luc
>>>>>
>>>>>>
>>>>>> Partition-based algorithms are O(n) in expected time but not in worst
>>>>>> case time. If end users ask for it later, we could improve again by
>>>>>> adding the Musser's introselect that guarantees O(n) behavior even in
>>>>>> worst case.
>>>>>>
>>>>>> Another possibility would be to have an additional method that would
>>>>>> compute percentiles for several thresholds at once, but this would
>>>>>> require returning an array of results and I think it would be
>>>>>> difficult
>>>>>> to mix with other statistics (mainly the DescriptiveStatistics.apply
>>>>>> method, which is used by the users I talked to).
>>>>>>
>>>>>> Would this meet everyone concerns ?
>>>>>> Luc
>>>>>>
>>>>>>
>>>>>> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
>>>>>>> Ted,
>>>>>>>
>>>>>>>> O(n), not n
>>>>>>>>
>>>>>>>> Expected case is n + n/2 + n/4 ...<  2n
>>>>>>>
>>>>>>> Yes, that I am already more incline to buy.
>>>>>>>
>>>>>>> Dim.
>>>>>>>
>>>> ----------------------------------------------------------------------------
>>>>
>>>>>>>
>>>>>>> Dimitri Pourbaix                         *
>>>>>>> Institut d'Astronomie et d'Astrophysique *      Don't worry, be
>>>>>>> happy
>>>>>>> CP 226, office 2.N4.211, building NO     *         and CARPE DIEM.
>>>>>>> Universite Libre de Bruxelles            *
>>>>>>> Boulevard du Triomphe                    *      Tel :
>>>>>>> +32-2-650.35.71
>>>>>>>   B-1050 Bruxelles                        *      Fax :
>>>>>>> +32-2-650.42.26
>>>>>>> http://sb9.astro.ulb.ac.be/~pourbaix     * mailto:
>>>> pourb...@astro.ulb.ac.be
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>>
>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>>
>>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] speeding up percentile based statistics

Reply via email to