Hello,

"Dr. Arne Babenhauserheide" <[email protected]> writes:

> Tomas Volf <[email protected]> writes:
>
>> the performance increase looks nice, but see few comment below please.
>>
>> "Dr. Arne Babenhauserheide" <[email protected]> writes:
>>> While optimizing an implementation of levenshtein distance, I found that
>>> (min ...) took up a significant part of the profile (16%), so I tried
>>> replacing it with a simple implementation in Scheme. That worked out
>>> surprisingly well (shaved off about 25% of the total runtime) and when I
>>> benchmarked it directly it turned out to be between 1.5 and 3 times
>>> faster than the min in numbers.c.
>>
>> There are few difference in behavior
>
> Thank you for catching them!
>
> Those look like we should fix them. But we should base the fixes on the
> more efficient version by David, not on my initial one.

I believe it has the same problems due to fairly similar design, so any
fixes should be transferable.  I have picked your version for testing
and benchmarking, since the version from David did not finish running in
a reasonable time, and I had to ^C it.

>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (min -0.0 0.0)
>> $8 = -0.0
>> scheme@(guile-user)> (mins -0.0 0.0)
>> $9 = 0.0
>
> This means that (< -0.0 0.0) ⇒ #f
>
> What I also see: (negative? -0.0) ⇒ #f
>
> Something is odd there. That -0.0 is neither larger nor smaller than
> +0.0 may be correct, but shouldn’t it be negative?

I am not sure, but 0.0 is not positive, so -0.0 not being negative does
have some logic.  FWIW, Chez scheme does behave the same:

--8<---------------cut here---------------start------------->8---
> (negative? -0.0)
#f
> (positive? 0.0)
#f
--8<---------------cut here---------------end--------------->8---

>
>> scheme@(guile-user)> (min (nan) 1)
>> $15 = +nan.0
>> scheme@(guile-user)> (mins (nan) 1)
>> $16 = 1
>> --8<---------------cut here---------------end--------------->8---
>>
>> I would also expect `min' to not care about the order of arguments.  The
>> original implementation does not, but yours does
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (min (nan) 1)
>> $21 = +nan.0
>> scheme@(guile-user)> (min 1 (nan))
>> $22 = +nan.0
>>
>> scheme@(guile-user)> (mins (nan) 1)
>> $19 = 1
>> scheme@(guile-user)> (mins 1 (nan))
>> $20 = +nan.0
>> --8<---------------cut here---------------end--------------->8---
>
> NaN is a special case; it’s neither smaller nor larger, but any
> computation with NaN must be NaN.
>
> I’m not sure however how min should deal with NaN. It should obviously
> be order-independent, but returning NaN for min seems strange when
> there’s a non-min number.
>
> I lack the background information about numbers to judge what’s correct
> here. Can you judge that?

Returning NaN is consistent with Chez, but I am not qualified to say
whether it is correct.  I just did basically white-box testing of both
yours and the built-in version and took note of the differences. ^_^

>
> It seems like we’ll need a full set of tests for min and max before we
> can replace them.
>
> I found an article about min and max in C++:
> https://pixorblog.wordpress.com/2016/06/27/some-remarks-about-minmax-functions/
>>  The IEEE 754 says:
>>     In section 6.2 of the revised IEEE 754-2008 standard there are two
>> anomalous functions (the maxnum and minnum functions that return the maximum
>> of two operands that are expected to be numbers) that favor numbers — if just
>> one of the operands is a NaN then the value of the other operand is returned.
>
> So min and max should never return NaN if the arguments contain any
> other value.

I believe this simplifies it a bit too much.  The 2019 version (well,
the draft, I am not buying the final document, and consider the very
need to *buy* a standard to be absurd) specifies both `minimum' and
`minimumNumber' with different behavior regarding NaNs (section 9.6).  I
do not know whether r7rs prescribes which of those shall be used, but
probably not.

>
> But from r7rs:
>
>     An arithmetic operation where one operand is NaN
>     returns NaN, unless the implementation can prove that the
>     result would be the same if the NaN were replaced by any
>     rational number.
>
> Are min and max arithmetic operations?

¯\_(ツ)_/¯

> Our numbers.c implementation treats it as such.
>
>> One last observation is that your version does not preserve exactness of
>> the arguments, which seems to be a requirement in both R6RS (11.7.4,
>> Arithmetic operations) and R7RS (6.2.2)
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (min +inf.0 1)
>> $29 = 1.0
>> scheme@(guile-user)> (mins +inf.0 1)
>> $30 = 1
>> --8<---------------cut here---------------end--------------->8---
>
> From r7rs:
>
>     Note: If any argument is inexact, then the result will also be
>     inexact (unless the procedure can prove that the inaccuracy is not
>     large enough to affect the result, which is possible only in unusual
>     implementations). If min or max is used to compare numbers of mixed
>     exactness, and the numerical value of the result cannot be
>     represented as an inexact number without loss of accuracy, then the
>     procedure may report a violation of an implementation restriction.
>
> Here’s a case where this causes my implementation to be incorrect:
>
> (mins 1/7 (exact->inexact 1/7))
> $27 = 0.14285714285714285
> (mins 1/13 (exact->inexact 1/13))
> $32 = 1/13
>
> Also the r7rs says about < and >:
>
>     Note: While it is not an error to compare inexact numbers using
>     these predicates, the results are unreliable because a small
>     inaccuracy can affect the result; this is especially true of = and
>     zero?. When in doubt, consult a numerical analyst.
>
>
> I have not found a fix to this, yet. Doing an exactness check inside the
> loop may be quite expensive.

Presumably this is the reason the built-in version is perceived as
"slow". :) I am unsure how to solve this.

>
> This is the optimized version by David in hoot:
>
>   (import (ice-9 match))
>   (define-syntax-rule (define-minmax-expansion f %f compare)
>     (begin
>       (define %f
>         (case-lambda
>           ((x) x)
>           ((x y) (if (compare x y) x y))
>           ((x . rest)
>            (let lp ((m x)
>                     (rest rest))
>              (match rest
>                (() m)
>                ((x . rest)
>                 (if (compare x m)
>                     (lp x rest)
>                     (lp m rest))))))))
>       (define-syntax f
>         (lambda (stx)
>           (syntax-case stx ()
>             (f (identifier? #'f) #'%f)
>             ((_ x) #'x)
>             ((_ x y)
>              #'(let ((x* x)
>                      (y* y))
>                  (if (compare x* y*) x* y*)))
>             ((_ x . y)
>              #'(f x (f . y))))))))
>
>   (define-minmax-expansion min %min <)
>   (define-minmax-expansion max %max >)
>
>
>> As for the performance, while your version is a lot faster for integers
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (define x (iota #_1_000_000))
>
> What is #_? Is that a specialized prefix (it does not work in my Guile)?
> Do I need to import something?

Ah, my bad, it is an extension[0] from my library, it helps to make
large numbers more readable.  It is sourced automatically on REPL start
and I did not realize I have used it.  You can just drop all of the _,
including the #_ prefix, and it will work.  Sorry about that.

0: https://git.wolfsden.cz/guile-wolfsden/tree/wolfsden/x/underscore.scm

>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply min x) (loop (1- n)))))
>> ;; 2.208355s real time, 2.208266s run time.  0.000000s spent in GC.
>>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply mins x) (loop (1- n)))))
>> ;; 1.044301s real time, 1.044260s run time.  0.000000s spent in GC.
>> --8<---------------cut here---------------end--------------->8---
>>
>> it is at the cost of performance regressions elsewhere
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (define x (map (cut / <> 3) (iota #_1_000_000)))
>>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply min x) (loop (1- n)))))
>> ;; 2.720850s real time, 2.720736s run time.  0.000000s spent in GC.
>>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply mins x) (loop (1- n)))))
>> ;; 3.476514s real time, 3.476266s run time.  0.000000s spent in GC.
>>
>> scheme@(guile-user)> (define x (map (cut + <> 0.1) (iota #_1_000_000)))
>>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply min x) (loop (1- n)))))
>> ;; 2.144861s real time, 2.144699s run time.  0.000000s spent in GC.
>>
>> scheme@(guile-user)> ,t (let loop ((n 100)) (if (zero? n) (when #f #f) 
>> (begin (apply mins x) (loop (1- n)))))
>> ;; 2.719851s real time, 2.719682s run time.  0.000000s spent in GC.
>> --8<---------------cut here---------------end--------------->8---
>>
>> Please correct me if I have messed up somewhere in the above text.
>>
>> Tomas
>
> Looking at the benchmark (with the nicer version of the apply-part by
> David), I don’t see such a clear picture:
>
> The data:
>
> (define L1 (map (λ (x) (+ 0.1 x)) (iota 1)))
> (define L2 (map (λ (x) (+ 0.1 x)) (iota 2)))
> (define L3 (map (λ (x) (+ 0.1 x)) (iota 3)))
> (define L4 (map (λ (x) (+ 0.1 x)) (iota 6)))
> (define L5 (map (λ (x) (+ 0.1 x)) (iota 100)))
> (define L6 (map (λ (x) (+ 0.1 x)) (iota 1000)))
> (define L7 (map (λ (x) (+ 0.1 x)) (iota 1000000)))
>
> The code:
>
> (define min-match
>   (case-lambda
>     ((x) x)
>     ((x y) (if (< x y) x y))
>     ((x . rest)
>      (let lp ((m x)
>               (rest rest))
>        (match rest
>          (() m)
>          ((x . rest)
>           (if (< x m)
>               (lp x rest)
>               (lp m rest))))))))
>
> The results:
>
> ("arithmetic.bm: fixnum: min-1" 1000000 real 0.013668326 real/iteration 
> 1.3668326e-8 run/iteration 1.3577348e-8 core/iteration 1.0069007699999998e-8 
> gc 0.0)
> ("arithmetic.bm: fixnum: min-match-1" 1000000 real 0.011865027 real/iteration 
> 1.1865027e-8 run/iteration 1.1769995e-8 core/iteration 8.261654699999999e-9 
> gc 0.0)
>
> ("arithmetic.bm: fixnum: min-2" 100000 real 0.002098059 real/iteration 
> 2.098059e-8 run/iteration 2.092736e-8 core/iteration 1.74190197e-8 gc 0.0)
> ("arithmetic.bm: fixnum: min-match-2" 100000 real 0.002045983 real/iteration 
> 2.045983e-8 run/iteration 2.044696e-8 core/iteration 1.69386197e-8 gc 0.0)
>
> ("arithmetic.bm: fixnum: min-3" 100000 real 0.003547318 real/iteration 
> 3.547318e-8 run/iteration 3.546066e-8 core/iteration 3.19523197e-8 gc 0.0)
> ("arithmetic.bm: fixnum: min-match-3" 100000 real 0.004556954 real/iteration 
> 4.556954e-8 run/iteration 4.407558e-8 core/iteration 4.0567239699999994e-8 gc 
> 0.0)
>
> ("arithmetic.bm: fixnum: min-4" 10000 real 5.55016e-4 real/iteration 
> 5.55016e-8 run/iteration 5.53546e-8 core/iteration 5.18462597e-8 gc 0.0)
> ("arithmetic.bm: fixnum: min-match-4" 10000 real 0.001111687 real/iteration 
> 1.111687e-7 run/iteration 1.110488e-7 core/iteration 1.0754045970000001e-7 gc 
> 0.0)
>
> ^ this shows a factor 2 regression for apply on lists with 6 elements.
>
> ("arithmetic.bm: fixnum: min-5" 100000 real 0.176713492 real/iteration 
> 1.76713492e-6 run/iteration 2.39227622e-6 core/iteration 2.3887678797e-6 gc 
> 0.120515234)
> ("arithmetic.bm: fixnum: min-match-5" 100000 real 0.204362918 real/iteration 
> 2.0436291799999998e-6 run/iteration 2.69216163e-6 core/iteration 
> 2.6886532897000004e-6 gc 0.124272761)
>
>
> ^ but between 100 and one million elements, the regression is down to 10%.
>
> ("arithmetic.bm: fixnum: min-6" 10000 real 0.170988928 real/iteration 
> 1.70988928e-5 run/iteration 2.42823128e-5 core/iteration 2.42788044597e-5 gc 
> 0.128306471)
> ("arithmetic.bm: fixnum: min-match-6" 10000 real 0.20625809 real/iteration 
> 2.0625809e-5 run/iteration 2.73242117e-5 core/iteration 2.73207033597e-5 gc 
> 0.124695717)
>
> ("arithmetic.bm: fixnum: min-7" 10 real 0.217954551 real/iteration 
> 0.021795455100000003 run/iteration 0.030831519 core/iteration 
> 0.0308315154916597 gc 0.204189321)
> ("arithmetic.bm: fixnum: min-match-7" 10 real 0.238831113 real/iteration 
> 0.0238831113 run/iteration 0.030176484899999998 core/iteration 
> 0.0301764813916597 gc 0.112002681)
>
>
> What’s not tested here is min with a fixed number of arguments. Using as
> mins the definition of David in hoot (and given above):
>
> scheme@(guile-user) [2]> ,t (let loop ((n 100000000)) (if (zero? n) (when #f 
> #f) (begin (mins 1 2 3 4 5 6 7) (loop (1- n)))))
> ;; 0.086388s real time, 0.085594s run time.  0.000000s spent in GC.
> scheme@(guile-user) [2]> ,t (let loop ((n 100000000)) (if (zero? n) (when #f 
> #f) (begin (min 1 2 3 4 5 6 7) (loop (1- n)))))
> ;; 10.321335s real time, 13.731729s run time.  6.150539s spent in GC.

In all the above I see large variances in GC time, so I wonder whether
that had some effect on the result.  In my tests I have always executed
,gc first, so you can see the GC is always 0 (min/max should not really
need to trigger GC anyway).  I am unsure how large the effect of that is
though, possibly not large.

> Another advantage of a native Scheme implementation is that we avoid the
> round-trip through C, enabling the JIT compiler to optimize more.

True.

Have a nice day,
Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Reply via email to