Re: [racket-users] Re: The performance of ‘set’ vs. lists+‘remove-duplicates’

'Paulo Matos' via Racket Users Thu, 06 Dec 2018 13:52:26 -0800


For those interested, I have prepared a benchmark using the benchmark
lib. Kudos to Josh and Vincent for it. I use it _all_ the time and have
a dozen or so benchmark files to track some racket operations as racket
evolves - so I know how it will affect my software.

If you don't care about code and graphs here's the summary: Tony hit it
in the nail when he mentioned that using hashes with #t value is fast.
For small sets (tested up to 4096 elements) wins _all_ the time against
racket/set or racket/list+remove-duplicates - for all the operations
except copy and ->list. Copying hashes is expensive and the ->list
conversion is obviously fastest when you already have a list to begin
with. What's slightly depressing is that the actual set implementation
is always the slowest. So, if you want performance and your program
doesn't do too many copies, you might want to try using hashes with #t
values.

In this case I have created a simple interface for sets using lists and
hashes that look like this:
set-as-list.rkt:

#lang racket/base

(require racket/list)

(provide set set-member? set-add set-remove set-count set-union set-copy
set->list)
;; Providing a set-like interface using lists

(define (set . vs)
  (remove-duplicates vs))

(define (set-member? s x)
  (member x s))

(define (set-add s x)
  (if (set-member? s x)
      s
      (cons x s)))

(define (set-remove s x)
  (remove x s))

(define (set-count s)
  (length s))

(define (set-union s1 s2)
  (remove-duplicates (append s1 s2)))

(define (set-copy s)
  (for/list ([v (in-list s)]) v))

(define (set->list s)
  s)

set-as-hash.rkt:

#lang racket/base

(provide set set-member? set-add set-remove set-count set-union set-copy
set->list)
;; Providing a set-like interface using hashes

(define (set . vs)
  (for/hash ([v (in-list vs)])
    (values v #true)))

(define (set-member? s x)
  (hash-ref s x #false))

(define (set-add s x)
  (hash-set s x #true))

(define (set-remove s x)
  (hash-remove s x))

(define (set-count s)
  (hash-count s))

(define (set-union s1 s2)
  (for/fold ([h s1])
            ([v (in-hash-keys s2)])
    (set-add h v)))

(define (set-copy h)
  (hash-copy h))

(define (set->list h)
  (hash-keys h))

small-sets-vs-hashes-vs-lists.rkt:

#lang racket

(require (prefix-in list: "set-as-list.rkt")
         (prefix-in hash: "set-as-hash.rkt"))

;; Benchmarking
; 1. sets
; 2. lists
; 3. hasheq with #t value

; For each structure we will benchmark:
; 1. adding an element
; 2. remove element
; 3. map elements
; 4. count elements
; 5. check membership
; 6. copy

(require benchmark plot/pict racket/match racket/vector racket/list)

; define sizes to test
(define sizes '(8 16 32 64 128 256 512 1024 2048 4096))

; define structures for each size
(define vs-list (map (lambda (sz) (build-list sz (thunk* (random (/ sz
2))))) sizes))
(define vs-single (flatten vs-list))
(define sets  (map (lambda (l) (apply set      l)) vs-list))
(define setls (map (lambda (l) (apply list:set l)) vs-list))
(define seths (map (lambda (l) (apply hash:set l)) vs-list))

; operation table
(define table
  (hasheq
   'make    (hasheq 'set set         'list list:set         'hash hash:set)
   'add     (hasheq 'set set-add     'list list:set-add     'hash
hash:set-add)
   'member? (hasheq 'set set-member? 'list list:set-member? 'hash
hash:set-member?)
   'remove  (hasheq 'set set-remove  'list list:set-remove  'hash
hash:set-remove)
   'count   (hasheq 'set set-count   'list list:set-count   'hash
hash:set-count)
   'copy    (hasheq 'set set-copy    'list list:set-copy    'hash
hash:set-copy)
   'to-list (hasheq 'set set->list   'list list:set->list   'hash
hash:set->list)
   'union   (hasheq 'set set-union   'list list:set-union   'hash
hash:set-union)))

(define results
  (run-benchmarks
   ; operations (whats)
   (list 'add 'member? 'remove) 'count 'copy 'to-list)

   ; list of options (hows)
   (list
    ; sizes (and their indices) in the sizes list
    (map cons (range (length sizes)) sizes)
    ; implementations of operations
    (list 'set 'list 'hash))

   ; to run each benchmark
   (lambda (op index/size impl)
     (define fn (hash-ref (hash-ref table op) impl))

     (case op
       ; function receives a list
       [(make) (apply fn (list-ref vs-list (car index/size)))]
       ; function receives a set
       [(to-list copy count) (fn (list-ref (match impl
                                             ['set sets]
                                             ['list setls]
                                             ['hash seths])
                                           (car index/size)))]
       ; function receives two sets
       [(union) (fn (list-ref (match impl
                                ['set sets]
                                ['list setls]
                                ['hash seths])
                              (car index/size))
                    (list-ref (match impl
                                ['set sets]
                                ['list setls]
                                ['hash seths])
                              (random (length index/size))))]
       ; function receives a set and a value
       [(add member? remove) (fn (list-ref (match impl
                                             ['set sets]
                                             ['list setls]
                                             ['hash seths])
                                           (car index/size))
                                 (list-ref vs-single (random (length
vs-single))))]))

   ; don't extract time, instead time (run ...)
   #:extract-time 'delta-time
   #:num-trials 100000))

(parameterize ([plot-x-ticks no-ticks])
  (plot-pict
   #:title "small sets benchmark"
   #:width 3840
   #:height 2160
   #:x-label #f
   #:y-label "normalized time"
   (render-benchmark-alts
    ; default options
    (list (cons 0 8) 'list)
    results
    ; format options so we can omit the index in the size list
    #:format-opts (lambda (opts)
                    (let ([index/size (car opts)]
                          [impl (cadr opts)])
                      (format "size: ~a, ~a" (cdr index/size) impl))))))

You might have to play with width/height of plot for your monitor and
then play with the sizes since visualizing all the sizes at once is a
pain. Also, run it twice, first with copy and to-list operations and
then the remainder - otherwise you won't be able to make out much in the
plot.

I am keen to get feedback on generating a nicer, most understandable
plot - I always wished I was able to generate the beautiful
visualizations we tend to see in the Web2.0 era but never had the
artistic eye or the inclination to learn how to do it.

More importantly though, I am keen to hear on the issues with regards to
the results. So, is set really that much worse because of the generic
contracts? Also, is there a way to speed up the copying of hash tables?

Paulo Matos

On 06/12/2018 17:08, 'Paulo Matos' via Racket Users wrote:
> 
> 
> On 06/12/2018 16:39, Vincent St-Amour wrote:
>> On Thu, 06 Dec 2018 07:05:03 -0600,
>> 'Paulo Matos' via Racket Users wrote:
>>>
>>>
>>>
>>> On 05/12/2018 11:55, Tony Garnock-Jones wrote:
>>>> I suspect it will be slow because sets are generics, and generics are
>>>> slow. 
>>>
>>> I am curious now. How slow? Why? Do you have any data backing this up?
>>> Generics are very useful, I would be very disappointed if they are
>>> indeed very slow.
>>
>> For what it's worth, I seem to recall that generics, while they do
>> introduce indirection, and not particularly costly on their own.
>>
> 
> ah... that's more reassuring. Thanks.
> 
>> Contracts for generics, though, which the set library is using, are
>> really costly.
>>
> 
> Good! That's understandable.
> 
> I am preparing a battery of benchmarks on this motivated by this thread
> which I hope to share soon.
> 
>> Vincent
>>
> 

-- 
Paulo Matos

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [racket-users] Re: The performance of ‘set’ vs. lists+‘remove-duplicates’

Reply via email to