[Numpy-discussion] Re: Curious performance different with np.unique on arrays of characters

Charles R Harris Thu, 14 Sep 2023 11:12:52 -0700

On Thu, Sep 14, 2023 at 11:34 AM Devulapalli, Raghuveer <
raghuveer.devulapa...@intel.com> wrote:


> What processor you are running this on? np.sort uses AVX-512 accelerated
> sorting for np.int32, so just wondering if you that is the reason for this
> difference.
>
> Raghuveer


We also have radix sort for stable sorting of int8, int16, which should be
quite fast.

Hmm, I wonder if radix sort could be vectorized?

When we dropped Python 2.7, there were some folks who ended up using a
uint8 array subtype for storing data. All they needed to add was automatic
translation to strings for
certain accesses. This gave them a 4x advantage in storage space.



>
>
> > -----Original Message-----
> > From: sal...@caltech.edu <sal...@caltech.edu>
> > Sent: Wednesday, September 13, 2023 6:14 PM
> > To: numpy-discussion@python.org
> > Subject: [Numpy-discussion] Curious performance different with np.unique
> on
> > arrays of characters
> >
> > Hello -
> >
> > In the course of some genomics simulations, I seem to have come across a
> curious
> > (to me at least) performance difference in np.unique that I wanted to
> share. (If
> > this is not the right forum for this, please let me know!)
> >
> > With a np.array of characters (U1), np.unique seems to be much faster
> when
> > doing np.view as int -> np.unique -> np.view as U1 for arrays of decent
> size. I
> > would not have expected this since np.unique knows what's coming in as
> S1 and
> > could handle the view-stuff internally. I've played with this a number
> of ways (e.g.
> > S1 vs U1; int32 vs int64; return_counts = True vs False; 100, 1000, or
> 10k
> > elements) and seem to notice the same pattern. A short illustration
> below with
> > U1, int32, return_counts = False, 10 vs 10k.
> >
> > I wonder if this is actually intended behavior, i.e. the view-stuff is
> actually a good
> > idea for the user to think about and implement if appropriate for their
> usecase (as
> > it is for me).
> >
> > Best regards,
> > Shyam
> >
> >
> > import numpy as np
> >
> > charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1') charlist_10k =
> > np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')
> >
> > def unique_basic(x):
> >     return np.unique(x)
> >
> > def unique_view(x):
> >     return np.unique(x.view(np.int32)).view(x.dtype)
> >
> > In [27]: %timeit unique_basic(charlist_10)
> > 2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
> each)
> >
> > In [28]: %timeit unique_view(charlist_10)
> > 2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
> each)
> >
> > In [29]: %timeit unique_basic(charlist_10k)
> > 204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
> >
> > In [30]: %timeit unique_view(charlist_10k)
> > 66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops
> each)
> >
> > In [31]: np.__version__
> > Out[31]: '1.25.2'
> >
> >
> >
> > --
> > Shyam Saladi
> > https://shyam.saladi.org
> > _______________________________________________
> > NumPy-Discussion mailing list -- numpy-discussion@python.org To
> unsubscribe
> > send an email to numpy-discussion-le...@python.org
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > Member address: raghuveer.devulapa...@intel.com
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: charlesr.har...@gmail.com
>

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Curious performance different with np.unique on arrays of characters

Reply via email to