cyb70289 opened a new pull request #10009:
URL: https://github.com/apache/arrow/pull/10009


   The Arrow mode kernel performs poorly compared with scipy.stats.mode
   (which is based on numpy.unique). The Arrow mode kernel stores
   value:count pairs in a map, while numpy.unique sorts the input array
   and then counts adjacent equal values. In my tests, the map approach
   only wins when there are many duplicated values
   (length / value_range > 100), which does not look very useful in
   practice.
   
   This patch rewrites the mode kernel to use the sort-and-count approach
   for floating-point values and for integers with a wide value range.
   A 2x performance improvement is observed.
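
   For illustration only, the two strategies discussed here (a value:count
   map versus sort-and-count) might look roughly like the Python sketch
   below. It is not the Arrow C++ kernel: the function names mode_by_map
   and mode_by_sort are hypothetical, the input is assumed to be
   non-empty, and breaking ties toward the smaller value is an assumption
   made for the sketch.

```python
from collections import Counter
from itertools import groupby


def mode_by_map(values):
    """Map approach: build a value -> count table, then pick the top count."""
    counts = Counter(values)
    # Tie-break toward the smaller value (an assumption for this sketch).
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def mode_by_sort(values):
    """Sort-and-count approach: sort, then count runs of adjacent equal values."""
    best_value, best_count = None, 0
    for value, run in groupby(sorted(values)):
        count = sum(1 for _ in run)
        # Strict '>' keeps the first (smallest) value when counts tie.
        if count > best_count:
            best_value, best_count = value, count
    return best_value, best_count


data = [3, 1, 2, 3, 2, 3, 1]
print(mode_by_map(data))   # (3, 3)
print(mode_by_sort(data))  # (3, 3)
```

   Both functions return a (value, count) pair. The map version does one
   table update per element, while the sorted version scans contiguous
   runs after the sort.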



