rtpsw commented on PR #13880:
URL: https://github.com/apache/arrow/pull/13880#issuecomment-1236305581

   I recently read [this key-map 
doc](https://github.com/apache/arrow/blob/50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab/cpp/src/arrow/compute/exec/doc/key_map.md),
 and in particular [this section in 
it](https://github.com/apache/arrow/blob/50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab/cpp/src/arrow/compute/exec/doc/key_map.md#32-bit-hash-vs-64-bit-hash).
 which describes a practical limit of ~16 million inserted keys. Thinking about 
similar implications here, the latest version of PR also has some practical 
limit on the number of inserted keys due to the potential for collisions; 
specifically, these collisions are currently not arbitrated, i.e., if a 
collision occurs then the result is undefined. The probability of a collision 
after inserting 16 million (2^{24}) keys (using the slow-path) can be estimated 
using the [square-approximation of the birthday 
problem](https://en.wikipedia.org/wiki/Birthday_problem#Square_approximation) 
to be 2^{2*24}/2/2^{64} = 2^{-17}, which some applications (or pe
 ople) won't see as negligible.
   
   Given the above, I see several options we have:
   1. Leave the code as is, to be dealt with later, and just document the 
limitation for now.
   2. Add logic to arbitrate upon collision. This could involve reusing the 
Swiss table described in the doc. In any case, being an extra layer of logic at 
a hot-spot of the code, this is expected to have a noticeable performance cost.
   3. Switch to an internal key type of 128-bit, for which the collision 
probability 2^{2*24}/2/2^{128} = 2^{-81} is negligible for all but extreme 
applications that may not even be justified in data science. Incidentally, this 
would also allow supporting the decimal-128 type using the fast-path. The 
performance cost of this could likely be kept small using vectorization for 
computing a slightly-tweaked pair of 64-bit hashes.
   
   @westonpace, WDYT? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to