rtpsw commented on PR #13880: URL: https://github.com/apache/arrow/pull/13880#issuecomment-1236305581
I recently read [this key-map doc](https://github.com/apache/arrow/blob/50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab/cpp/src/arrow/compute/exec/doc/key_map.md), and in particular [this section in it](https://github.com/apache/arrow/blob/50a7d15dfb4cbc4dd449ff2bb3ba2b1cde62a3ab/cpp/src/arrow/compute/exec/doc/key_map.md#32-bit-hash-vs-64-bit-hash). which describes a practical limit of ~16 million inserted keys. Thinking about similar implications here, the latest version of PR also has some practical limit on the number of inserted keys due to the potential for collisions; specifically, these collisions are currently not arbitrated, i.e., if a collision occurs then the result is undefined. The probability of a collision after inserting 16 million (2^{24}) keys (using the slow-path) can be estimated using the [square-approximation of the birthday problem](https://en.wikipedia.org/wiki/Birthday_problem#Square_approximation) to be 2^{2*24}/2/2^{64} = 2^{-17}, which some applications (or pe ople) won't see as negligible. Given the above, I see several options we have: 1. Leave the code as is, to be dealt with later, and just document the limitation for now. 2. Add logic to arbitrate upon collision. This could involve reusing the Swiss table described in the doc. In any case, being an extra layer of logic at a hot-spot of the code, this is expected to have a noticeable performance cost. 3. Switch to an internal key type of 128-bit, for which the collision probability 2^{2*24}/2/2^{128} = 2^{-81} is negligible for all but extreme applications that may not even be justified in data science. Incidentally, this would also allow supporting the decimal-128 type using the fast-path. The performance cost of this could likely be kept small using vectorization for computing a slightly-tweaked pair of 64-bit hashes. @westonpace, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org