aherbert commented on PR #695: URL: https://github.com/apache/commons-collections/pull/695#issuecomment-4804426754
Short answer: Bloom filter operations are not atomic. Making them so is too expensive. Whatever option we choose here will be a partial fix and the filter is likely compromised in any case. The library should be used when you have control over the entire stack of code that generates indices. Then you will not have a problem. This is a bug for an atypical use case. --- I think that the library is mainly optimised for correct usage. If you deliberately use bad content then this lazy checking will result in the filter notifying the caller via an exception that it failed after it has modified its internal state. The filter should then be considered broken. Note that if you do early checking on the indices and raise an exception on the first bad index, you may have already merged some other indices and so have a broken filter anyway. Unfortunately making the filter merge operations atomic is far too expensive for practical use. So any merge will likely break the filter if an index is invalid, no matter what code changes we make here. The OP notes that index checking is performed in SimpleBloomFilter so that cannot have a corrupted state. But it would be corrupted if indices have been already merged. Note also in the same class the ``SimpleBloomFilter.merge(BitMapExtractor)`` can merge bits above the size of the filter defined by the shape. It will raise an exception but not fix its own state (which it could do with masking out the bad bits before raising the exception). So that also has an issue with broken cardinality after a bad merge. We could modify the code to remove invalid indices before raising. Here the indices are in a TreeSet. So the indices could be removed with: ```java indices.tailSet(shape.getNumberOfBits()).clear(); ``` But this only partially corrects something that is broken. It is still broken, just less so. I think there are several solutions which require some discussion: 1. Clean up bad indices before throwing. The filter could have partially merged indices that do not correspond to any real hashed object. 2. Leave it as is and document that you should never ignore exceptions and then forever consider the filter compromised. 3. Set an internal flag before throwing the exception. This internal state flag would be similar to the one used in the ``ArrayCountingBloomFilter``. It knows when it has been broken based on this state flag. You can check this using ``isValid()`` at any point. 4. Make them atomic. You could go through all the indices and check them; them go through them all again and add them. This is double the computation cost for all correct usage of the classes to stop malicious usage which really should not be a problem anyway. Unfortunately the ``isValid`` flag is part of the ``CountingBloomFilter`` interface. Perhaps it should have been put into the ``BloomFilter`` interface instead. Adding it now would have to be a default method that returns true. We could then implement it correctly in all the filters in this codebase. Note: You cannot go through the indices again and remove them. You may remove indices that are from other object hashes added to the filter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
