[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: Hey Raymond, > "Changing the dict implementation to a linked list would make your case perform better but would make most users worse off." Is there any standard benchmark? I would like to implement my idea, as I think that it is very unlikely that "most users would be made worse off" by a measurable amount if I implement what I have in mind. If there aren't standard benchmarks, I can always make artificial ones (insert a lot of entries, remove them, access, iterate, etc). Thank you, Daniel -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: Thank you for your reply. I didn't know that this was the expected behavior (found it at https://wiki.python.org/moin/TimeComplexity): "For these operations, the worst case n is the maximum size the container ever achieved, rather than just the current size. For example, if N objects are added to a dictionary, then N-1 are deleted, the dictionary will still be sized for N objects (at least) until another insertion is made." Thank you for pointing out! I still disagree with this design since it's not how a person using a hash map expects it to behave. That said, this link clearly shows that it's *not* a bug, and that we just have different opinions on what is the best trade-off between space, code complexity, and speed, especially on something as ubiquitous as a dictionary. Just one final attempt to gain support for my idea before I close the issue. When I proposed a linked list I didn't mean an "intro to programming" linked list, with pointers and a new malloc for every node. I meant simply adding two fields to PyDictKeyEntry with the indices of the next and previous valid entries. There would be a tiny memory overhead, and we would get a data structure that behaves how one would reasonably expect it to (compare with hash table found in the standard libraries of other languages. It's usually unexpected for an operation in a data structure to take time proportional to its historical value). I will close this issue in two days if there are no further replies. Thank you for the discussion! -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: Thank you again, Dennis. I would disagree with your conclusion for mainly 3 reasons: 1. The linked list proposal would increase the memory usage by 2 Py_ssize_t per entry, plus another 2 for the whole dictionary. I don't think this is an overwhelming amount of extra memory for Python, but of course it's open for discussion. Insertions and removals would become marginally slower (I'll try to measure it), membership wouldn't suffer, and iteration would become O(1) per element iterated on. 2. I think that this case is covered by "Dynamic Mappings" in the document you've linked to. 3. This is not about the ordering in ducts being first or second nature. Please read the example in bad_dict_example.py to see a bad case where hearing over a dict of size 1 can take milliseconds. I want to reiterate (pun not intended) that this is not a use case for me, but it surprised me that dictionaries display this behavior. It's not hard to imagine a real use case where it just happens that someone deletes elements from a dictionary in insertion order. Or that someone deletes all keys but the last inserted (in any order), resulting in a dictionary that takes way too long to iterate on. Thank you for the discussion. :) -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: I think the idea of augmenting the struts to maintain a linked list of elements together with periodically shrinking on removal would solve both time and space issues, right? Space will be always linear, and operations would still be constant amortized time (of course the worst case of removing will be linear because of the shrinking, but I guess this behavior is expected). I wrote a worse example in bad_dict_example.py submitted to this issue. This example would be fixed simply by the shrinking, but as is it's a pretty unexpected bug, as a simple sum(d.values()) can take milliseconds on a dictionary with only one entry (both key and value are int). -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: Please find attached a more complete example of the issue I am reporting. tl;dr: I can make `sum(d.values())` run in O(maximum_size_in_d's_history) instead of O(len(d)), even when len(d) == 1. The linked list approach would work in terms of making it faster, but we would still be using too much space. -- Added file: https://bugs.python.org/file50138/bad_dict_example.py ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: Thank you for your message! I'm not particularly looking for an implementation, I was just surprised by this behavior. It can get even worse. Consider this example: ``` d = large_dict() # remove all but one element of d, runs in quadratic time as explained above while len(d) > 1: del d[next(iter(d))] # now iterating over d takes O(d), even when d has only one item: # the following prints one key, but takes O(N) for key in d: print(key) ``` I think this example is better, since a person would expect iterating over a singleton dictionary would be really fast, but it can be made as slow as one wants. A "print_keys()" function would reasonably be expected to take O(size of dictionary), but it doesn't. Am I right to think that this is a performance bug? On Fri, Jul 2, 2021, 8:10 PM Dennis Sweeney wrote: > > Dennis Sweeney added the comment: > > For what it's worth, using collections.OrderedDict gives constant-time > behavior you're looking for. > > -- > nosy: +Dennis Sweeney > > ___ > Python tracker > <https://bugs.python.org/issue44555> > ___ > -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
Daniel Fleischman added the comment: The linked list solution is not as easy to implement as I expected, because of insertions. :( -- ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).
New submission from Daniel Fleischman : The following code takes quadratic time on the size of the dictionary passed, regardless of the dictionary (explanation below): ``` def slow_dictionary(d): while len(d) > 0: # Remove first element key = next(iter(d)) del d[key] ``` The problem is that when an element is deleted a NULL/NULL placeholder is set in its place (https://github.com/python/cpython/blob/818628c2da99ba0376313971816d472c65c9a9fc/Objects/dictobject.c#L1534) and when we try to find the first element with `next(iter(d))` the code needs to skip over all the NULL elements until it finds the first non-NULL element (https://github.com/python/cpython/blob/818628c2da99ba0376313971816d472c65c9a9fc/Objects/dictobject.c#L1713). I'm not sure of what is the best way to fix it, but note that simply adding a field to the struct with the position of the first non-NULL element is not enough, since a code that always deletes the SECOND element of the dictionary would still have linear operations. An easy (but memory-wasteful) fix would be to augment the struct PyDictKeyEntry with the indices of the next/previous non empty elements, and augment _dictkeysobject with the index of the first and last non empty elements (in other words, maintain an underlying linked list of the non empty entries). With this we can always iterate in O(1) per entry. (I tested it only on version 3.9.2, but I would be surprised if it doesn't happen in other versions as well). -- components: Interpreter Core messages: 396880 nosy: danielfleischman priority: normal severity: normal status: open title: Dictionary operations are LINEAR for any dictionary (for a particular code). type: performance versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue44555> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com