[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-08 Thread Inada Naoki


Inada Naoki  added the comment:

There is a macro benchmark suite we use regularly:
https://pyperformance.readthedocs.io/

You can run a micro benchmark for the most common use cases with pyperf like this:

```
$ pyperf timeit -s "def f(**kwargs): pass" -- "f(a=1,b=2,c=3)"
Mean +- std dev: 119 ns +- 3 ns
```

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-08 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

Hey Raymond,

> "Changing the dict implementation to a linked list would make your case
perform better but would make most users worse off."

Is there any standard benchmark? I would like to implement my idea, as I
think that it is very unlikely that "most users would be made worse off" by
a measurable amount if I implement what I have in mind. If there aren't
standard benchmarks, I can always make artificial ones (insert a lot of
entries, remove them, access, iterate, etc).

Thank you,
Daniel

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-05 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

> I still disagree with this design since it's not 
> how a person using a hash map expects it to behave.

All mapping data structures have tradeoffs.  For the core dict type, we chose a 
structure that benefits most Python users most of the time.  If a user expects 
that the core dict type magically handles all cases optimally, that is a bug in 
their expectation, not a bug in Python.  

For the atypical pattern shown in slow_dictionary(), we recommend choosing a 
different data structure such as an OrderedDict or simply storing the keys in a 
deque.  This isn't a blow-off answer -- the stated goal of the collections 
module is to offer "specialized container datatypes providing alternatives to 
Python’s general purpose built-in containers, dict, list, set, and tuple."  
Those alternatives have different performance characteristics that may better 
serve particular access patterns.
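The deque suggestion above can be sketched in a few lines (a minimal, hypothetical wrapper; names are illustrative, and it assumes keys are only ever removed in FIFO order):

```python
from collections import deque

class FifoDict:
    """A dict paired with a deque of keys, so popping the oldest key
    is O(1) regardless of how many entries the dict has ever held."""

    def __init__(self):
        self._d = {}
        self._order = deque()   # keys in insertion order

    def __setitem__(self, key, value):
        if key not in self._d:
            self._order.append(key)
        self._d[key] = value

    def pop_first(self):
        # O(1): no scanning over dummy slots, unlike next(iter(d))
        # on a dict with a large historical size.
        key = self._order.popleft()
        return key, self._d.pop(key)
```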

Thanks for the suggestion, but this is not a bug.  It is an intentional design 
decision.  Changing the dict implementation to a linked list would make your 
case perform better but would make most users worse off.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-05 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

Thank you for your reply. I didn't know that this was the expected behavior 
(found it at https://wiki.python.org/moin/TimeComplexity):

"For these operations, the worst case n is the maximum size the container ever 
achieved, rather than just the current size. For example, if N objects are 
added to a dictionary, then N-1 are deleted, the dictionary will still be sized 
for N objects (at least) until another insertion is made."

Thank you for pointing it out! 

I still disagree with this design since it's not how a person using a hash map 
expects it to behave. That said, this link clearly shows that it's *not* a bug, 
and that we just have different opinions on what is the best trade-off between 
space, code complexity, and speed, especially on something as ubiquitous as a 
dictionary.

Just one final attempt to gain support for my idea before I close the issue. 
When I proposed a linked list I didn't mean an "intro to programming" linked 
list, with pointers and a new malloc for every node. I meant simply adding two 
fields to PyDictKeyEntry with the indices of the next and previous valid 
entries. There would be a tiny memory overhead, and we would get a data 
structure that behaves how one would reasonably expect it to (compare with hash 
table found in the standard libraries of other languages. It's usually 
unexpected for an operation in a data structure to take time proportional to 
its historical value).

I will close this issue in two days if there are no further replies. Thank you 
for the discussion!

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-05 Thread Inada Naoki


Inada Naoki  added the comment:

Iterating over the whole dict is O(n), where n is the historical maximum size of 
the dict.

On the other hand, there is no guarantee that `next(iter(d))` is O(1); the 
guarantee is only O(n) in the worst case, and your example is that worst case. So 
this is not a bug.

As Dennis wrote, we won't add a doubly linked list to the dict only for this 
purpose. OrderedDict exists for that use case.

I don't have any better idea than bpo-32623. More ideas are welcome.

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

Thank you again, Dennis.

I disagree with your conclusion for three main reasons:

1. The linked list proposal would increase the memory usage by 2 Py_ssize_t per 
entry, plus another 2 for the whole dictionary. I don't think this is an 
overwhelming amount of extra memory for Python, but of course it's open for 
discussion. Insertions and removals would become marginally slower (I'll try to 
measure it), membership wouldn't suffer, and iteration would become O(1) per 
element iterated on. 

2. I think that this case is covered by "Dynamic Mappings" in the document 
you've linked to.

3. This is not about whether ordering is a first- or second-class feature of 
dicts. Please read the example in bad_dict_example.py to see a bad case where 
iterating over a dict of size 1 can take milliseconds.

I want to reiterate (pun not intended) that this is not a use case for me, but 
it surprised me that dictionaries display this behavior. It's not hard to 
imagine a real use case where it just happens that someone deletes elements 
from a dictionary in insertion order. Or that someone deletes all keys but the 
last inserted (in any order), resulting in a dictionary that takes way too long 
to iterate on. 


Thank you for the discussion. :)

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

OrderedDict already uses a linked list alongside a dictionary and meets all of 
your timing expectations, but has a higher memory usage, as linked lists 
generally do.

Since so many Python objects use dictionaries under the hood, it would probably 
not be worth it to completely change the dict structure and the memory usage of 
Python as a whole to only benefit this uncommon case. See Objects/dictnotes.txt 
for the highest priority use cases of dicts.

Per the OrderedDict docs, ordering is only secondary for dicts, but it's the 
focus for OrderedDicts. Just like you wouldn't send a list to do the job of a 
deque or vice versa, I don't think you should send a dict to do the job of an 
OrderedDict or vice versa.

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

I think the idea of augmenting the structs to maintain a linked list of elements 
together with periodically shrinking on removal would solve both time and space 
issues, right? 

Space will be always linear, and operations would still be constant amortized 
time (of course the worst case of removing will be linear because of the 
shrinking, but I guess this behavior is expected).

I wrote a worse example in bad_dict_example.py, submitted to this issue. That 
example would be fixed simply by the shrinking, but as it stands it's pretty 
unexpected behavior, as a simple sum(d.values()) can take milliseconds on a 
dictionary with only one entry (both key and value being ints).
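A reconstruction along the lines of the description (bad_dict_example.py itself is an attachment and is not shown here, so its exact shape is an assumption):

```python
import time

def make_bad_dict(n):
    # Insert n int keys, then delete all but the last one inserted.
    # The backing table keeps its historical size, so the one
    # surviving entry hides behind roughly n dummy slots.
    d = {i: i for i in range(n)}
    for i in range(n - 1):
        del d[i]
    return d

d = make_bad_dict(1_000_000)
assert len(d) == 1

start = time.perf_counter()
total = sum(d.values())          # visits ~1,000,000 slots for one value
print(f"summed {len(d)} value(s) in {time.perf_counter() - start:.4f}s")
```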

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

Alternate idea: the dict could shrink, if required, on the iter() call, 
whenever a keys/values/items iterator is initialized.

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

> Can https://bugs.python.org/issue32623 be a fix (or mitigation) of this issue?

If such a change is implemented and dictionaries shrink when their fill falls 
below 1/8, the behavior of `while d: del d[next(iter(d))]` will remain 
quadratic: there will be `ma_used - dk_size/8` (linear in len(d)) iterations of 
that loop that scan over half that many dummies on average.

However, such a change could nonetheless speed up `while d: del 
d[next(iter(d))]` by a constant factor. As of now, the loop scans over roughly 
n**2/2 dummies. After that change, assuming the dict starts 1/3 full so the 
size is 3*n, there would be:

(n - 3*n/8)**2 / 2 dummy scans before shrinking => 25/128 * n**2 scans, 
which would leave 3*n/8 entries remaining, on which the same process would 
repeat. The total scans would be bounded by

(25/128 * n**2) + (25/128 * (3*n/8)**2) + (25/128 * (9*n/64)**2) + ...
= 25/128 * n**2 * (1 + 9/64 + 81/4096 + ...)
= 25/128 * n**2 / (1 - 9/64)
= 25/128 * n**2 * (64/55)
= 5/22 * n**2

... just under half the number of dummies to scan.

If we were to shrink slightly more aggressively, at a fill of 1/6, it would be 
something like:

(n - 3*n/6)**2 / 2 dummy scans before shrinking => n**2 / 8 scans, leaving 
n/2 entries remaining. The total would then be bounded by

(n**2 / 8) + ((n/2)**2 / 8) + ((n/4)**2 / 8) + ...
= n**2 / 8 * (1 + 1/4 + 1/16 + ...)
= n**2 / 8 / (1 - 1/4)
= n**2 / 8 * (4 / 3)
= n**2 / 6

... about one third of the number of dummies to scan.

This ignores the linear overhead of actually doing the re-sizing, which would 
dominate on small tables, so maybe there could be some threshold below which 
there would be no shrinking, say 256 (dk_log2_size=8) or so.
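The arithmetic above can be checked with a quick simulation (an assumption-laden model of the `while d: del d[next(iter(d))]` loop, not CPython itself):

```python
def total_scans(n, shrink_at=8):
    """Count dummy-slot scans when the table shrinks once the fill
    drops below 1/shrink_at, resizing back to 1/3 full."""
    scans = 0
    live, size, dummies = n, 3 * n, 0   # dict starts 1/3 full
    while live:
        scans += dummies                # next(iter(d)) skips the dummies
        dummies += 1
        live -= 1
        if size and live < size / shrink_at:
            size = 3 * live             # resize: table 1/3 full again
            dummies = 0
    return scans

n = 1 << 12
print(total_scans(n) / n**2)       # ≈ 5/22 ≈ 0.227, as derived above
print(total_scans(n, 6) / n**2)    # ≈ 1/6, for the more aggressive threshold
```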

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-03 Thread Inada Naoki


Inada Naoki  added the comment:

Can https://bugs.python.org/issue32623 be a fix (or mitigation) of this issue?

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
nosy: +methane, rhettinger, serhiy.storchaka




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

Please find attached a more complete example of the issue I am reporting.

tl;dr: I can make `sum(d.values())` run in O(maximum_size_in_d's_history) 
instead of O(len(d)), even when len(d) == 1.

The linked list approach would work in terms of making it faster, but we would 
still be using too much space.

--
Added file: https://bugs.python.org/file50138/bad_dict_example.py




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

Thank you for your message!

I'm not particularly looking for an implementation; I was just surprised by
this behavior.

It can get even worse. Consider this example:

```
d = large_dict()

# Remove all but one element of d; this runs in quadratic time as
# explained above.
while len(d) > 1:
    del d[next(iter(d))]

# Now iterating over d takes O(N), even when d has only one item:
# the following prints one key, but takes O(N).
for key in d:
    print(key)
```

I think this example is better, since a person would expect iterating over
a singleton dictionary to be really fast, but it can be made as slow as
one wants. A "print_keys()" function would reasonably be expected to take
O(size of dictionary), but it doesn't.

Am I right to think that this is a performance bug?

On Fri, Jul 2, 2021, 8:10 PM Dennis Sweeney  wrote:

>
> Dennis Sweeney  added the comment:
>
> For what it's worth, using collections.OrderedDict gives constant-time
> behavior you're looking for.
>
> --
> nosy: +Dennis Sweeney

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Dennis Sweeney


Dennis Sweeney  added the comment:

For what it's worth, using collections.OrderedDict gives the constant-time 
behavior you're looking for.
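A sketch of the point: OrderedDict maintains a doubly linked list of its entries, so removing or finding the oldest item stays O(1) regardless of how many items have come and gone.

```python
from collections import OrderedDict
import time

# Build a large OrderedDict and strip it down to a single entry.
od = OrderedDict((i, i) for i in range(100_000))
while len(od) > 1:
    od.popitem(last=False)       # O(1) per removal (FIFO order)

# Reaching the one surviving entry is O(1), unlike next(iter(d)) on a
# plain dict whose table is still sized for its historical maximum.
start = time.perf_counter()
remaining = next(iter(od))
print(f"first key {remaining} found in {time.perf_counter() - start:.6f}s")
```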

--
nosy: +Dennis Sweeney




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Daniel Fleischman


Daniel Fleischman  added the comment:

The linked list solution is not as easy to implement as I expected, because of 
insertions. :(

--




[issue44555] Dictionary operations are LINEAR for any dictionary (for a particular code).

2021-07-02 Thread Daniel Fleischman


New submission from Daniel Fleischman :

The following code takes quadratic time in the size of the dictionary passed, 
regardless of the dictionary's contents (explanation below):

```
def slow_dictionary(d):
    while len(d) > 0:
        # Remove the first element
        key = next(iter(d))
        del d[key]
```

The problem is that when an element is deleted a NULL/NULL placeholder is set 
in its place 
(https://github.com/python/cpython/blob/818628c2da99ba0376313971816d472c65c9a9fc/Objects/dictobject.c#L1534)
 and when we try to find the first element with `next(iter(d))` the code needs 
to skip over all the NULL elements until it finds the first non-NULL element 
(https://github.com/python/cpython/blob/818628c2da99ba0376313971816d472c65c9a9fc/Objects/dictobject.c#L1713).
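The historical-size effect can be observed without reading the C source: deletions never shrink the backing table, so the dict's reported memory footprint stays at its historical maximum (a quick sketch; the exact byte counts are implementation details of CPython):

```python
import sys

n = 100_000
d = {i: i for i in range(n)}
size_when_full = sys.getsizeof(d)

# Delete every key except the last one inserted.
for i in range(n - 1):
    del d[i]

# One entry left, but the table is still sized for ~n entries.
assert len(d) == 1
print(f"len(d)={len(d)}, getsizeof={sys.getsizeof(d)} "
      f"(was {size_when_full} when full)")
```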

I'm not sure what the best way to fix it is, but note that simply adding a 
field to the struct with the position of the first non-NULL element is not 
enough, since code that always deletes the SECOND element of the dictionary 
would still see linear operations.

An easy (but memory-wasteful) fix would be to augment the struct PyDictKeyEntry 
with the indices of the next/previous non empty elements, and augment 
_dictkeysobject with the index of the first and last non empty elements (in 
other words, maintain an underlying linked list of the non empty entries). With 
this we can always iterate in O(1) per entry.
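A pure-Python toy model of that proposal (the real change would live in C, inside PyDictKeyEntry; the names here are illustrative only): each slot records the indices of the previous and next live entries, so iteration costs O(1) per element no matter how many slots have been deleted. Like the real dict, the slot array below never shrinks, so the space concern remains.

```python
class LinkedDict:
    def __init__(self):
        self._slots = []               # [key, value, prev, next] per slot
        self._index = {}               # key -> slot position
        self._head = self._tail = -1   # first/last live slot, -1 if none

    def __setitem__(self, key, value):
        if key in self._index:
            self._slots[self._index[key]][1] = value
            return
        i = len(self._slots)
        self._slots.append([key, value, self._tail, -1])
        if self._tail != -1:
            self._slots[self._tail][3] = i
        else:
            self._head = i
        self._tail = i
        self._index[key] = i

    def __delitem__(self, key):
        i = self._index.pop(key)
        _, _, prev, nxt = self._slots[i]
        if prev != -1:
            self._slots[prev][3] = nxt
        else:
            self._head = nxt
        if nxt != -1:
            self._slots[nxt][2] = prev
        else:
            self._tail = prev
        self._slots[i] = None          # dummy, never visited by iteration

    def __iter__(self):
        i = self._head
        while i != -1:                 # follow the links: O(1) per element
            yield self._slots[i][0]
            i = self._slots[i][3]

    def __len__(self):
        return len(self._index)
```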

(I tested it only on version 3.9.2, but I would be surprised if it doesn't 
happen in other versions as well).

--
components: Interpreter Core
messages: 396880
nosy: danielfleischman
priority: normal
severity: normal
status: open
title: Dictionary operations are LINEAR for any dictionary (for a particular 
code).
type: performance
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 
3.9
