Re: [Python-Dev] PEP 572 and assert

2018-07-17 Thread Tim Peters
[Barry Warsaw]

> Thanks!  I thought it was cute.  It was just something that occurred to me
> as I was reviewing some existing code.  The intent wasn’t to use `subdirs`
> outside of the assert statement, but I’m warm to it because it means I
> don’t have to do wasted work outside of the assert statement, or repeat
> myself in the assert message part.
>

Because the latter ("repeat myself") is probably more tempting, I'll just
note that the "laziness" of using an assignment expression instead may well
have nudged you toward writing _better_ code too.

   assert len(subdirs := list(path.iterdir())) == 0, subdirs

Assuming the result of list(path.iterdir()) can change over time (seems
very likely),

   assert len(list(path.iterdir())) == 0,  list(path.iterdir())

_could_ end up both triggering and displaying an empty list in the
exception detail.  The assignment-expression version cannot.
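
For concreteness, here's a tiny runnable sketch of that hazard (an
illustration, not Barry's code), with an iterator standing in for a
directory whose contents change between the two evaluations:

    state = iter([["surprise"], []])   # pretend directory that empties out

    def listdir():
        return next(state)             # first call non-empty, second call empty

    try:
        # The message expression is evaluated only after the test fails, so
        # it re-runs listdir() and sees different state than the test did.
        assert len(listdir()) == 0, listdir()
    except AssertionError as exc:
        print("AssertionError:", exc)  # AssertionError: []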


Re: [Python-Dev] Const access to CPython objects outside of GIL?

2018-07-17 Thread Tim Peters
[Radim Řehůřek ]

> Thanks Tim. That's one unambiguous answer.


I'm glad that part came across ;-)


> I did consider posting to python-list, but this seemed somewhat
> python-devish.
>

I agree it's on the margin - I don't mind.

> Any appetite for documenting which foundational functions are const-safe in
> the sense I described? Perhaps we could help. What kind of analysis,
> demand or momentum is needed for the Python dev team to consider
> introducing such a contract?
>

Most are volunteers scratching their own itches, and sometimes paid to
scratch their employers' itches.  This isn't my itch, so not me.  Perhaps
someone else - but from the messages so far it appears that most who
replied suspect this is an "XY problem":

https://en.wikipedia.org/wiki/XY_problem

Python's been around for decades, and best I recall nobody has suggested
anything really akin to what you're asking for here.  There certainly are C
extensions that want GIL-free access to (possibly massive amounts of) data
visible from Python too.  `numpy` is the prime example of that.  As Antoine
hinted, people working on those came up with the "buffer protocol" instead
(documented in the Python/C API Reference Manual under the "Buffer
Protocol" section).

But nobody here knows anything about your app, so can't guess from here
whether the buffer protocol is of any interest to you.  The buffer protocol
is aimed at uniform layouts of (possibly massive amounts of) native
C-format data, not at GIL-free access to individual little Python objects.
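
For a rough idea of what that looks like from the Python side (a minimal
sketch; a C extension reaches the same memory via PyObject_GetBuffer):

    from array import array

    data = array("d", [0.0] * 1_000_000)   # a million C doubles, one contiguous block
    view = memoryview(data)                # zero-copy view via the buffer protocol
    print(view.format, view.itemsize, view.nbytes)   # d 8 8000000

A C extension holding such a buffer can release the GIL while it crunches
the raw memory, which is the kind of GIL-free bulk access the protocol was
designed for.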

> To be honest, I did do some CPython source code staring already.


Which is the only way you can learn anything about whether breaking
fundamental rules can get you in trouble.


> And at least for the functions we care about, it wasn't all that painful
> (PyDict_GetItem being the trickiest). Doing this wholesale might be a
> superhuman task, but I'm thinking a practical subset could be relatively
> straightforward while still useful, 80/20.
>

If you want to pursue this, it's probably necessary to first specify the
set of API functions that need to change their requirements and promises,
and to specify exactly what they're supposed to require and promise instead.

Even then, you should expect resistance.  It's been, historically, hard
enough to avoid thread bugs in CPython's implementation even with the
blanket "no access without the GIL, period" current requirement.  Adding
some number of "but these functions don't care about the GIL, in some
circumstances they can't verify, let alone enforce" exceptions would
complicate an area that's already delicate.

CPython's implementation not only requires that only one thread holds the
GIL, it also works to verify that at runtime, and aborts with a fatal error
if it detects that it's even slightly confused about which thread currently
holds the GIL.  Adding weaker but unverifiable preconditions to some
functions isn't attractive on the face of it.

Note:  the kind of people who find the GIL extremely intrusive tend instead
to work on ways to eliminate the GIL entirely.  They typically give up
after a few years of intense pain ;-)


Re: [Python-Dev] Issue?

2018-08-22 Thread Tim Peters
I wrote Python's sort, so I may know something about it ;-)  To my
eyes, no, there's not "an issue" here, but a full explanation would be
very involved.

For some sorting algorithms, it's possible to guarantee a redundant
comparison is never made.  For example, a pure insertion sort.

But Python's sort is quite complex, using a number of distinct
strategies dynamically adapting to structure found in the data as it
goes along.  There's a possibility for a small bit of redundant work
whenever switching from one strategy to another.  That's typical of
"hybrid" strategies.

Your case involves a switch between Python's sort's two simplest
strategies.  Let's just look at what matters:

[22, 33, 11, 11]

The first step is searching for "a natural run", an initial
sub-sequence that's already sorted.  22 < 33 says the first two
elements are already in order.  Then 11 < 33 says that's the end of
the run.

That part is over now.

The next step is using a binary insertion sort to move 11 into the
already-sorted [22, 33] prefix.  A binary search starts looking "in
the middle" first, but a 2-element run is so short that 33 "is the
middle", so it first compares 11 to 33 _again_.  So it goes.  Code to
detect that special case could be added, but it would probably cost
more than it saves (it always costs time to test for it, but that time
is pure waste unless the tests succeed).
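
For the curious, a toy sketch (instrumentation only, nothing to do with the
sort's internals) that logs every __lt__ call list.sort() makes on that
input:

    class Noisy:
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):
            print(f"compare {self.value} < {other.value}")
            return self.value < other.value
        def __repr__(self):
            return repr(self.value)

    data = [Noisy(v) for v in [22, 33, 11, 11]]
    data.sort()          # list.sort() uses only __lt__
    print(data)          # [11, 11, 22, 33]

The log shows "compare 11 < 33" twice: once ending the natural run, and once
again as the first probe of the binary insertion.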

This is specific to an ascending natural run of length 2, so is quite
a special case.  For example, in

[22, 33, 44, 11, 11]

11 < 44 ends the natural run, and then the binary search first
compares 11 to 33 ("the middle" of the 3-element natural run), and no
redundant comparison gets made.  But in

[22, 33, 44, 43, 11]

43 gets compared to 44 again on the _second_ iteration of the binary
search loop.

In general, this particular strategy switch ends up doing at most one
redundant comparison only if the first element after an ascending
natural run belongs immediately before the last element of the run.
For technical reasons, this can happen at most

min(len(the_list) / 32, number_of_natural_runs_in_the_list)

times, so it's generally trivial in an average-case O(N log N)
process.  It's so rare it would be minor in an O(N) process too,
unless N is very small.

A principled way to avoid this would be to skip the "find a run" step
if N is very small, leaving the whole thing to binary insertion sort.
Then a redundant comparison would never be done for small N.  But that
would end up doing more comparisons _overall_ in common cases where a
short list starts with a relatively (compared to N) sizable ascending
or descending run.

So I'm happy enough with the tradeoffs already in place.


On Wed, Aug 22, 2018 at 10:37 AM 楼晓峰 <1520306...@qq.com> wrote:
> Why compare twice?
>
> class Student(object):
>
>     def __init__(self, name, score):
>         self.name = name
>         self.score = score
>
>     def __str__(self):
>         return '(%s: %s)' % (self.name, self.score)
>
>     __repr__ = __str__
>
>     def __lt__(self, s):
>         #print(self, '--', s)
>         if(self.score<s.score):
>             print(self, '<', s)
>             return True
>         if(self.score>s.score):
>             print(self, '>', s)
>             return False
>         if(self.score==s.score):
>             if(self.name>s.name):
>                 print(self, '>', s)
>                 return False
>             if(self.name<s.name):
>                 print(self, '<', s)
>                 return True
>             if(self.name==s.name):
>                 print(self, '==', s)
>                 return False
>
>     def __eq__(self, s):
>         return (self.score == s.score) and (self.name == s.name)
>     def __gt__(self, s):
>         return not ((self == s) or (self < s))
>     def __le__(self, s):
>         return ((self == s) or (self < s))
>     def __ge__(self, s):
>         return ((self == s) or (self > s))
>     def __nq__(self, s):
>         return not (self == s)
>
> L = [Student('Tim', 22), Student('Bob', 33), Student('Kevin', 11),
>      Student('Alice', 11)]
> print(sorted(L))
>
> Output:
> (Bob: 33) > (Tim: 22)
> (Kevin: 11) < (Bob: 33)
> (Kevin: 11) < (Bob: 33)
> (Kevin: 11) < (Tim: 22)
> (Alice: 11) < (Tim: 22)
> (Alice: 11) < (Kevin: 11)
> [(Alice: 11), (Kevin: 11), (Tim: 22), (Bob: 33)]
>
> Best regards,
> Xiaofeng


Re: [Python-Dev] Why does the Contributor Agreement need my address?

2018-09-08 Thread Tim Peters
[Joseph C. Sible]
> I'm used to signing CLA's that require nothing beyond a name and a check
> box. When I went to sign the PSF Contributor Agreement so I can submit
> a PR for CPython, I was surprised to see that it wants my address. Why
> does the Python Software Foundation need this, especially when nobody
> else does?

So that our marketing partners can deliver exciting consumer shopping
opportunities directly to your front door ;-)

Seriously, "nobody else does" shows you haven't looked much.  For
example, the first two I just looked at also require a mailing
address:

Apache CLA
https://www.apache.org/licenses/icla.pdf

Android CLA
https://cla.developers.google.com/clas/new?domain=DOMAIN_GOOGLE&kind=KIND_INDIVIDUAL

So I'll guess that projects big enough to hire actual lawyers require
an address.  As to why they want an address, you'll have to ask a
lawyer!  There aren't any on this list.  So, if you really want to
pursue this, I suggest you direct the question instead to the Python
Software Foundation, which deals with the project's legalities:

p...@python.org


Re: [Python-Dev] ctypes: is it intentional that id() is the only way to get the address of an object?

2019-01-18 Thread Tim Peters
[MRAB]
>>  If I want to cache some objects, I put them in a dict, using the id as
>> the key. If I wanted to locate an object in a cache and didn't have
>> id(), I'd have to do a linear search for it.

[Greg Ewing ]
> That sounds dangerous. An id() is only valid as long as the object
> it came from still exists, after which it can get re-used for a different
> object.

The objects are the values in such a dict.

thedict[id(obj)] is obj

Therefore the objects can't become garbage before id(obj) is deleted
from the dict.

> So when an object is flushed from your cache, you would have
> to chase down all the places its id is being stored and eliminate them.

The dict itself keeps the objects alive.

> Are you sure you couldn't achieve the same thing more safely using
> weak references?

I can't say exactly what MRAB is doing.  I've done things "like that"
for decades, though, and have happily almost never used weakrefs.  I
wouldn't call my uses "caches", though - more like using dicts to
associate info with arbitrary objects (via using the object id as the
dict key), where the object implementations are out of my control and
don't properly support being used as dict keys.

This sometimes includes builtin mutable objects, like lists, or even
other dicts.

No such uses care about object addresses, though - just that id(obj)
returns a value usable as a dict key, unique among all reachable
objects at the time `id()` is called.
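
A minimal sketch of that pattern (an illustration, not MRAB's actual code) -
the table holds a reference to each object alongside the info, so the ids
can't be reused while they're still keys:

    class SideTable:
        def __init__(self):
            self._data = {}          # id(obj) -> (obj, info); obj kept alive here

        def set(self, obj, info):
            self._data[id(obj)] = (obj, info)

        def get(self, obj, default=None):
            entry = self._data.get(id(obj))
            return default if entry is None else entry[1]

    table = SideTable()
    config = [1, 2, 3]               # a list can't be used as a dict key directly
    table.set(config, "info about this list")
    print(table.get(config))         # info about this list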


Re: [Python-Dev] PEP 556 threaded garbage collection & linear recursion in gc

2019-03-27 Thread Tim Peters
[Gregory P. Smith ]
> ...
> A situation came up the other day where I believe this could've helped.
>
> Scenario (admittedly not one most environments run into): A Python process
> with a C++ extension module implementing a threaded server (threads
> spawned by C++) that could call back into CPython within server request
> handlers.  (ie: how all threaded servers tend to work regardless of core
> loop implementation language)
>
> Python code in the application had done something (unknown to me, I
> didn't dive into their code) that built up large enough presumably nested
> or recursive data structures that the garbage collector, when triggered,
> would wind up in very deep recursion.  This caused a stack overflow
>  as the C++ spawned threads were only being given a 256k stack (to
> conserve virtual address space - there can potentially be a _ton_ of
> threads in this code).
>
> That had a C++ stack trace 1000+ levels deep repeating the pattern of
>
> ...
> @ 0x564d59bd21de 32  func_dealloc
> @ 0x564d59bce0c1 32  cell_dealloc
> @ 0x564d5839db41 48  tupledealloc
> @ 0x564d59bd21de 32  func_dealloc
> @ 0x564d59bce0c1 32  cell_dealloc
> @ 0x564d5839db41 48  tupledealloc
> ...
>
> If our gc were done on a thread of its own spawned by Python, with a typical
> normal larger default stack size (8mb) this would've been less likely
> to trigger a crash (though obviously still possible if the recursion is 
> linear).

Just noting that gc is probably a red herring in that scenario.  gc
(cleverly!) determines the set of trash objects without needing any
recursive graph crawls.  And gc doesn't deallocate anything - not
directly.  It simply applies Py_DECREF to trash objects, once for each
PyObject* found pointing to them.  _If_ deallocation occurs as a
result, it's the same as if the user had done `del` on the appropriate
object manually.  The recursion then occurs in the chain of
XXX_dealloc calls, which in turn do more Py_DECREFs of their own, and
which have nothing in particular to do with gc.  Approximately the
same stack depth would be hit in a Python built without gc if the user
did the same sequence of `del`s by hand to break trash cycles.

The good news is that the "trashcan" mechanism - which predates gc -
was introduced to limit call depth in the presence of some potentially
unbounded chains of dealloc calls.  So teach trashcan about the
pattern here, and the stack depth problem goes away, with or without
gc.
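
Here's a quick way to see the kind of dealloc chain in question, and that
types which cooperate with the trashcan (like lists) survive it (a toy demo,
unrelated to the OP's code):

    # Build a chain of 200,000 nested lists, then drop the only references.
    outer = []
    node = outer
    for _ in range(200_000):
        node.append([])
        node = node[0]
    del node
    del outer          # triggers a 200,000-level chain of list deallocations
    print("deep dealloc chain survived, courtesy of the trashcan")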

The bad news is that the trashcan mechanism is excruciating, a
long-time source of subtle bugs of its own :-(

Note:  without actual code to run, it's possible that trashcan is
_already_ trying to limit stack depth in this specific case, but the
stack size is too small to accommodate the largest depth of recursive
calls trashcan is willing to allow before interfering.  Or some bug
snuck into trashcan, or its invoking code, that causes it to fail to
work as intended due to some unknown detail of the code above.
There's just no way to guess without code to provoke it.


> Another take away from this is that it appears possible to cause our gc
> to go into linear recursion.

As above, it's probably not gc, but deallocation as a side effect of
Py_DECREF dropping a refcount to 0.  gc is just one way Py_DECREF can
get invoked.


Re: [Python-Dev] PEP 556 threaded garbage collection & linear recursion in gc

2019-03-27 Thread Tim Peters
[Gregory P. Smith ]
> Good point, I hadn't considered that it was regular common ref
> count 0 dealloc chaining.

It pretty much has to be whenever you see a chain of XXX_dealloc
routines in a stack trace.  gcmodule.c never even looks at a
tp_dealloc slot directly, let alone directly invoke a deallocation
method.  That all happens indirectly, as a result of what Py_DECREF
does.  Then once you're first inside one tp_dealloc method, gc is
completely irrelevant - it's that tp_dealloc for the top-level
container does its own Py_DECREF on a contained container, which in
turn can do _its_ own Py_DECREF on one of its contained containers
 etc.  You can get an arbitrarily deep stack of XXX_dealloc calls
then, and there's really no other way to get that.

BTW,  "container" here is used in a very broad C-level sense, not a
high-level Python sense:  any PyObject that contains a pointer to a
PyObject is "a container" in the intended sense.

> The processes unfortunately didn't have faulthandler enabled so there wasn't
> much info from where in the python code it happened (now fixed).

It's quite possible that the top-level container was Py_DECREF'ed by
code in gcmodule.c.  But gc gets blamed at first for a lot of stuff
that's not actually its fault ;-)

>  I'll see if anything looks particularly unusual next time I hear of such a 
> report.

The trashcan mechanism is the one and only hack in the code intended
to stop unbounded XXX_dealloc stacks, so that's what needs looking at.
Alas, it's hard to work with because it's so very low-level, and
there's nothing portable that can be relied on about stack sizes or
requirements across platforms or compilers.

Some possibilities:

- The trashcan code is buggy.

- The  maximum container dealloc stack depth trashcan intends to allow
(PyTrash_UNWIND_LEVEL = 50) is too large for the C stack a thread gets
under this app on this platform using this compiler.

- One or more of the specific container types involved in this app's
dealloc chain doesn't use the trashcan gimmick at all, so is invisible
to trashcan's count of how deep the call stack has gotten.

For example, cell_dealloc was in your stack trace, but I see no use of
trashcan code in that function (Py_TRASHCAN_SAFE_BEGIN /
Py_TRASHCAN_SAFE_END).  So the trashcan hack has no idea that
cell_dealloc calls are on the stack.

And likewise for func_dealloc - looks like calls to that are also
invisible to the trashcan.

tupledealloc is cool, though.

IIRC, Christian Tismer introduced the trashcan because code only he
wrote ;-) was blowing the stack when very deeply nested lists and/or
tuples became trash.

From a quick scan of the current code, looks like it was later added
to only a few types that aren't container types in the Python sense.

Which may or may not matter here.  Your stack trace showed a
tupledealloc in one of every three slots, so even if two of every
three slots were invisible to the trashcan, the call stack "should
have been" limited to a maximum of about PyTrash_UNWIND_LEVEL * 3 =
150 XXX_dealloc functions.  But you saw a stack 1000+ levels deep.  So
something else that isn't immediately apparent is also going on.


Re: [Python-Dev] Easier debugging with f-strings

2019-05-07 Thread Tim Peters
[Larry Hastings ]
> Guido just stopped by--we're all at the PyCon 2019 dev sprints--and we had
> a chat about it.  Guido likes it but wanted us to restore a little of the 
> magical
> behavior we had in "!d": now, = in f-strings will default to repr (!r), unless
> you specify a format spec. If you specify a format spec it will always default
> to format.  And naturally if you specify an explicit conversion function (!r 
> !s !a)
> it will use that.
>
> This makes !f irrelevant, so we're removing it.
>
> Here's the thinking: 99% of the time the user will just use {foo=}, and for 
> that
> you want repr.  After that, 0.99% of the time the user will want a format spec
> that applies directly the value.  It's exceedingly unlikely that someone will
> want a format spec, but want it to apply to repr(value) and not value itself. 
>  So
> that's possible (f'{foo=!r:20}').  But the default behavior is the most common
> case at every step.

+1.  Perfect!  I'm one of the 0.99% who will frequently use this to
display floats, and really wants them to show as "0.99%" rather than
"0.9913499340289%".

BTW, that Guido person has made enough decent contributions by now
that I think he should be asked whether he wants to become a core dev!


Re: [Python-Dev] Definition of equality check behavior

2019-05-07 Thread Tim Peters
[Jordan Adler ]
> Through the course of work on the future polyfills that mimic the behavior
> of Py3 builtins across versions of Python, we've discovered that the
> equality check behavior of at least some builtin types do not match the
> documented core data model.
>
> Specifically, a comparison between a primitive (int, str, float were tested)
> and an object of a different type always return False, instead of raising
> a NotImplementedError.  Consider `1 == '1'` as a test case.
>
> Should the data model be adjusted to declare that primitive types are
> expected to fallback to False, or should the cpython primitive type's
> __eq__ implementation fallback to raise NotImplementedError?

Nope ;-)  This isn't a "data model" issue.  Look instead at the
Standard Library manual's section on Built-In Types, under heading
Comparisons:

"""
Objects of different types, except different numeric types, never
compare equal. ...
The <, <=, > and >= operators will raise a TypeError exception when
comparing a complex number with another built-in numeric type, when
the objects are of different types that cannot be compared, or in
other cases where there is no defined ordering.
"""

It's not an object's responsibility to arrange for that.  It's done
for them by default, and objects only need to supply their own rich
comparison methods if they don't want the defaults.  For example, when
comparing an int with another type, all the int rich comparison
methods _do_ return NotImplemented:

>>> f = 4
>>> f.__eq__("abc")
NotImplemented

It's at a higher level that comparison logic says "OK, I gave both
comparands a chance, and they both returned NotImplemented.  So one
last chance (from object.c's do_richcompare())":

/* If neither object implements it, provide a sensible default
   for == and !=, but raise an exception for ordering. */
switch (op) {
case Py_EQ:
res = (v == w) ? Py_True : Py_False;
break;
case Py_NE:
res = (v != w) ? Py_True : Py_False;
break;
default:
PyErr_Format(PyExc_TypeError,
 "'%s' not supported between instances of '%.100s'
and '%.100s'",
 opstrings[op],
 v->ob_type->tp_name,
 w->ob_type->tp_name);
return NULL;
}

Then the Py_EQ case of that delivers:

>>> f == "abc"
False

and the Py_NE case:

>>> f != "abc"
True

despite that (or because of that ;-) ):

>>> f.__eq__("abc")
NotImplemented
>>> "abc".__eq__(f)
NotImplemented

Note that there's nothing special about builtin types here.  All types
are treated alike.
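
For example, a toy user-defined class (not from the original report) goes
through exactly the same dance:

    class Thing:
        def __eq__(self, other):
            if not isinstance(other, Thing):
                return NotImplemented        # decline the comparison
            return True

    t = Thing()
    print(t.__eq__("abc"))    # NotImplemented
    print("abc".__eq__(t))    # NotImplemented
    print(t == "abc")         # False - the default fallback kicks in
    print(t != "abc")         # True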


[Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-22 Thread Tim Peters
There's a Stackoverflow report[1] I suspect is worth looking into, but
it requires far more RAM (over 80GB) than I have.  The OP whittled it
down to a reasonably brief & straightforward pure Python 3 program.
It builds a ternary search tree, with perhaps a billion nodes.  The
problem is that it "takes forever" for Python to reclaim the tree
storage when it goes out of scope (the author waited at least hours).

Alas, the OP said it takes about 45 minutes to build the tree, and the
problem goes away if the tree is materially smaller.  So it takes a
long time just to try once.  With a tree about 10x smaller, for me it
takes about 45 seconds for Python to reclaim the tree storage.

The tree is deep enough that the "trashcan" may be implicated, and the
node objects are small enough that obmalloc carves them out of a
(relatively) great many arenas.  Those are two ways in which Python
_may_ be to blame.  The sheer number of objects involved may be
provoking edge cases we normally never see.

But, for a start, it would be good to know if anyone else can actually
reproduce the problem.

[1] 
https://stackoverflow.com/questions/56228799/python-hangs-indefinitely-trying-to-delete-deeply-recursive-object


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-23 Thread Tim Peters
[Inada Naoki ]
> ...
> 2. This loop is cleary hot:
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819

Which is 3 lines of code plus a closing brace.  The OP didn't build
their own Python, and the source from which it was compiled wasn't
available to them.  But they did say that when they got into gdb and
single-stepped, it was looping through the same three lines of code,
in obmalloc.c, over & over.  Which is 100% consistent with the loop
you identified as being "the hot spot".


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-23 Thread Tim Peters
It seems pretty clear now that the primary cause is keeping arenas
sorted by number of free pools. As deallocation goes on, the number of
distinct "# of free pools" values decreases, leaving large numbers of
arenas sharing the same value.  Then, e.g., if there are 10,000 arenas
with 30 free pools remaining, and another pool is freed in one of
them, its free pool count jumps to 31, and on average we have to crawl
over 5,000 of its former peers to locate its new position in the list.

Which is also consistent with what the OP saw, and I did too but in a
much smaller case:  the longer deallocation goes on, the slower it
gets (fewer and fewer Nodes freed per unit time).

Which suggests a different - albeit related - approach:  it's not
really the order of arenas that matters, but the partitioning of
arenas by number of free pools.  So instead of a singly doubly linked
list of arenas, have a vector of doubly linked lists, one list per
possible number of free pools.  That's a fixed and relatively small
number.  For example, even if all arena bytes could be used for pools
(they can't be - there's bookkeeping info at the start of an arena),
the maximum number of free pools in an arena would be 2**18 / 2**12 ==
2**6 == 64.

When a pool is freed, no big deal:  unlink the arena from its current
list, and stick it in the list for the next higher number of free
pools.  This, of course, is constant-time.

Of course there are edge cases (e.g., if the arena is entirely free, it
should be given back to C instead), but they seem pretty obvious, and
I haven't thought of a real problem.

Tedious, though ;-)
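
For what it's worth, a toy sketch of the bucketing idea (the real thing
would be C inside obmalloc, with doubly linked lists of arena structs;
Python sets just keep the sketch short):

    from collections import defaultdict

    class Arena:
        def __init__(self, name, nfreepools=0):
            self.name = name
            self.nfreepools = nfreepools

    # nfreepools -> the arenas currently having that many free pools.
    buckets = defaultdict(set)

    def note_pool_freed(arena):
        """A pool in `arena` just became free: move the arena up one bucket."""
        buckets[arena.nfreepools].discard(arena)
        arena.nfreepools += 1
        buckets[arena.nfreepools].add(arena)      # constant time, no searching

    a = Arena("arena 0")
    buckets[a.nfreepools].add(a)
    for _ in range(3):
        note_pool_freed(a)
    print(a.nfreepools, sorted(k for k, v in buckets.items() if v))   # 3 [3]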


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-24 Thread Tim Peters
[Inada Naoki ]
> For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:

I'm unclear on what "nodes" means.  If you mean you changed 27M to 10M
in this line:

for token in random_strings(27_000_000):

that's fine, but there are about 40 times more than that `Node`
objects created by the program.  So if you did change that to
10_000_000, you'd have about 400M `Node` objects, consuming about 80x
that many bytes of RAM (or about 32GB).

> $ local/bin/python3 t1.py  # default
> 1138.1123778309993 -- end train, start del
> 688.7927911250008 -- end
>
> $ arena-1m/bin/python3 t1.py  # Changed ARENA_SIZE to 1MiB
> 1085.3363994170013 -- end train, start del
> 84.57135540099989 -- end

Nice!  I assume these are seconds.  On Stackoverflow, the OP reported
that boosting ARENA_SIZE the same way cut deallocation time in his
original program to about 13 minutes.


> $ PYTHONMALLOC=malloc local/bin/python3 t1.py
> 1157.4882792789995 -- end train, start del
> 27.91983470674 -- end

While the OP reported, for the original program:

"""
With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but
deallocation only takes 2 minutes!
"""

The "nearly twice as long" for building the tree is in sharp contrast
to what you saw, but there's about 3x as much of everything in the
original program, and, e.g., it's certainly possible malloc is
tripping over fragmentation then that obmalloc avoids.



> $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
> PYTHONMALLOC=malloc local/bin/python3 t1.py
> 1098.4383037820007 -- end train, start del
> 117.93938426599925 -- end
>
> In this case, glibc malloc is the fastest.
> glibc is known to be weak about fragmentation.
> But algorithm to avoid fragmentation is just an overhead in this script.

I can't say.

I'll note that the approach I very briefly sketched before
(restructure the list of arenas to partition it into multiple lists
partitioned by number of free pools) "should make" obmalloc
competitive with malloc here (it would eliminate all searches, except
perhaps (depending on how gonzo the code is) a rare need to search for
"the first" non-empty list).


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-24 Thread Tim Peters
[Tim]
> I'll note that the approach I very briefly sketched before
> (restructure the list of arenas to partition it into multiple lists
> partitioned by number of free pools) "should make" obmalloc
> competitive with malloc here ...

But it's also intrusive, breaking up a simple linked list into a
multi-headed beast.  That affects all code using it, not just the
parts mutating it.

So instead I suggest leaving `usable_arenas` entirely alone, but add a
vector of "search fingers", say,

static struct arenaobject* nfp2lasta[MAX_NUMBER_OF_FREE_POOLS_POSSIBLE + 1];

mapping a count of free pools to (a pointer to) the last arena object
in usable_arenas with that number of free pools.

Then the search loop in py_malloc_free() can be replaced with a single
array lookup:  it leaps directly to where the search loop ends now.
For example, if we free a pool in an arena that had 23 free pools,
then the arena object's count of free pools has to be increased to 24,
and the arena object unlinked from its current position in
usable_arenas and inserted immediately after nfp2lasta[23].  Note that
usable_arenas is a doubly linked list, so there's no _need_ at all to
iterate over it.  Each node in the list knows what's immediately
before and after it.  And since a free pool count can only increase by
1, it's always correct to move the arena immediately after the last
arena with the same original free pool count.

Key invariants:

1. nfp2lasta[n] is NULL if and only if no arena in usable_arenas has
nfreepools == n.

2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena
in usable_arenas with that many free pools.

So, in the example above,  nfp2lasta[23] has to be set to NULL if and
only if the arena in question was the only one with 23 free pools
(which can be determined cheaply via invariant #2).

And nfp2lasta[24] has to be set to point to the arena in question if
and only if nfp2lasta[24] is NULL.

Tricky bit:  if the arena in question is the only one with a given
original free pool count, no pointers in arena objects need to be
changed at all.  Following the sketch above literally would leave you
trying to insert it after itself, which wouldn't end well  ;-)

Anyway, this "feels like a right solution" to me, eliminating all
searches with simple (albeit delicate) constant-time code, and
requiring code changes only where an arena's number of free pools can
change.


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-24 Thread Tim Peters
[Tim]
> Key invariants:
> ...
> 2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena
> in usable_arenas with that many free pools.

Ack!  Scratch that.  I need a nap :-(

In fact if that equality holds, it means that nfp2lasta entry has to
change if pa is moved and pa->prevarena has the same count of free
pools.

So forget about the explanation and just think about the goal - the
details will work themselves out ;-)


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-26 Thread Tim Peters
[Larry Hastings ]
> I have a computer with two Xeon CPUs and 256GB of RAM.  So, even
> though it's NUMA, I still have 128GB of memory per CPU.  It's running a
> "spin" of Ubuntu 18.10.
>
> I compiled a fresh Python 3.7.3 --with-optimizations.  I copied the sample
> program straight off the StackOverflow page.  The program ran for about
> five and a half hours then exited normally.

Thanks, Larry!

> During the run it printed:
>
> This gets printed!
> This doesn't get printed
>
> Statistics reported by "time":
>
> 19811.05s user 123.56s system 99% cpu 5:32:15.04 total
>
> Checking in on it now and then, peak observed memory usage (as reported
> by "top") was just under 80GB.

All of that is consistent with what others reported, although you've
given more quantified details, and that's helpful.

> I take it that the interesting part was confirming that "This doesn't get 
> printed"
> gets printed when you have enough RAM for the program to run to completion.

That was the primary point at first, yes.

> So I guess there's no bug here?  Just an observation about CPython's
> garbage collector being kinda slow?  Or maybe CPython gc + swap =
> super bombad slow?

No need to confirm this:  if you ran it again with a bit more output,
you'd find that the vast bulk of the time was spent between the two
output lines.  That is, it takes hours from the time the train()
function starts to return and completes returning.  That's all
consumed as a side effect of `tree` losing its last reference.

We know why now, and it's hard to say whether it's "a bug".  It's
functioning as designed, but the design sucks ;-)  So I'd call it a
design flaw.

The heuristics that were introduced to try to make it likely we could
return more empty arenas to C were designed when machines had far less
RAM, and can burn time quadratic in the number of arenas.  That really
didn't matter in a world with a few hundred arenas, but can be a
timing disaster in a world with hundreds of thousands (or more) of
arenas.

The Stackoverflow test case was whittled way down from the OP's real
program, which ran "for days" without completing.

I've sketched two (albeit closely related, both to each other and to
what we currently do) ways the worst-case overall time could be cut
from quadratic to linear in the number of arenas, which is
asymptotically optimal.

More ambitious would be to dream up an entirely different heuristic.
Keeping the list of usable arenas fully sorted in order of number of
free pools seems a very strong requirement for a poke-and-hope
heuristic.

But, best I can tell, the person who came up with this heuristic to
begin with didn't check in any motivating test programs.  So we'd be
starting over from scratch.  For this specific Stackoverflow case,
I've confirmed that it doesn't make a real difference if we didn't
bother to sort the list of arenas _at all_ (then it still returns
about the same number of arenas to C, both at the point the tree
finishes building, and at the time it completes tearing down the
tree).

So I have a program showing the current scheme can be a disaster, but
none where it shows it actually helps anything ;-)


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-28 Thread Tim Peters
I made a pull request for this that appears to work fine for my 10x
smaller test case (reduces tear-down time from over 45 seconds to over
7).  It implements my second earlier sketch (add a vector of search
fingers, to eliminate searches):

https://github.com/python/cpython/pull/13612

It would be good to know how it works on the OP's 10x larger test
case, which requires a machine with over 80 GB of RAM.  Hoping it will
reduce tear-down time from hours to minutes (they have about a billion
`Node` objects, so tear-down time will never be trivial).

It would also be good to get a code review, if you have the time and
inclination.  Thanks!


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-30 Thread Tim Peters
The PR for this looks good to go:

https://github.com/python/cpython/pull/13612

But, I still have no idea how it works for the OP's original test
case.  So, if you have at least 80 GB of RAM to try it, I added
`arena.py` to the BPO report:

https://bugs.python.org/issue37029

That adds code to the OP's test case to display the times needed to
build the tree and to tear it down (& to display some obmalloc stats).
So there's no need for you to think about anything ;-)

I'm keen to get feedback on this before merging the PR, because this
case is so very much larger than anything I've ever tried that I'm
wary that there may be more than one "surprise" lurking here.  The PR
certainly addresses "an obvious" (with hindsight) problem - but is
that the _only_ gross design flaw here?


Re: [Python-Dev] Have a big machine and spare time? Here's a possible Python bug.

2019-05-31 Thread Tim Peters
[Tim]
>> I'm keen to get feedback on this before merging the PR, because this
>> case is so very much larger than anything I've ever tried that I'm
>> wary that there may be more than one "surprise" lurking here. ...

[Inada Naoki ]
> I started r5a.4xlarge EC2 instance and started arena.py.
> I will post the result in next 12 hours.

Thank you!  Wrapping this up, Inada attached the results to the bug
report.  For the OP's original big case, the time to reclaim the tree
storage dropped from nearly 5 hours to about 70 seconds.  The time to
build the tree didn't materially change (it was a bit faster after the
patch, but offhand didn't look significantly faster to me).

Since I called this a "design flaw" rather than "a bug", I'm not
inclined to backport it.  If someone else thinks it should be, that's
up to them.  I'm _assuming_ Github won't decide to backport it on its
own - but maybe I'm wrong (Github workflow is still pretty magical to
me).


[Python-Dev] obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-01 Thread Tim Peters
[Antoine Pitrou, replying to Thomas Wouters]
> Interesting that a 20-year simple allocator (obmalloc) is able to do
> better than the sophisticated TCMalloc.

It's very hard to beat obmalloc (O) at what it does.  TCMalloc (T) is
actually very similar where they overlap, but has to be more complex
because it's trying to do more than O.

In either case, for small objects "the fast path" consists merely of
taking the first block of memory off a singly-linked size-segregated
free list.  For freeing, the fast path is just linking the block back
to the front of the appropriate free list.  What _could_ be faster?  A
"bump allocator" allocates faster (just increment a highwater mark),
but creates worlds of problems when freeing.
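
Sketched in Python (a toy model - the real thing links raw blocks inside 4K
pools, with no Python objects anywhere near it):

    ALIGNMENT = 16
    freelists = {}                          # size class -> LIFO stack of free blocks

    def size_class(nbytes):
        return (nbytes + ALIGNMENT - 1) // ALIGNMENT

    def tiny_malloc(nbytes):
        stack = freelists.setdefault(size_class(nbytes), [])
        if stack:
            return stack.pop()              # fast path: pop the head of the free list
        return bytearray(size_class(nbytes) * ALIGNMENT)   # slow path: carve a new block

    def tiny_free(block):
        freelists.setdefault(size_class(len(block)), []).append(block)   # push it back

    b = tiny_malloc(100)
    tiny_free(b)
    assert tiny_malloc(100) is b            # freed blocks are reused in MRU order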

But because O is only trying to deal with small (<= 512 bytes)
requests, it can use a very fast method based on trivial address
arithmetic to find the size of an allocated block by just reading it
up from the start of the (4K) "pool" the address belongs to.  T can't
do that - it appears to need to look up the address in a more
elaborate radix tree, to find info recording the size of the block
(which may be just about anything - no upper limit).
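
The "trivial address arithmetic" is just a mask: any address inside a 4K
pool maps to the pool's start, where the pool header (including the size
class) lives.  A sketch with plain ints:

    POOL_SIZE = 4 * 1024                    # one 4K page per pool today
    POOL_MASK = POOL_SIZE - 1

    addr = 0x7F12_3456_7A98                 # some block address (made up)
    pool_base = addr & ~POOL_MASK           # start of the containing pool
    offset = addr & POOL_MASK               # position of the block inside the pool
    print(hex(pool_base), hex(offset))      # 0x7f1234567000 0xa98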

> (well, of course, obmalloc doesn't have to worry about concurrent
> scenarios, which explains some of the simplicity)

Right, T has a different collection of free lists for each thread, so
on each entry it has to figure out which collection to use (and so
doesn't need to lock).  That's not free.  O only has one collection,
and relies on the GIL.

Against that, O burns cycles worrying about something else:  because
it was controversial when it was new, O thought it was necessary to
handle free/realloc calls even when passed addresses that had actually
been obtained from the system malloc/realloc.  The T docs I saw said
"don't do that - things will blow up in mysterious ways".

That's where O's excruciating "address_in_range()" logic comes from.
While that's zippy and scales extremely well (it doesn't depend on how
many objects/arenas/pools exist), it's not free, and is a significant
part of the "fast path" expense for both allocation and deallocation.

It also limits us to a maximum pool size of 4K (to avoid possible
segfaults when reading up memory that was actually obtained from the
system malloc/realloc), and that's become increasingly painful:  on
64-bit boxes the bytes lost to pool headers increased, and O changed
to handle requests up to 512 bytes instead of its original limit of
256.  O was intended to supply "a bunch" of  usable blocks per pool,
not just a handful.  We "should" really at least double the pool and
arena sizes now.

I don't think we need to cater anymore to careless code that mixes
system memory calls with O calls (e.g., if an extension gets memory
via `malloc()`, it's its responsibility to call `free()`), and if not
then `address_in_range()` isn't really necessary anymore either, and
then we could increase the pool size.  O would, however, need a new
way to recognize when its version of malloc punted to the system
malloc.

BTW, one more:  last I saw T never returns memory to "the system", but
O does - indeed, the parent thread here was all about _enormous_ time
waste due to that in O ;-)  That's not free either, but doesn't affect
O's fast paths.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Antoine Pitrou ]
> The interesting thing here is that in many situations, the size is
> known up front when deallocating - it is simply not communicated to the
> deallocator because the traditional free() API takes a sole pointer,
> not a size.  But CPython could communicate that size easily if we
> would like to change the deallocation API.  Then there's no bother
> looking up the allocated size in sophisticated lookup structures.

That could work (to make it possible to increase obmalloc's pool
size).  Except ...

> I'll note that jemalloc provides such APIs:
> http://jemalloc.net/jemalloc.3.html
>
> """The dallocx() function causes the memory referenced by ptr to be
> made available for future allocations.
>
> The sdallocx() function is an extension of dallocx() with a size
> parameter to allow the caller to pass in the allocation size as an
> optimization."""

obmalloc doesn't intend to be a general-purpose allocator - it only
aims at optimizing "small" allocations, punting to the system for
everything beyond that.  Unless the size is _always_ passed in (on
every free() and realloc() spelling it supports), an "optimization"
doesn't help it much.  It needs a bulletproof way to determine whether
it, or system malloc/realloc, originally obtained an address passed
in.  If the size is always passed in, no problem (indeed, a single bit
could suffice).  But if it's ever possible that the size won't be
passed in, all the runtime machinery to figure that out on its own
needs to be able to handle all addresses.

Like now:  if the size were passed in, obmalloc could test the size
instead of doing the `address_in_range()` dance(*).  But if it's ever
possible that the size won't be passed in, all the machinery
supporting `address_in_range()` still needs to be there, and every
obmalloc spelling of malloc/realloc needs to ensure that machinery
will work if the returned address is passed back to an obmalloc
free/realloc spelling without the size.

The "only"problem with address_in_range is that it limits us to a
maximum pool size of 4K.  Just for fun, I boosted that to 8K to see
how likely segfaults really are, and a Python built that way couldn't
even get to its first prompt before dying with an access violation
(Windows-speak for segfault).

Alas, that makes it hard to guess how much value there would be for
Python programs if the pool size could be increased - can't even get
Python started.

We could eliminate the pool size restriction in many ways.  For
example, we could store the addresses obtained from the system
malloc/realloc - but not yet freed - in a set, perhaps implemented as
a radix tree to cut the memory burden.  But digging through 3 or 4
levels of a radix tree to determine membership is probably
significantly slower than address_in_range.
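
A toy sketch of that kind of radix tree (dicts standing in for the flat
per-level arrays a real one would use):

    CHUNK_BITS = 16
    LEVELS = 4                               # 4 levels of 16 bits cover a 64-bit address

    def _path(addr):
        return [(addr >> (CHUNK_BITS * i)) & 0xFFFF for i in reversed(range(LEVELS))]

    root = {}

    def track(addr):                         # record an address handed out by system malloc
        node = root
        for key in _path(addr):
            node = node.setdefault(key, {})
        node["here"] = True

    def tracked(addr):                       # the membership test: one dict hop per level
        node = root
        for key in _path(addr):
            node = node.get(key)
            if node is None:
                return False
        return "here" in node

    track(0x7F3A_9C00_0040)
    print(tracked(0x7F3A_9C00_0040), tracked(0x5555_0000_0010))   # True False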

I can think of a way to do it slightly faster than (but related to)
address_in_range, but it would (on a 64-bit box) require adding 24
more bytes for each system-malloc/realloc allocation.  8 of those
bytes would be pure waste, due to that the Alignment Gods appear to
require 16-byte alignment for every allocation on a 64-bit box now.

In stark contrast, the extra memory burden of the current
address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and
"should be" larger now).

Another approach:  keep address_in_range as-is, but add new internal
structure to larger pools, to repeat the arena index every 4KB.  But
that fights somewhat against the goal of larger pools.

Etc. ;-)

(*) Actually not quite true.  If a "large" object is obtained from
obmalloc now (meaning it actually came from the system malloc), then
cut back to a "small" size by a realloc, it _remains_ under the
control of the system malloc now.  Passing in the newer "small" size
to a free() later would cause obmalloc to get fatally confused about
that.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Antoine Pitrou ]
> But my response was under the assumption that we would want obmalloc to
> deal with all allocations.

I didn't know that.  I personally have no interest in that:  if we
want an all-purpose allocator, there are several already to choose
from.  There's no reason to imagine we could write a better one.

> Which is more or less required anyway to have an efficient GC that doesn't
> have to walk linked lists and access memory in random order to iterate over
> known objects.

As the parent thread showed, obmalloc does at least as well as any
general-purpose allocator known _for Python's purposes_ (a great many
(de)allocations of "small" objects).  Already explained too that
size-segregated linked free lists are the core of _why_ it does so
well.  Besides making carving off, and freeing, blocks dirt cheap,
linked lists also naturally support recycling memory blocks in MRU
("freshness in cache") order.

But I don't know what you mean by "access memory in random order to
iterate over known objects".  obmalloc never needs to iterate over
known objects - indeed, it contains no code capable of doing that.
Our cyclic gc does, but that's independent of obmalloc.  Over on
Discourse, Neil is speculating about using radix trees for cyclic gc
instead of _its_ linked lists  In obmalloc, allocated memory regions
aren't linked at all.  It's free regions that are linked, and
helpfully so in MRU order.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-06 Thread Tim Peters
[Tim]
>> But I don't know what you mean by "access memory in random order to
>> iterate over known objects".  obmalloc never needs to iterate over
>> known objects - indeed, it contains no code capable of doing that..
>> Our cyclic gc does, but that's independent of obmalloc.

[Antoine]
> It's not.  Cyclic GC needs its own linked lists *because* the allocator
> doesn't allow it to iterate over allocated objects.

The doubly linked lists in gc primarily support efficient
_partitioning_ of objects for gc's purposes (a union of disjoint sets,
with constant-time moving of an object from one set to another, and
constant-time union of disjoint sets).  "All objects" is almost never
interesting to it (it is only when the oldest non-frozen generation is
being collected).

Between collections, the partitioning is by generation.

During a collection, the youngest generations are combined into one,
and then that's sub-partitioned in various ways as collection goes
along, ending with a partition into reachable and unreachable objects.
In between, ephemeral partitions are constructed (e.g., the set of
"tentatively unreachable" objects).

None of that was driven by obmalloc's (or malloc's) inability to
iterate over objects.  Doubly linked lists were the obvious way to
implement the required operations on partitions efficiently and
simply.
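
A toy model of that, if it helps (Python objects standing in for the
C-level gc_head links):

    class Node:
        def __init__(self, name):
            self.name = name
            self.prev = self.next = self

    class Partition:
        def __init__(self):
            self.head = Node("<head>")          # sentinel; the list is circular

        def add(self, node):                    # O(1) insert at the tail
            tail = self.head.prev
            tail.next = node
            node.prev = tail
            node.next = self.head
            self.head.prev = node

        def __iter__(self):
            n = self.head.next
            while n is not self.head:
                yield n.name
                n = n.next

    def move(node, dest):                       # O(1): unlink from wherever it is, relink
        node.prev.next = node.next
        node.next.prev = node.prev
        dest.add(node)

    young, old = Partition(), Partition()
    a, b = Node("a"), Node("b")
    young.add(a); young.add(b)
    move(a, old)
    print(list(young), list(old))               # ['b'] ['a']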

In any case, it appears to be "a feature" now that people can use any
flavor of the malloc family they like in CPython, so I expect that any
attempt to tie cyclic gc to a specific flavor of malloc would be a
very hard sell.  Which, BTW, was the intended meaning of
"independent":  cyclic gc right now couldn't care less which version
of malloc a user plugs in - nor could obmalloc care less which cyclic
gc algorithm is used.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-07 Thread Tim Peters
To be clearer, while knowing the size of allocated objects may be of
some use to some other allocators, "not really" for obmalloc.  That
one does small objects by itself in a uniform way, and punts
everything else to the system malloc family.  The _only_ thing it
wants to know on a free/realloc is which of those two ways delivered
the address to begin with.

Knowing the original size would be an indirect way to accomplish that,
but it really only needs 1 bit of info.  If you're using a radix tree
to implement - in effect - a dict mapping addresses to "stuff", 1 flag
bit in the "stuff" would be ideal for that.  If you haven't already, I
assume you'll soon come around to believe you really should be
tracking the addresses of all gc'ed objects (not just the "small"
ones).

> If the above idea works, we know the object size at free() and realloc(), we 
> don't
> need address_in_range() for those code paths.

Two things:  first, a single bit would not only be sufficient, it
would be _better_ than knowing the object size.  If the bit says the
object came from the system, the size is of no use at all.  If the bit
says it came from obmalloc, what's _really_ wanted is the index of the
object's "size class" in the vector of free lists, and that's already
directly available in the object's pool header (various parts of which
have to be read up &/or updated regardless).  Second, if that bit were
available, `address_in_range()` could be thrown away - obmalloc's free
and realloc entries are the only places it's used (or is of any
conceivable use to obmalloc).

For the current obmalloc, I have in mind a different way (briefly: let
a pool span a number of 4K pages, but teach pools about the page
structure so that address_in_range() continues to work after trivial
changes (read the arena index from the containing page's start rather
than from the containing pool's)).  Not ideal, but looks remarkably
easy to do, and captures the important part (more objects in a pool ->
more times obmalloc can remain in its fastest "all within the pool"
paths).

> My gut feeling is that the prev/next pointer updates done by
> move_unreachable() and similar functions must be quite expensive.
> Doing the traversal with an explicit stack is a lot less elegant but
> I think it should be faster.  At least, when you are dealing with a
> big set of GC objects that don't fit in the CPU cache.

At the start, it was the _reachable_ objects that were moved out of
the collected generation list rather than the unreachable objects.
Since it's (in all my experience) almost all objects that survive
almost all collections in almost all programs, that was an enormous
amount of moving overall.  So long I ago I rewrote that code to move
the _un_reachable objects instead.

Debug output showed countless billions of pointer updates saved across
the programs I tried at the time, but the timing difference was in the
noise.

Probably a big part of "the problem":  when collecting the oldest
generation in a program with "a lot" of objects, no approach at all
will "fit in cache".  We have to crawl the entire object graph then.
Note that the test case in the parent thread had over a billion class
instances, and it was way whittled down(!) from the OP's real program.

But I don't _think_ that's typical quite yet ;-) , and:

> You are likely correct. I'm hoping to benchmark the radix tree idea.
> I'm not too far from having it working such that it can replace
> address_in_range().  Maybe allocating gc_refs as a block would
> offset the radix tree cost vs address_in_range().

I certainly agree gc_refs is the big-bang-for-the-buck thing to
target.  There's no way to avoid mutating that bunches and bunches of
times, even in small programs, and reducing the number of dirtied
cache lines due to that "should" pay off.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-09 Thread Tim Peters
[Tim]
> For the current obmalloc, I have in mind a different way ...
> Not ideal, but ... captures the important part (more objects
> in a pool -> more times obmalloc can remain in its
> fastest "all within the pool" paths).

And now there's a PR that removes obmalloc's limit on pool sizes, and,
for a start, quadruples pool (and arena!) sizes on 64-bit boxes:

https://github.com/python/cpython/pull/13934

https://bugs.python.org/issue37211

As the PR says,

"""
It would be great to get feedback from 64-bit apps that do massive
amounts of small-object allocations and deallocations.
"""

:-)


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-10 Thread Tim Peters
[Neil Schemenauer ]
> I've done a little testing the pool overhead.  I have an application
> that uses many small dicts as holders of data.  The function:
>
> sys._debugmallocstats()
>
> is useful to get stats for the obmalloc pools.  Total data allocated
> by obmalloc is 262 MB.  At the 4*PAGE_SIZE pool size, the wasted
> space due to partly filled pools is only 0.18%.  For 16*PAGE_SIZE
> pools, 0.71%.
>
> I have a set of stats for another program.  In that case, total
> memory allocated by obmalloc is 14 MB.  For 4*PAGE_SIZE pools,
> wasted space is 0.78% of total.  At 16*PAGE_SIZE, it is 2.4%.

Definitely a tradeoff here:  increasing the number of pages per pool
is a pure win for objects of very popular size classes, but a pure
loss for objects of unpopular size classes.  New comments about
POOL_SIZE in the patch briefly hint at that.

> Based on that small set of data, using 4*PAGE_SIZE seems
> conservative.

Which is a nice way of saying "pulled out of Tim's ass" ;-)

> As I'm sure you realize, making pools bigger will
> waste actual memory, not just virtual address space because you
> write the arena pointer to each OS page.

Yes, but mostly no!  It remains the case that obmalloc neither writes
nor reads anything in an arena until a malloc/realloc call actually
needs to return a pointer to a never-before accessed page.  Writing
the arena index is NOT done to pages when the arena is allocated, or
even when a new pool is carved off an arena, but lazily, one page at a
time, lowest address to highest, as fresh pages actually need to be
returned to the caller.

So arena pointers aren't actually written more frequently than when
using pools of just one page, as is done now (which _still_ writes the
arena index into each page).

Subtlety:  currently, if you pass a nonsense address to
address_in_range that happens to be in one of the arena's pools,
address_in_range() says "yes".  However, after the patch, it may well
return "no" if the address is in one of the pool's pages that hasn't
yet been returned to a malloc/realloc caller (in which case the arena
index hasn't yet been stored at the start of the page).

I don't care about that, because it's 100% a logic error to pass an
address to free/realloc that wasn't obtained from a previous
malloc/realloc call.  So it's a change in undefined behavior.

> I want to do performance profiling using Linux perf.  That should
> show where the hotspot instructions are in the obmalloc code.  Maybe
> that will be useful to you.

Would be good to know, yes!  But may also depend on the app.

> Another thought about address_in_range(): some operating systems
> allow you to allocate memory at specific alignments.  Or, you can
> even allocate a chunk of memory at a fixed memory location if you do
> the correct magic incantation.  I noticed that Go does that.  I
> imagine doing that has a bunch of associated challenges with it.
> However, if we could control the alignment and memory location of
> obmalloc arenas, we would not have the segv problem of
> address_in_range().  It's probably not worth going down that path
> due to the problems involved.

I'm unclear on how that could be exploited.  It's not addresses that
come from arenas that create segv headaches, it's addresses that come
from the system's malloc/realloc.

If we were, e.g., to pick a maximum amount of address space obmalloc
can use in advance, we could live with a single arena of that size,
and just check whether an address is within it.  All the problems come
from the fact that address space may alternate, any number of times,
between addresses in obmalloc arenas and addresses from the system
malloc's internals.
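
To make that hypothetical concrete, a containment check against a single
big reservation is just one subtraction and compare.  A minimal sketch
(the names and the size of the reservation are invented; nothing like
this exists in obmalloc today):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one contiguous address-space reservation for everything
 * obmalloc will ever hand out. */
#define OBMALLOC_SPAN ((uintptr_t)1 << 40)   /* say, 1 TiB reserved up front */
static uintptr_t obmalloc_base;              /* set once, when the span is reserved */

static inline bool
in_obmalloc_span(const void *p)
{
    /* Unsigned wraparound turns this into a single range check:
     * true only when obmalloc_base <= p < obmalloc_base + OBMALLOC_SPAN. */
    return (uintptr_t)p - obmalloc_base < OBMALLOC_SPAN;
}

The test reads nothing but its own two constants, so it could never fault
on an address that came from the system malloc.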


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Tim Peters
[Inada Naoki, to Neil S]
> Oh, do you mean your branch doesn't have headers in each page?

That's probably right ;-)  Neil is using a new data structure, a radix
tree implementing a sparse set of arena addresses.  Within obmalloc
pools, which can be of any multiple-of-4KiB (on a 64-bit box) size,
every byte beyond the pool header is usable for user data.  In my
patch, there is no new data structure, but it needs to store an "arena
index" at the start of every page (every 4K bytes) within a pool.

I certainly _like_ Neil's code better.  It's clean rather than
excruciatingly tricky.  The question is whether that's enough to
justify the memory burden of an additional data structure (which can
potentially grow very large).  So I've been working with Neil to see
whether it's possible to make it faster than my branch, to give it
another selling point people actually care about ;-)

Should also note that Neil's approach never needs to read
uninitialized memory, so we could throw away decades of (e.g.)
valgrind pain caused by the current approach (which my patch builds
on).


> https://bugs.python.org/issue32846
>
> As far as I remember, this bug was caused by cache thrashing (page
> header is aligned by 4K, so cache line can conflict often.)
> Or this bug can be caused by O(N) free() which is fixed already.

I doubt that report is relevant here, but anyone is free to try it
with Neil's branch.

https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

However, last I looked there Neil was still using 4 KiB obmalloc
pools, all page-aligned.  But using much larger arenas (16 MiB, 16
times bigger than my branch, and 64 times bigger than Python currently
uses).

But the `O(N) free()` fix may very well be relevant.  To my eyes,
while there was plenty of speculation in that bug report, nobody
actually dug in deep to nail a specific cause.  A quick try just now
on my branch (which includes the `O(N) free()` fix) on Terry Reedy's
simple code in that report shows much improved behavior, until I run
out of RAM.

For example, roughly 4.3 seconds to delete 40 million strings in a
set, and 9.1 to delete 80 million in a set.  Not really linear, but
very far from quadratic.  In contrast, Terry saw nearly a quadrupling
of delete time when moving from 32M to 64M strings.

So more than one thing was going on there, but looks likely that the
major pain was caused by quadratic-time arena list sorting.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-14 Thread Tim Peters
[Neil Schemenauer ]
> ...
> BTW, the current radix tree doesn't even require that pools are
> aligned to POOL_SIZE.  We probably want to keep pools aligned
> because other parts of obmalloc rely on that.

obmalloc relies on it heavily.  Another radix tree could map block
addresses to all the necessary info in a current pool header, but that
could become gigantic.  For example, even a current "teensy" 256 KiB
arena can be carved into the order of 16K 16-byte blocks (the smallest
size class).  And finding a pool header now is unbeatably cheap:  just
clear the last 12 address bits, and you're done.
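
For illustration, that's the entire lookup - in the spirit of obmalloc's
POOL_ADDR macro, assuming 4 KiB pools aligned to 4 KiB boundaries (a
throwaway sketch, not the real code):

#include <stdint.h>
#include <stdio.h>

#define POOL_SIZE (1u << 12)                  /* 4 KiB pools, 4 KiB aligned */
#define POOL_ADDR(p) ((uintptr_t)(p) & ~(uintptr_t)(POOL_SIZE - 1))

int main(void)
{
    /* Clearing the low 12 bits of any block address lands on the start of
     * its pool - where the pool header lives - with no memory reads at all. */
    uintptr_t block = (uintptr_t)0x12345678;  /* a made-up block address */
    printf("block %#lx -> pool header %#lx\n",
           (unsigned long)block, (unsigned long)POOL_ADDR(block));
    return 0;
}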


> Here is the matchup of the radix tree vs the current
> address_in_range() approach.
>
> - nearly the same in terms of performance.  It might depend on OS
>   and workload but based on my testing on Linux, they are very
>   close.  Would be good to do more testing but I think the radix
>   tree is not going to be faster, only slower.

People should understand that the only point to these things is to
determine whether a pointer passed to a free() or realloc() spelling
was obtained from an obmalloc pool or from the system malloc family.
So they're invoked once near the very starts of those two functions,
and that's all.

Both ways are much more expensive than finding a pool header (which is
just clearing some trailing address bits).

The current way reads an "arena index" out of the pool header, uses
that to index the file static `arenas` vector to get a pointer to an
arena descriptor, then reads the arena base address out of the
descriptor.  That's used to determine whether the original address is
contained in the arena.  The primary change my PR makes is to read the
arena index from the start of the _page_ the address belongs to instead
(the pool header address is irrelevant to this, apart from that a pool
header is aligned to the first page in a pool).
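
The gist of it, as a compilable sketch (field and function names here only
parallel obmalloc's - the real types carry much more):

#include <stdbool.h>
#include <stdint.h>

#define ARENA_SIZE (256u << 10)         /* 256 KiB arenas in current CPython */

typedef struct { unsigned int arenaindex; } header_t;   /* pool (or, in the PR, page) header */
typedef struct { uintptr_t address; } arena_object;     /* simplified descriptor */

static arena_object *arenas;            /* file-static vector of descriptors */
static unsigned int maxarenas;

/* Read the arena index stored near the address, then test containment. */
static bool
in_range_sketch(const void *p, const header_t *hdr)
{
    unsigned int i = hdr->arenaindex;   /* this read may hit an uninitialized page */
    return i < maxarenas &&
           (uintptr_t)p - arenas[i].address < ARENA_SIZE &&
           arenas[i].address != 0;
}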

The radix tree picks bits out of the address three times to index into
a 3-level (but potentially broad) tree, ending with a node containing
info about the only two possible arenas the original address may
belong to.  Then that info is used to check.
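
A rough sketch of that lookup, with made-up level widths (Neil's actual
node layout and bit counts differ; the point is just the three indexing
steps):

#include <stddef.h>
#include <stdint.h>

#define ARENA_BITS 20                   /* 1 MiB arenas in this sketch */
#define LEVEL_BITS 8                    /* bits consumed per tree level (illustrative) */
#define FANOUT (1u << LEVEL_BITS)

/* A leaf records the (at most) two arenas that can cover this address
 * prefix, since system-allocated arenas needn't be arena-aligned. */
typedef struct { uintptr_t arena_lo, arena_hi; } leaf_t;
typedef struct { leaf_t slots[FANOUT]; } bottom_t;
typedef struct { bottom_t *kids[FANOUT]; } middle_t;

static middle_t *top[FANOUT];           /* broad top level; mostly never touched */

static leaf_t *
radix_lookup(const void *p)
{
    uintptr_t a = (uintptr_t)p >> ARENA_BITS;        /* drop within-arena bits */
    unsigned i1 = (unsigned)(a >> (2 * LEVEL_BITS)) & (FANOUT - 1);
    unsigned i2 = (unsigned)(a >> LEVEL_BITS) & (FANOUT - 1);
    unsigned i3 = (unsigned)a & (FANOUT - 1);

    middle_t *mid = top[i1];
    if (mid == NULL || mid->kids[i2] == NULL)
        return NULL;                    /* definitely not one of obmalloc's arenas */
    return &mid->kids[i2]->slots[i3];   /* caller still checks p against the two arenas */
}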

The number of loads is essentially the same, but the multiple levels
of indexing in the tree is a teensy bit more expensive because it
requires more bit-fiddling.  I spent hours, in all, dreaming up a way
to make the _seemingly_ more complex final "so is the address in one
of those two arenas or not?" check about as cheap as the current way.
But Neil didn't see any significant timing difference after making
that change, which was mildly disappointing but not really surprising:
arithmetic is just plain cheap compared to reading up memory.

> - radix tree uses a bit more memory overhead.  Maybe 1 or 2 MiB on a
>   64-bit OS.  The radix tree uses more as memory use goes up but it
>   is a small fraction of total used memory.  The extra memory use is
>   the main downside at this point, I think.

I'd call it the only downside.  But nobody yet has quantified how bad
it can get.


> - the radix tree doesn't read uninitialized memory.  The current
>   address_in_range() approach has worked very well but is relying on
>   some assumptions about the OS (how it maps pages into the program
>   address space).  This is the only aspect where the radix tree is
>   clearly better.  I'm not sure this matters enough to offset the
>   extra memory use.

I'm not worried about that.  The only real assumption here is that if
an OS supports some notion of "pages" at all, then for any address for
which the program has read/write access (which are the only kinds of
addresses that can be sanely passed to free/realloc), the OS allows
the same access to the entire page containing that address.

In two decades we haven't seen an exception to that yet, right?  It's
hard to imagine a HW designer thinking "I know!  Let's piss away more
transistors on finer-grained control nobody has asked for, and slow
down memory operations even more checking for that." ;-)


> - IMHO, the radix tree code is a bit simpler than Tim's
>   obmalloc-big-pool code.

Absolutely so.  There's another way to look at this:  if Vladimir
Marangozov (obmalloc's original author) had used an arena radix tree
from the start, would someone now get anywhere proposing a patch to
change it to the current scheme?  I'd expect a blizzard of -1 votes,
starting with mine ;-)

> ...
> My feeling right now is that Tim's obmalloc-big-pool is the best
> design at this point.  Using 8 KB or 16 KB pools seems to be better
> than 4 KB.  The extra complexity added by Tim's change is not so
> nice.  obmalloc is already extremely subtle and obmalloc-big-pool
> makes it more so.

Moving to bigger pools and bigger arenas are pretty much no-brainers
for us, but unless pool size is increased there's no particular reason
to pursue either approach - "ain't broke, don't fix".

Larry Hastings started a "The untuned tunable parameter ARENA_SIZE"
thread here about two years ago, where he got a blizzard of 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Tim Peters
[Tim. to Neil]
>> Moving to bigger pools and bigger arenas are pretty much no-brainers
>> for us, [...]

[Antoine]
> Why "no-brainers"?

We're running tests, benchmarks, the Python programs we always run,
Python programs that are important to us, staring at obmalloc stats
... and seeing nothing bad, nothing mysterious, only neutral-to-good
results.  So what's to think about? ;-)  These are 64-bit boxes, with
terabytes of virtual address space.  Asking for a meg of that is just
reserving less than a thousandth of a thousandth of that range of
integers, not actual physical RAM.

>  Bigger pools sound ok,

They're not necessarily independent choices.  Increase pool size
without increasing arena size, and the number of pools per arena
falls.  At the extreme, if pools were made the same size as arenas,
we'd need at least 32 arenas just to start Python (which uses at least
one pool of _every_ possible size class before you hit the prompt -
note that, on 64-bit boxes, the number of possible size classes is
falling from 64 (3.7) to 32 (3.8), due to some desire to make
everything aligned to 16 bytes - which I've already seen account for
some programs needing 100s of MB more RAM).

> but bigger arenas will make Python less likely to return memory to the system.

I know that gets repeated a lot, and I usually play along - but why do
people believe that?  Where's the evidence?

At the start, obmalloc never returned arenas to the system.  The vast
majority of users were fine with that.  A relative few weren't.  Evan
Jones wrote all the (considerable!) code to change that, and I
massaged it and checked it in - not because there was "scientific
proof" that it was more beneficial than harmful (it certainly added
new expenses!) overall, but because it seemed like a right thing to
do, _anticipating_ that the issue would become more important in
coming years.

I'm still glad it was done, but no tests were checked in to _quantify_
its presumed benefits - or even to verify that it ever returned arenas
to the system.  Best I can tell, nobody actually has any informed idea
how well it does.  Evan stared at programs that were important to him,
and fiddled things until he was "happy enough".

Not everyone was.  About five years ago, Kristján Valur Jónsson opened
this report:

https://bugs.python.org/issue21220

suggesting a very different heuristic to try to free arenas.  The
"memcrunch..py" in his patch is the only time I've ever seen anyone
write code trying to measure whether obmalloc's arena-freeing is
effective.

I can verify that if you increase the number of objects in his script
by a factor of 100, my PR _never_ returns an arena to the system.  But
it's clear as mud to me whether his heuristic would help either (with
the 100x smaller number of objects in the original script, the PR
branch does recycle arenas).

So that's the only objective evidence I have :-)

I've looked at obmalloc stats in other programs at various stages, and
saw nothing concerning.  memcrunch.py appears to model object lifetimes
as coming from a uniform distribution, but in real life they appear to
be strongly multi-modal (with high peaks at the "way less than an eye
blink" and "effectively immortal" ends).


> We should evaluate what problem we are trying to solve here,

I'm not trying to solve a problem.  This is a "right thing to do"
thing, anticipating that slinging a massive number of objects on
massive machines will become ever more important, and that changing
20-year-old constants will allow obmalloc to spend more time in its
fastest paths instead of its slowest.  We haven't been especially
pro-active about giant machines, and are suffering from it:

https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/
"""
Update: Here is a “fun” Python performance bug that I ran into the
other day: deleting a set of 1 billion strings takes >12 hours.
Obviously, this particular instance can be fixed, but this exactly the
sort of thing that I would never have done a few years ago. A billion
strings seemed like a lot back then, but now we regularly discuss
multiple Terabytes of input data as “not a big deal”. This may not
apply for your settings, but it does for mine.
"""

That was _probably_ due to obmalloc's move-one-at-a-time way of
keeping its usable arenas list sorted, which sat un-diagnosed for over
a year.

https://bugs.python.org/issue32846

Fixing the underlying cause put giant machines on my radar, and
getting rid of obmalloc's pool size limit was the next obvious thing
that would help them (although not in the same universe as cutting
quadratic time to linear).

> instead of staring at micro-benchmark numbers on an idle system.

My only interest in those is that they're not slowing down, because
that's important too.  The aim here is much more to make life better
for programs slinging millions - even billions - of objects.
obmalloc's internal overheads are frugal, but not free.  For example,
it has to allocate at least 56 bytes of separate bookkeeping info for
each arena.

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-15 Thread Tim Peters
[Tim]
>> At the start, obmalloc never returned arenas to the system.  The vast
>> majority of users were fine with that.

[Neil]
> Yeah, I was totally fine with that back in the day.  However, I
> wonder now if there is a stronger reason to try to free memory back
> to the OS.  Years ago, people would typically have swap space that
> was as large or larger than their real RAM.  So, if the OS had to
> swap out unused pages, it wasn't a big deal.  Now that disks are
> relatively so much slower and RAM is larger, people don't have as
> much swap.  Some Linux systems get setup without any.  Freeing
> arenas seems more important than it used to be.

See my earlier "right thing to do ... anticipating" ;-)  It was clear
enough then that processor speeds were going to continue to increase
much faster than disk speeds, although the _primary_ driver at the
time was the then-small but increasing number of long-running Python
programs running big (for the time) data-crunching problems that
repeatedly went through stages of low and high memory need.


> OTOH, I don't think obmalloc should try too hard. The whole point of
> the small object allocator is to be really fast.  Anti-fragmentation
> heuristics are going to slow it down.  As far as I'm concerned, it
> works well enough as it is.

Freeing arenas _already_ slowed it down.  All the cruft to keep arenas
sorted by number of free pools was far from free.  Before that,  free
pools for a given size class were singly-linked together directly
(regardless of which arenas they were in), and we just used the head
of that list.   Arenas weren't involved at all.  Indeed, pools didn't
even point back to their arenas.

So major code and data changes were needed to implement the "take from
the non-full pool in the most heavily populated arena" heuristic.  The
other heuristic worth considering is "take from the non-full pool with
the smallest arena address", which directly strives to let the
higher-addressed arenas free up.

Since they're both just poke-&-hope, it's hard to out-think them.  But
it's possible one would consistently work better than the other, and
that even _which_ one could change depending on arena and pool sizes.
And should note that "poke-&-hope" is the best that can be hoped for,
since it's impossible to predict the lifetime of an object when space
for it is requested, and we can't move the object after its address is
returned to the caller.

More amusing:  address_in_range() didn't exist either at first.
Instead a pool header contained its own address and a "magic"
constant.  So, e.g.,

if (pool->pooladdr != pool || pool->magic != POOL_MAGIC) {
    // This ain't ours!  Pass it to the system malloc family instead.
}

That check was unlikely to believe something that _had_ come from the
system malloc really belonged to obmalloc, but it proved very easy to
trick into making that error, even from pure Python code.

Good times ;-)


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Antoine]
> We moved from malloc() to mmap() for allocating arenas because of user
> requests to release memory more deterministically:
>
> https://bugs.python.org/issue11849

Which was a good change!  As was using VirtualAlloc() on Windows.
None of that is being disputed.  The change under discussion isn't
about mmap - mmap only incidentally gets sucked in here because it's
part of _the_ very slowest of obmalloc's paths.  Again, I'm aiming at
_all_ of obmalloc's slower paths, on all platforms, at once.  mmap
isn't my focus.

That doesn't preclude anyone who cares lots about mmap from adding
more complication to cater specifically to it.

> And given the number of people who use Python for long-running
> processes nowadays, I'm sure that they would notice (and be annoyed) if
> Python did not release memory after memory consumption spikes.

The PR changes nothing about the "release arenas" heuristic or
implementation.  There's just unquantified speculation that boosting
arena size, on 64-bit boxes, from a trivial 256 KiB to a slightly less
trivial 1 MiB, may be disastrous.  The only evidence for that so far
is repetition of the speculation ;-)

>> ...
>> We haven't been especially
>> pro-active about giant machines, and are suffering from it:
>>
>> https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/

> So you're definitely trying to solve a problem, right?

By my definition of "a problem", no.  I have no quantified goals or
any criteria to say "and now it's solved".  I have open-ended concerns
about how Python will fare on giant machines slinging billions of
objects, and want to make things better _before_ users are provoked to
complain.  Which they'll do anyway, of course.  I'm not trying to
change human nature ;-)

> So the question becomes: does the improvement increasing the pool
> and arena size have a negative outcome on *other* use cases?

Which is why Neil & I have been running benchmarks, tests, Python
programs we run all the time, Python programs we care about ... and
actively soliciting feedback from people actually trying the code on
programs _they_ care about.

> Not everyone has giant machines.  Actually a frequent usage model is to
> have many small VMs or containers on a medium-size machine.

Something I've never done, so am wholly unqualified to judge.  I don't
even know what "many", "small", or "medium-size" might mean in this
context.  And I don't have a giant machine either, but spent most of
my professional career in the "supercomputer" business so at least
understand how those kinds of people think ;-)


>> For example, it has to allocate at least 56 bytes of separate bookkeeping 
>> info
>> for each arena.  Nobody cares when they have 100 arenas, but when there
>> are a million arenas (which I've seen), that adds up.

> In relative terms, assuming that arenas are 50% full on average
> (probably a pessimistic assumption?), that overhead is 0.08% per arena
> memory used.  What point is worrying about that?

You're only looking at one cost.  Those bytes aren't just address
reservations, they consume actual physical RAM.  The bookkeeping info
is periodically read, and mutated, over time.  In aggregate, that's
more than enough bytes to blow away an entire L3 cache.  The less of
that stuff needs to be read and written (i.e., the fewer arenas there
are), the less pressure that puts on the faster layers (caches) of the
memory hierarchy.

That bookkeeping info is also immortal:  all the bookkeeping
arena_object structs live a single contiguously allocated vector.  It
may move over time (via realloc()), but can never shrink, only grow.
Asking the platform malloc/realloc for a 50 MB chunk sucks on the face
of it :-(

So while the 50M is first-order trivial when a million arenas are in
use, if the program enters a new phase releasing almost all of it,
leaving (say) only 10 arenas still active, the 50 M is still there,
effectively wasting 200 arenas' worth of "invisible" (to
_debugmallocstats()) space forever.

About typical arena usage, I expect, but don't know, that 50% is quite
pessimistic.  It's a failure of project management (starting with me)
that _every_ step in the "free arenas" evolution was driven by a user
complaining about their program, and that nothing was ever checked in
to even ensure their problem remained solved, let alone to help find
out how effective arena-releasing is in an arbitrary program.  We've
been flying blind from the start, and remain blind.

That said, over the decades I've often looked at obmalloc stats, and
have generally been pleasantly surprised at how much of allocated
space is being put to good use.  80% seems more typical than 50% to me
based on that.

It's easy enough to contrive programs tending toward only 16 bytes in
use per arena, but nobody has ever reported anything like that.  The
aforementioned memcrunch.py program wasn't particularly contrived, but
was added to a bug report to "prove" that obmalloc's arena releasing
strategy 

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-16 Thread Tim Peters
[Inada Naoki ]
> obmalloc is very nice at allocating small (~224 bytes) memory blocks.
> But it seems current SMALL_REQUEST_THRESHOLD (512) is too large to me.

For the "unavoidable memory waste" reasons you spell out here,
Vladimir deliberately set the threshold to 256 at the start.  As
things turned out, though, dicts drove the decision to boost it to 512
anyway.  A comment in the code says:

 * Note: a size threshold of 512 guarantees that newly created dictionaries
 * will be allocated from preallocated memory pools on 64-bit.

That was probably before "compact dicts", but, as-is, obmalloc can
handle small dicts now even if they're not empty.  And small dicts are
common.  Another relevant one pushing dicts to use obmalloc:

https://bugs.python.org/issue23601

> ...
> Another way to fix these problems is shrinking SMALL_REQUEST_THRESHOLD
> to 256 and believe malloc works well for medium size memory blocks.

Alas, it doesn't ;-)  Well, everyone who ever timed it found obmalloc
was significantly faster than the system malloc for every size class
obmalloc handles.  That's partly because obmalloc runs under
protection of the GIL, while the system malloc has to worry about
thread safety.

But, regardless, obmalloc is just plain hard to beat so long as it can
stay in its fast paths (which increasing pool and arena sizes aims at
helping it do).  Its current pool and arena sizes are tiny compared to
comparable structures in the high-profile allocators often mentioned
in these discussions.  However, the latter generally use
platform-specific ways to spell "free this OS page" rather than
waiting for an entire arena-like structure to become empty.
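
For anyone curious what "free this OS page" looks like in practice, here's
a minimal Linux-only sketch using madvise() - nothing obmalloc does today,
and the sizes are arbitrary:

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define PAGE 4096

int main(void)
{
    /* Reserve 16 pages of anonymous memory, roughly what an allocator
     * might grab for one arena-like chunk. */
    char *chunk = mmap(NULL, 16 * PAGE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (chunk == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(chunk, 0xab, 16 * PAGE);     /* fault all the pages in */

    /* Hand one page's worth of physical RAM back to the kernel while keeping
     * the address range reserved; the next touch sees fresh zeroed memory. */
    if (madvise(chunk + 4 * PAGE, PAGE, MADV_DONTNEED) != 0)
        perror("madvise");

    munmap(chunk, 16 * PAGE);
    return 0;
}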

For example, "SuperMalloc"[1] effectively uses 2 MiB pools, in the
sense of "want 16 bytes? Cool, we'll start by devoting 2 MiB to
holding _only_ 16-byte objects.  Now you want 388?  Cool, we'll
reserve another 2 MiB for holding only 388-byte objects.  Etc."

[1] http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf


> ```
> >>> pool_size = 4096 - 48  # 48 is pool header size
> >>> for bs in range(16, 513, 16):
> ... n,r = pool_size//bs, pool_size%bs + 48
> ... print(bs, n, r, 100*r/4096)
> ...
> 16 253 48 1.171875
> 32 126 64 1.5625
> 48 84 64 1.5625
> 64 63 64 1.5625
> 80 50 96 2.34375
> 96 42 64 1.5625
> 112 36 64 1.5625
> 128 31 128 3.125
> 144 28 64 1.5625
> 160 25 96 2.34375
> 176 23 48 1.171875
> 192 21 64 1.5625
> 208 19 144 3.515625
> 224 18 64 1.5625
> 240 16 256 6.25
> 256 15 256 6.25
> 272 14 288 7.03125
> 288 14 64 1.5625
> 304 13 144 3.515625
> 320 12 256 6.25
> 336 12 64 1.5625
> 352 11 224 5.46875
> 368 11 48 1.171875
> 384 10 256 6.25
> 400 10 96 2.34375
> 416 9 352 8.59375
> 432 9 208 5.078125
> 448 9 64 1.5625
> 464 8 384 9.375
> 480 8 256 6.25
> 496 8 128 3.125
> 512 7 512 12.5
> ```
>
> There are two problems here.
>
> First, pool overhead is at most about 3.5 % until 224 bytes.
> But it becomes 6.25% at 240 bytes, 8.6% at 416 bytes, 9.4% at 464
> bytes, and 12.5% at 512 bytes.

Note that in sys._debugmallocstats() output, the total of the
"unusable for this reason" space is called "bytes lost to
quantization",  It's rarely a big deal in my experience, but certainly
tends to account for a higher percentage of arena space as the average
size of used objects increases.


> Second, some size classes have the same number of memory blocks.
> Class 272 and 288 have 14 blocks.  320 and 336 have 12 blocks.
> It reduces utilization of pools.  This problem becomes bigger on 32bit 
> platform.

I don't understand the "32bit platform" comment.  On 32-bit boxes,
alignment is to 8 byte boundaries (not 16 as above), and there are 64
size classes (not 32 as above).  Which tends to make space efficiency
better, not worse, than on 64-bit Python (16-byte alignment is a waste
for the vast majority of objects Python programs use, which rarely
require better than pointer alignment (on 64-bit boxes) or double
alignment (on 32-bit ones)).


> Increasing pool size is one obvious way to fix these problems.

Yes & no ;-)  In Neil's radix-tree branch your code above works fine
when the pool is increased.  But my PR doesn't introduce a new tree,
building on what obmalloc already does:  it absolutely needs to
"steal" 4 bytes at the start of every 4 KiB page - which turns into 16
bytes on 64-boxes to preserve 16-bit alignment.  The number of usable
blocks in a pool is computed by this function in the PR:

/* Return total number of blocks in pool of size index `i`, as a uint. */
static uint
NUMBLOCKS(int i)
{
    assert(0 <= i && i < NB_SMALL_SIZE_CLASSES);
    /* The first page burns space for a pool header, and remaining pages
     * burn ALIGNMENT bytes for the arena index.
     */
    const uint size = INDEX2SIZE(i);
    uint usable1 = SYSTEM_PAGE_SIZE - POOL_OVERHEAD;
    uint usable2 = SYSTEM_PAGE_SIZE - ALIGNMENT;
    return usable1 / size + (usable2 / size) * (PAGES_PER_POOL - 1);
}

For a 16-KiB pool, this allows us to (e.g.) squeeze out 6 more 16-byte
objects than we can 
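
To make the block counts concrete, here's the NUMBLOCKS() arithmetic above
worked for the smallest (16-byte) size class, as a standalone check using
the constants the PR implies (4 KiB pages, 48-byte pool header, 16-byte
alignment, 4 pages per 16 KiB pool):

#include <stdio.h>

int main(void)
{
    const unsigned page = 4096, pool_overhead = 48, alignment = 16;
    const unsigned size = 16, pages_per_pool = 4;    /* a 16 KiB pool */

    unsigned first_page  = (page - pool_overhead) / size;          /* 253 */
    unsigned later_pages = ((page - alignment) / size)             /* 255 */
                           * (pages_per_pool - 1);                 /* x3 = 765 */
    unsigned big_pool    = first_page + later_pages;               /* 1018 */
    unsigned four_small  = 4 * first_page;     /* 1012 in four 4 KiB pools */

    printf("16 KiB pool: %u blocks; four 4 KiB pools: %u; gain: %u\n",
           big_pool, four_small, big_pool - four_small);
    return 0;
}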

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Inada Naoki]
>> Increasing pool size is one obvious way to fix these problems.
>> I think 16KiB pool size and 2MiB (huge page size of x86) arena size is
>> a sweet spot for recent web servers (typically, about 32 threads, and
>> 64GiB), but there is no evidence about it.

[Antoine]
> Note that the OS won't give a huge page automatically, because memory
> management becomes much more inflexible then.
>
> For example, the Linux madvise() man page has this to say about
> MADV_HUGEPAGE:
>
>   This feature is primarily aimed at  applications  that
>   use large mappings of data and access large regions of
>   that memory at a time  (e.g.,  virtualization  systems
>   such as QEMU).  It can very easily waste memory (e.g.,
>   a 2 MB mapping that only ever  accesses  1  byte  will
>   result  in  2 MB  of  wired memory instead of one 4 KB
>   page).  See the Linux kernel  source  file  Documenta‐
>   tion/vm/transhuge.txt for more details.
>
> I'm not sure a small objects allocator falls into the right use case
> for huge pages.

The SuperMalloc paper I recently pointed at notes that it uses huge
pages only for "huge" requests.  Not for "small", "medium", or "large"
requests.

But it carves up 2 MiB chunks, aligned at 2 MiB addresses, for each
size class anyway (which use 4K pages).

There are a mix of reasons for that.  Partly for the same reasons I
want bigger pools and arenas:  to stay in the fastest code paths.
Hitting page/arena/chunk boundaries costs cycles for computation and
conditional branches, and clobbers cache lines to access & mutate
bookkeeping info that the fast paths don't touch.

Also to reduce the fraction of allocator space "wasted" on bookkeeping
info.  48 header bytes out of a 4K pool is a bigger percentage hit
than, say, two 4K pages (to hold fancier allocator bookkeeping data
structures) out of a 2M chunk.

And partly for the same reason Neil is keen for bigger arenas in his
branch:  to reduce the size of data structures to keep track of other
bookkeeping info (in Neil's case, a radix tree, which can effectively
shift away the lowest ARENA_BITS bits of addresses it needs to store).

Which hints at much of why it wants "huge" chunks, but doesn't explain
why it doesn't want huge pages except to satisfy huge requests.
That's because it strives to be able to release physical RAM back to
the system on a page basis (which is also part of why it needs fancier
bookkeeping data structures to manage its chunks - it needs to keep
track of which pages are in use, and apply page-based heuristics to
push toward freeing pages).

So that combines very much larger "pools" (2M v 4K) with better
chances of actually returning no-longer-used pages to the system (on a
4K basis rather than a 256K basis).  But it's built on piles of
platform-specific code, and isn't suitable at all for 32-bit boxes
(it relies on virtual address space being an abundant resource on
64-bit boxes - reserving 2M of address space is close to trivial, and
could potentially be done millions of times without getting into
trouble).


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Tim]
> ...
> Here are some stats from running [memcrunch.py] under
> my PR, but using 200 times the initial number of objects
> as the original script:
>
> n = 2000 #number of things
>
> At the end, with 1M arena and 16K pool:
>
> 3362 arenas * 1048576 bytes/arena  =3,525,312,512
> # bytes in allocated blocks=1,968,233,888
> ...
>  With the larger arenas, none [arenas] were ever released.

...

> BTW, anyone keen to complicate the mmap management should first take
> this recent change into account::
>
> https://bugs.python.org/issue37257
>
> That appears to have killed off _the_ most overwhelmingly common cause
> of obmalloc counter-productively releasing an arena only to create a
> new one again milliseconds later.
>
> My branch, and Neil's, both contain that change, which makes it much
> harder to compare our branches' obmalloc arena stats with 3.7.  It
> turns out that a whole lot of "released arenas" under 3.7 (and will
> still be so in 3.8) were due to that worse-than-useless arena
> thrashing.

To illustrate, I reverted that change in my PR and ran exactly same
thing.  Wow - _then_ the 1M-arena-16K-pool PR reclaimed 1135(!) arenas
instead of none.  Almost all worse than uselessly.  The only one that
"paid" was the last:  the run ended with 3361 arenas still in use
instead of 3362.  Because with the change, one entirely empty arena
remained on the usable_arenas list.

So, word to the wise:  when looking at _debugmallocstats() output, like:

# arenas allocated total   =4,496
# arenas reclaimed =1,135
# arenas highwater mark=3,362
# arenas allocated current =3,361
3361 arenas * 1048576 bytes/arena  =3,524,263,936

the number "reclaimed" isn't really telling us much:  before 3.9, it
may be telling us only how many times obmalloc wasted time on useless
arena thrashing.

The _important_ bit is the difference between "highwater mark" and
"allocated current".  That's how much peak arena address reservation
declined.  In this run, it only managed to release one empty arena
from the peak (which the actual PR does not release, because bpo-37257
changed this to keep (at most) one empty arena available for reuse).


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
Heh.  I wasn't intending to be nasty, but this program makes our arena
recycling look _much_ worse than memcrunch.py does.  It cycles through
phases.  In each phase, it first creates a large randomish number of
objects, then deletes half of all objects in existence.  Except that
every 10th phase, it deletes 90% instead.  It's written to go through
100 phases, but I killed it after 10 because it was obviously going to
keep on growing without bound.

Note 1:  to do anything deterministic with obmalloc stats these days
appears to require setting the envar PYTHONHASHSEED to 0 before
running (else stats vary even by the time you get to an interactive
prompt).

Note 2:  there are 3 heavily used size classes here, for ints,
2-tuples, and class instances, of byte sizes 32, 64, and 96 on 64-bit
boxes, under my PR and under released 3.7.3.

First with my branch, after phase 10 finishes building objects:

phase 10 adding 9953410
phase 10 has 16743920 objects

# arenas allocated total   =3,114
# arenas reclaimed =0
# arenas highwater mark=3,114
# arenas allocated current =3,114
3114 arenas * 1048576 bytes/arena  =3,265,265,664

# bytes in allocated blocks=3,216,332,784

No arenas have ever been reclaimed, but space utilization is excellent
(about 98.5% of arenas are being used by objects).

Then after phase 10 deletes 90% of everything still alive:

phase 10 deleting factor 90% 15069528
phase 10 done deleting

# arenas allocated total   =3,114
# arenas reclaimed =0
# arenas highwater mark=3,114
# arenas allocated current =3,114
3114 arenas * 1048576 bytes/arena  =3,265,265,664

# bytes in allocated blocks=  323,111,488

Still no arenas have been released, and space utilization is horrid.
A bit less than 10% of allocated space is being use for objects.

Now under 3.7.3.  First when phase 10 is done building:

phase 10 adding 9953410
phase 10 has 16743920 objects

# arenas allocated total   =   14,485
# arenas reclaimed =2,020
# arenas highwater mark=   12,465
# arenas allocated current =   12,465
12465 arenas * 262144 bytes/arena  =3,267,624,960

# bytes in allocated blocks=3,216,219,656

Space utilization is again excellent.  A significant number of arenas
were reclaimed - but usefully?  Let's see how things turn out after
phase 10 ends deleting 90% of the objects:

phase 10 deleting factor 90% 15069528
phase 10 done deleting

# arenas allocated total   =   14,485
# arenas reclaimed =2,020
# arenas highwater mark=   12,465
# arenas allocated current =   12,465
12465 arenas * 262144 bytes/arena  =3,267,624,960

# bytes in allocated blocks=  322,998,360

Didn't manage to reclaim anything!  Space utililization is again
horrid, and it's actually consuming a bit more arena bytes than when
running under the PR.

Which is just more of what I've been seeing over & over:  3.7.3 and
the PR both do a fine job of recycling arenas, or a horrid job,
depending on the program.

For excellent recycling, change this program to use a dict instead of a set.  So

data = {}

at the start, fill it with

data[serial] = Stuff()

and change

data.pop()

to use .popitem().

The difference is that set elements still appear in pseudo-random
order, but dicts are in insertion-time order.  So data.popitem() loses
the most recently added dict entry, and the program is then just
modeling stack allocation/deallocation.

def doit():
    import random
    from random import randrange
    import sys

    class Stuff:
        # add cruft so it takes 96 bytes under 3.7 and 3.8
        __slots__ = tuple("abcdefg")

        def __hash__(self):
            return hash(id(self))

    LO = 5_000_000
    HI = LO * 2
    data = set()
    serial = 0
    random.seed(42)

    for phase in range(1, 101):
        toadd = randrange(LO, HI)
        print("phase", phase, "adding", toadd)
        for _ in range(toadd):
            data.add((serial, Stuff()))
            serial += 1
        print("phase", phase, "has", len(data), "objects")
        sys._debugmallocstats()
        factor = 0.5 if phase % 10 else 0.9
        todelete = int(len(data) * factor)
        print(f"phase {phase} deleting factor {factor:.0%} {todelete}")
        for _ in range(todelete):
            data.pop()
        print("phase", phase, "done deleting")
        sys._debugmallocstats()

doit()

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-17 Thread Tim Peters
[Tim]
> ...
> Now under 3.7.3.  First when phase 10 is done building:
>
> phase 10 adding 9953410
> phase 10 has 16743920 objects
>
> # arenas allocated total   =   14,485
> # arenas reclaimed =2,020
> # arenas highwater mark=   12,465
> # arenas allocated current =   12,465
> 12465 arenas * 262144 bytes/arena  =3,267,624,960
>
> # bytes in allocated blocks=3,216,219,656
>
> Space utilization is again excellent.  A significant number of arenas
> were reclaimed - but usefully?

Nope!  Digging through all the output, all the arena recycling
happened in the "_add_ millions of objects" stages, never in the
"delete millions of objects" stages.  So it was just more accidental
arena thrashing (which was recently repaired in bpo-37257).


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Tim Peters
And one more random clue.

The memcrunch.py attached to the earlier-mentioned bug report does
benefit a lot from changing to a "use the arena with the smallest
address" heuristic, leaving 86.6% of allocated bytes in use by objects
at the end (this is with the arena-thrashing fix, and the current
256K/4K arena/pool sizes).  Which isn't surprising, else the person
who opened that report wouldn't have attached that specific script ;-)

However, when I increase its number of starting objects by a factor of
200, changing the heuristic barely helped at all.  It managed to
reclaim a few dozen more arenas by the end (out of 13,556 total arenas
allocated), but left only 56.1% of arena space in use by objects.

For the later program I posted, which by accident-rather-than-design
is even worse for arena cycling, changing the heuristic didn't help at
all.  Either way, no arenas were released (although one empty arena is
retained in the usable_arenas list), and less than 10% of arena space
was in use at the end of phase 10.

A few things to note, then:

- For truly effective RAM releasing, we would almost certainly need to
make major changes, to release RAM at an OS page level.   256K arenas
were already too fat a granularity.

- Whether a program benefits from what we do now appears to rely by
far most on its patterns of allocation/deallocation, rather than on
the arena-releasing heuristic OR on the arena size.

- While the by-lowest-address heuristic is obviously more effective in
one known artificial program so far ;-) , it's more expensive than
what we do now.  That's a full-blown sorting problem.  Ordering by
number of free pools is not:  that's essentially a bucket-distribution
problem, since the number of _possible_ values is small.  The data
structure changes I already checked in truly take constant time, in
all cases, to restore sorted order when an arena's number of free
pools changes.  When sorting by address, the best that can be done in
general is to take O(log(A)) time to insert a new arena (where A is
the number of arenas).  And a linked list is no good for that either
(unless, e.g., it's made fancier, like a skip list).

The quick experiment I tried used one-at-time search to put an arena
in the by-address-ordered list (the submitted patch also did), and we
already know from experience that can become a timing disaster when
hundreds of thousands of arenas are in use.  OTOH, that only needs to
be done once each time an arena is added to usable_arenas, not every
time an arena's number of free pools changes.

So, overall:

- I'm not inclined to pursue changing the arena-freeing heuristic.

- Changes to keep track of OS page-level use would likely require
starting over from scratch.  But would also make using far larger
pools practical (on 64-bit boxes).

- I still haven't found a program where changing from 256K to 1M
arenas makes a lick of appreciable difference in arena recycling
effectiveness:  for a given program chewing on a given problem size,
"does pretty well" or "does horribly" has been the outcome either way.


[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Tim Peters
[Tim]
> - For truly effective RAM releasing, we would almost certainly need to
> make major changes, to release RAM at an OS page level.   256K arenas
> were already too fat a granularity.

We can approximate that closely right now by using 4K pools _and_ 4K
arenas:  one pool per arena, and mmap()/munmap() are then called on
one page at a time.

[Don't try this at home ;-)  There are subtle assumptions in the code
that there are at least two pools in an arena, and those have to be
overcome first.]

For memcrunch.py, using 200x the original initial objects, this works
quite well!  Note that this still uses our current release-arenas
heuristic:  the only substantive change from the status quo is setting
ARENA_SIZE to POOL_SIZE (both 4 KiB - one page).

# arenas allocated total   =  873,034
# arenas reclaimed =  344,380
# arenas highwater mark=  867,540
# arenas allocated current =  528,654
528654 arenas * 4096 bytes/arena   =2,165,366,784

# bytes in allocated blocks=1,968,234,336
# bytes in available blocks=  141,719,280
5349 unused pools * 4096 bytes =   21,909,504
# bytes lost to pool headers   =   25,118,640
# bytes lost to quantization   =8,385,024
# bytes lost to arena alignment=0
Total  =2,165,366,784

So, at the end, space utilization is over 90%:

1,968,234,336 / 2,165,366,784 = 0.90896117

OTOH, an even nastier version of the other program I posted isn't
helped much at all, ending like so after phase 10:

# arenas allocated total   =1,025,106
# arenas reclaimed =   30,539
# arenas highwater mark=1,025,098
# arenas allocated current =  994,567
994567 arenas * 4096 bytes/arena   =4,073,746,432

# bytes in allocated blocks=  232,861,440
# bytes in available blocks=2,064,665,008
424741 unused pools * 4096 bytes   =1,739,739,136
# bytes lost to pool headers   =   27,351,648
# bytes lost to quantization   =9,129,200
# bytes lost to arena alignment=0
Total  =4,073,746,432

So space utilization is under 6%:

232,861,440 / 4,073,746,432 = 0.0571615

Believe it or not, that's slightly (but _only_ slightly) better than
when using the current 256K/4K arena/pool mix, which released no
arenas at all and ended with

232,861,440 / 4,199,022,592 = 0.05545611

utilization.

So:

- There's substantial room for improvement in releasing RAM by
tracking it at OS page level.

- But the current code design is (very!) poorly suited for that.

- In some non-contrived cases it wouldn't really help anyway.

A natural question is how much arena size affects final space
utilization for memcrunch.py.  Every successive increase over one pool
hurts, but eventually it stops mattering much.  Here are the possible
power-of-2 arena sizes, using 4K pools, ending with the smallest for
which no arenas get reclaimed:

1,968,234,336 / 2,165,366,784 = 0.90896117
528654 arenas * 4096 bytes/arena   =2,165,366,784
# bytes in allocated blocks=1,968,234,336

1,968,234,336 / 2,265,399,296 = 0.86882447
276538 arenas * 8192 bytes/arena   =2,265,399,296
# bytes in allocated blocks=1,968,234,336

1,968,235,360 / 2,441,314,304 = 0.80621957
149006 arenas * 16384 bytes/arena  =2,441,314,304
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 2,623,799,296 = 0.75014707
80072 arenas * 32768 bytes/arena   =2,623,799,296
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 2,924,216,320 = 0.67308131
44620 arenas * 65536 bytes/arena   =2,924,216,320
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,299,475,456 = 0.59652978
25173 arenas * 131072 bytes/arena  =3,299,475,456
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,505,913,856 = 0.56140437
13374 arenas * 262144 bytes/arena  =3,505,913,856
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,552,051,200 = 0.55411233
6775 arenas * 524288 bytes/arena   =3,552,051,200
# bytes in allocated blocks=1,968,235,360

1,968,235,360 / 3,553,624,064 = 0.55386707
3389 arenas * 1048576 bytes/arena  =3,553,624,064
# bytes in allocated blocks=1,968,235,360

Most of the damage was done by the time we reached 128K arenas, and
"almost all" when reaching 256K.

I expect that's why I'm not seeing much of any effect (on arena
recycling effectiveness) moving from the current 256K/4K to the PR's
1M/16K.  256K/4K already required "friendly" allocation/deallocation
patterns for the status quo to do real good, and 256K alread

[Python-Dev] Re: radix tree arena map for obmalloc

2019-06-18 Thread Tim Peters
[Tim]
>> - For truly effective RAM releasing, we would almost certainly need to
>> make major changes, to release RAM at an OS page level.   256K arenas
>> were already too fat a granularity.

[also Tim]
> We can approximate that closely right now by using 4K pools _and_ 4K
> arenas:  one pool per arena, and mmap()/munmap() are then called on
> one page at a time.

Sorry about this:  there was another exceedingly subtle "there's more
than one pool in an arena" assumption in the code, which didn't cause
anything to fail, but did cause it to hold on to empty arenas in some
cases.  Repaired that, and then:

> For memcrunch.py, using 200x the original initial objects, this works
> quite well!
> ...
> So, at the end, space utilization is over 90%:
>
> 1,968,234,336 / 2,165,366,784 = 0.90896117

Which changed to the slightly better:

1,968,235,360 / 2,143,465,472 = 0.91824916


> OTOH, an even nastier version of the other program I posted isn't
> helped much at all, ending like so after phase 10:
> ...
> 232,861,440 / 4,073,746,432 = 0.0571615

And that one enjoyed a relatively huge improvement, to:

232,861,440 / 2,334,990,336 = 0.09972694

But none of that changes any of the bottom-level observations or speculations.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-21 Thread Tim Peters
[Tim]
>> I don't think we need to cater anymore to careless code that mixes
>> system memory calls with obmalloc calls (e.g., if an extension gets memory
>> via `malloc()`, it's its responsibility to call `free()`), and if not
>> then `address_in_range()` isn't really necessary anymore either, and
>> then we could increase the pool size.  obmalloc would, however, need a new
>> way to recognize when its version of malloc punted to the system
>> malloc.

[Thomas Wouters ]
> Is this really feasible in a world where the allocators can be selected (and
> the default changed) at runtime?

I think so.  See the "Memory Management" section of the Python/C API
Reference Manual.  It's always been "forbidden" to, e.g., allocate a
thing with PyMem_New() but release it with free().  Ditto mixing a
PyMem_Raw... allocator with a PyMem... deallocator, or PyObject...
one.  Etc.

A type's tp_dealloc implementation should damn well know which memory
family the type's allocator used.

However, no actual proposal on the table changes any "fact on the
ground" here.  They're all as forgiving of slop as the status quo.

> And what would be an efficient way of detecting allocations punted to
> malloc, if not address_in_range?

_The_ most efficient way is the one almost all allocators used long
ago:  use some "hidden" bits right before the address returned to the
user to store info about the block being returned.  Like 1 bit to
distinguish between "obmalloc took this out of one of its pools" and
"obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc
doesn't care)".  That would be much faster than what we do now.

But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain
alignment, so space overhead becomes 100% for the smallest objects
obmalloc can return :-(
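
A hypothetical sketch of that old trick (the names and tag value are
invented; ALIGNMENT just mirrors the 16-byte figure above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ALIGNMENT 16
#define TAG_SYSTEM 0x5e

static void *
tagged_system_malloc(size_t n)
{
    /* Over-allocate so a tag fits just before the address handed out. */
    unsigned char *raw = malloc(n + ALIGNMENT);
    if (raw == NULL)
        return NULL;
    raw[ALIGNMENT - 1] = TAG_SYSTEM;    /* "this came from the system malloc" */
    return raw + ALIGNMENT;
}

static void
tagged_free(void *p)
{
    unsigned char *user = p;
    if (user == NULL)
        return;
    if (user[-1] == TAG_SYSTEM)
        free(user - ALIGNMENT);         /* undo the offset before freeing */
    /* else: a real allocator would hand the block back to its own pools */
}

int main(void)
{
    char *p = tagged_system_malloc(100);
    if (p == NULL)
        return 1;
    strcpy(p, "reading one hidden byte beats two dependent loads");
    printf("%s\n", p);
    tagged_free(p);
    return 0;
}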

Neil Schemenauer takes a different approach in the recent "radix tree
arena map for obmalloc" thread here.  We exchanged ideas on that until
it got to the point that the tree levels only need to trace out
prefixes of obmalloc arena addresses.  That is, the new space burden
of the radix tree appears quite reasonably small.

It doesn't appear to be possible to make it faster than the current
address_in_range(), but in small-scale testing so far speed appears
comparable.


> Getting rid of address_in_range sounds like a nice idea, and I would love to 
> test
> how feasible it is -- I can run such a change against a wide selection of code
> at work, including a lot of third-party extension modules, but I don't see an 
> easy
> way to do it right now.

Neil's branch is here:

 https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

It's effectively a different _implementation_ of the current
address_in_range(), one that doesn't ever need to read possibly
uninitialized memory, and couldn't care less about the OS page size.

For the latter reason, it's by far the clearest way to enable
expanding pool size above 4 KiB.  My PR also eliminates the pool size
limitation:

https://github.com/python/cpython/pull/13934

but at the cost of breaking bigger pools up internally into 4K regions
so the excruciating current address_in_range black magic still works.

Neil and I are both keen _mostly_ to increase pool and arena sizes.
The bigger they are, the more time obmalloc can spend in its fastest
code paths.

A question we can't answer yet (or possibly ever) is how badly that
would hurt Python returning arenas to the system, in long-running apps
the go through phases of low and high memory need.

I don't run anything like that - not a world I've ever lived in.  All
my experiments so far say, for programs that are neither horrible nor
wonderful in this respect:

1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs.

For "horrible" programs, not even tiny 4K arenas help much.

For "wonderful" programs, not even 16 MiB arenas hurt arena recycling
effectiveness.

So if you have real programs keen to "return memory to the system"
periodically, it would be terrific to get info about how changing
arena size affects their behavior in that respect.

My PR uses 16K pools and 1M arenas, quadrupling the status quo.
Because "why not?" ;-)

Neil's branch has _generally_, but not always, used 16 MiB arenas.
The larger the arenas in his branch, the smaller the radix tree needs
to grow.


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Thomas]
>>> And what would be an efficient way of detecting allocations punted to
>>> malloc, if not address_in_range?

[Tim]
>> _The_ most efficient way is the one almost all allocators used long
>> ago:  use some "hidden" bits right before the address returned to the
>> user to store info about the block being returned.

[Antoine]
> There's a fundamental problem here: you can't be sure that all
> allocators reserve such space.  If some allocator doesn't, it can
> return a pointer just at the very start of the page, and if obmalloc
> tries to look at "a few bits before" that address, it could very well
> page-fault.

I snipped some technical but crucial context in my reply to Thomas:
this was assuming users are following the documented requirement to
never mix memory calls from different families.

What you describe certainly could happen in "illegal" code that, e.g.,
obtained a block from the system malloc, but passed it to obmalloc to
free.  Which, in reality, works fine today, although the docs forbid
it.  (I'm not sure, but there _may_ be some mode of running today that
actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc
malloc/realloc spelling is called, and it needs to punt to a
different allocator, no problem:  first it boosts the size request so
it has room to store "the bit" it needs before the address it actually
returns to the client.  Then it's "legal" only to free that memory
with an obmalloc spelling of free() later - obmalloc reads "the bit",
sees "oh - that's not my memory!", and adjusts the pointer accordingly
on _its_ call to the spelling of free() corresponding to the memory
family obmalloc() used to get the memory to begin with.
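
A Python-flavored sketch of that round trip (the names and the toy "raw
heap" here are invented for exposition; the real thing is C pointer
arithmetic inside obmalloc):

ALIGNMENT = 16
RAW_TAG = 1
raw_heap = bytearray(1 << 20)   # stand-in for the raw-malloc family's memory
raw_next = 0
raw_freed = []

def raw_malloc(n):
    global raw_next
    base, raw_next = raw_next, raw_next + n
    return base

def raw_free(base):
    raw_freed.append(base)

def ob_malloc_punt(nbytes):
    # boost the size request so there's room for the tag just before the
    # address actually returned to the client
    base = raw_malloc(ALIGNMENT + nbytes)
    raw_heap[base] = RAW_TAG
    return base + ALIGNMENT

def ob_free(addr):
    if raw_heap[addr - ALIGNMENT] == RAW_TAG:   # "oh - that's not my memory!"
        raw_free(addr - ALIGNMENT)              # back up to the true base
    else:
        pass                                    # normal pool free path (omitted)

p = ob_malloc_punt(1000)
ob_free(p)
assert raw_freed == [p - ALIGNMENT]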

>> Neil Schemenauer takes a different approach in the recent "radix tree
>> arena map for obmalloc" thread here. ...

> One open question is the cache efficiency of the two approaches.
> Intuitively, address_in_range() will often look at exactly the same
> cache line (since a significant number of allocation will share the
> same "page prefix").

I believe we haven't seen a program yet that used more than one node
at the tree's top level :-)  But who knows?  mmap() and VirtualAlloc()
don't appear to make any _guarantees_ that the high-order bits of
returned addresses aren't entirely random.  In real life so far, they
always appear to be zeroes.

While x86 has a 64-bit virtual address space, the hardware "only"
supports 48 bits of virtual address space, and I haven't seen a
virtual address yet where any of the top 16 bits are set.

AMD requires that the top 16 bits of virtual addresses be copies of bit 47.

Blah blah blah - for the foreseeable future, the top level of the tree
has a very easy job.

And Neil keenly observed that the top level of the tree can be
_declared_  as being very broad (and so suck up a lot of the leading
bits), because it's a file static and is effectively just an address
space reservation (at least on Linux) until nodes in it actually get
used.

>  Apparently, this benefit may be offset by cache
> aliasing issues.  Cache aliasing can also be mitigated by the fact that
> CPU caches are most of the time N-way with N > 1 (but N generally
> remains small, from 2 to 8, for L1 and L2 caches).
>
> So I guess the a priori answer is "it's complicated" :-)

Indeed it is.

> I must also thank both you and Neil for running these experiments, even
> though sometimes I may disagree about the conclusions :-)

Well, there aren't any conclusions yet - just seeing the same things repeatedly.

Over the weekend, Neil ran many variations of a longish-running "real
job" related to his work, which goes through phases of processing bulk
database operations and "trying to" release the space (details are
complicated).

Arena recycling was essentially non-existent in either branch (my PR
or the radix tree).

In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but
on closer examination they were virtually all of the "useless arena
thrashing" flavor.  The arenas in use were almost always within one of
the highwater mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB.

Which is just another instance of the "256 KiB arenas are already way
too fat for effective arena recycling unless the program is
exceptionally friendly in its dynamic memory use patterns"
observation.

We haven't seen a case yet where 1 MiB arenas do materially worse than
256 KiB ones in this respect.

Speed is generally a wash between the branches, although they
consistently appear to be faster (by a little, not a lot) than 3.7.3.

The radix tree generally appears to be a little more memory-frugal
than my PR (presumably because my need to break "big pools" into 4K
chunks, while the tree branch doesn't, buys the tree more space to
actually store objects than it costs for the new tree).

We need more "real jobs", and especially ones where arena recycling is
actually doing good in 3.7.3 (wh

[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-23 Thread Tim Peters
[Tim]
> The radix tree generally appears to be a little more memory-frugal
> than my PR (presumably because my need to break "big pools" into 4K
> chunks, while the tree branch doesn't, buys the tree more space to
> actually store objects than it costs for the new tree).

It depends a whole lot on the size classes of the most popular
objects.  A program below to compute it all.  For a 64-bit box using
3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
SQ 1012  1.2%
PR 1018  0.6%
RT 1021  0.3%

SQ is the status quo:  we have to use four separate 4 KiB pools, and
each burns 48 bytes for a pool header.

PR:  my PR.  There's one pool spanning 4 pages, with 48 bytes for a
pool header in the first page, and 16 bytes to store the arena index
in each of the other 3 pages.

RT:  the radix tree.  One 16 KiB block that only "wastes" 48 bytes for
the pool header.

The first number on each line is the number of objects we can get from
a "big pool" under that scheme.  The second number is the % of total
pool space that cannot be used for client objects.

So, in the above, PR is a substantial improvement over SQ, and RT a
less substantial improvement over PR.  Regardless of size class, PR
never does worse than SQ, and RT never worse than PR.

The most dramatic difference is in the largest size class:

size 512
SQ   28 12.5%
PR   28 12.5%
RT   31  3.1%

RT is a huge win there.  And while it's generally true that RT's
advantages are more striking in the larger size classes, it doesn't
always crush.  For example, in the 2nd-largest size class, it doesn't
matter at all which scheme is used:

size 496
SQ   32  3.1%
PR   32  3.1%
RT   32  3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
SQ   32  6.2%
PR   32  6.2%
RT   34  0.4%

I trust the general principle at work here is too obvious to need
explanation ;-)

Anyway, these differences can really add up when there are millions of
objects passed out from a size class where RT has an advantage.
That's very attractive to me.

On the other hand, this _effectively_ make arenas even larger (they
can contain more objects), which apparently makes it even less likely
that arenas can eventually be returned to the system.

But, on the third hand, I've seen no evidence yet that increasing
arena size matters much at all to that after hitting 128 KiB arenas
(smaller than what we already use).  "Uncooperative" programs
essentially recycle nothing regardless, and "happy" programs
essentially recycle almost as many arena bytes with 1 MiB arenas as
with 8 KiB arenas.

Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16 # change to 8 for < Python 3.8

PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size) +
            from_block(size, PAGE_SIZE, ALIGNMENT)
            * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)

for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                  ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"{tag} {nobj:4} {waste/POOL_SIZE:5.1%}")
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/S5KMU6M6GZACRNFCF3TNPE7NKDCMQD5E/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-06-24 Thread Tim Peters
[Antoine Pitrou ]
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/

Thanks!  From a quick skim, most of it is addressing things obmalloc doesn't:

1) Efficient thread safety (we rely on the GIL).

2) Directly handling requests of all sizes (we only deal with <= 512 bytes).

3) Funky stuff to help refcounting languages that want to guarantee
   that freeing an object takes bounded time (the problem is that large
   objects can contain an unbounded number of other objects that
   will then also die - so mimalloc has complications to interact with
   client-supplied functions that parcel out the tree of objects killed
   by a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an
alternative to my PR and Neil's radix tree.  If we insist that
programs play by the rules, and obmalloc controls every piece of
memory it hands out, then the new definition of address_in_range()
becomes

#define address_in_range(p, pool) true

;-)

This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this:

"""
Many allocators use a single free list per size class which can lead
to bad spatial locality where objects belonging to a single structure
can be spread out over the entire heap.
...
To improve the spatial locality of allocation, mimalloc use free list
sharding where the heap is divided into pages (per size-class) with a
free list per page (where pages are usually 64KiB for small objects).
"""

Which is exactly what obmalloc already does, except we call their
"pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x
smaller (my PR).

The call our "arenas" "segments", and theirs are a minimum of 4 MiB.

Their motivations for making these "large" are the same as mine:
maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on:  within a pool, obmalloc has two ways to find
the next free block.  The pool has its own block free list (as in
mimalloc), but _also_ has a "bump allocator":  The pool is carved into
blocks one at a time, low address to high, whenever the pool's free
list is NULL.  This was part of Vladimir's desire to never touch a
piece of memory before it's actually needed to satisfy a client
request.

But it does complicate and slow the code on the second-fastest
allocation path (when the pool's free list _is_ NULL).  The
conditional "OK, the free list is NULL - so is there room for another
bump allocation?" is a waste of time after all the pool's blocks have
been passed out at least once.  The bump allocator can never succeed
again then, but we keep testing for it forever.
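
In toy Python terms (field names invented for the sketch), a pool's two
ways of finding the next free block look roughly like this:

class Pool:
    def __init__(self, nblocks):
        self.freeblock = []       # blocks handed back by free()
        self.nextoffset = 0       # bump allocator: next never-yet-used block
        self.nblocks = nblocks

    def malloc(self):
        if self.freeblock:                    # fastest path
            return self.freeblock.pop()
        if self.nextoffset < self.nblocks:    # second-fastest path: carve out
            i = self.nextoffset               # a block never touched before
            self.nextoffset += 1
            return i
        return None                           # pool has no free block left

    def free(self, i):
        self.freeblock.append(i)

pool = Pool(nblocks=4)
blocks = [pool.malloc() for _ in range(4)]    # all from the bump allocator
pool.free(blocks[1])
print(pool.malloc())   # 1: straight off the free list
print(pool.malloc())   # None: the free list is empty, and the bump test -
                       # which can never succeed again - was paid for anyway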

The alternative is to break the pool into blocks, and link them into a
free list, in one gulp when a size class is assigned to a free pool.
That mutates all the blocks in the pool (to store the list pointers),
even if all but one will never be used.  In return, the code for
testing whether the bump allocator can find room, and the pool header
members supporting the bump allocator, could be thrown out.  That's
what mimalloc appears to do, and they said it sped things up a bit
overall.

But I think I'll leave that one for someone else ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RSBCIDD6YBDSEPQIBGTOZHVS63PS7LTU/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Tim Peters
[Antoine Pitrou ]
>> Ah, interesting.  Were you able to measure the memory footprint as well?

[Inada Naoki ]
> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect to some
> benchmarks.  I will look it later.
>
> ```
> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
> slower (+158%)
> ...

Could you say more about what's being measured here?  Like, if this is
on Linux, is it reporting RSS?  VSS?  Something else?

mimalloc is "most like" obmalloc among those I've looked at in recent
weeks.  I noted before that its pools (they call them "pages") and
arenas (called "segments") are at least 16x larger than obmalloc uses
(64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
in mimalloc).

Like all "modern" 64-bit allocators, it cares little about reserving
largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
But it also releases (in some sense, like MADV_FREE) physical RAM
back to the system at a granularity far smaller than arena/segment.

At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
vector at the start, so no program linking to that can consume less
than half a gig of address space.  But it only expects to _use_ a few
4 KiB OS pages out of that.  mimalloc doesn't do anything anywhere
near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
will - "total virtual memory" on Windows is a system-wide resource,
limited to the sum of physical RAM + pagefile size - no "overcommit"
is allowed).
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ES3BFCBXS7N56XGUHHSOPHRT3UAEGKVA/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-04 Thread Tim Peters
[Victor Stinner ]
> I guess that INADA-san used pyperformance --track-memory.
>
> pyperf --track-memory doc:
> "--track-memory: get the memory peak usage. it is less accurate than
> tracemalloc, but has a lower overhead. On Linux, compute the sum of
> Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
> ...

So I'll take that as meaning essentially that it's reporting what RSS
would report if it ignored shared pages (so peak # of private pages
actually backed by physical RAM).

Clear as mud how MADV_FREE affects that.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/6H7HP6TXKGZBIPVNBTLULGYEDFJKVFCQ/


[Python-Dev] Re: Removing dead bytecode vs reporting syntax errors

2019-07-05 Thread Tim Peters
[Pablo Galindo Salgado ]
> Recently, we moved the optimization for the removal of dead code of the form
>
> if 0:
>   
>
> to the ast so we use JUMP bytecodes instead (being completed in PR14116). The
> reason is that currently, any syntax error in the block will never be 
> reported.
> For example:

"Any syntax error" is way overstated.  "Almost all" syntax errors will
be reported.  The ones that won't be aren't "really" about syntax (as
most people view it), but more about restrictions on the context in which
certain statements can appear:

> if 0:
>   return
>
> if 1:
> pass
> else:
> return
>
> while 0:
> return
>
>
> at module level do not raise any syntax error (just some examples),

Whereas in function scope, those are all fine, with "0" or "1".  It's
the "module level" context that matters to those.  Regardless of
context, a syntax error that's actually about syntax ;-) is reported:

if 0:
x +

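To make the distinction concrete, a quick check you can run (whether the
first case raises is exactly the behavior change under discussion, so its
output depends on which side of that change your build is on; the second
case always raises):

cases = [
    ("return at module level under 'if 0:'", "if 0:\n    return\n"),
    ("malformed expression under 'if 0:'",   "if 0:\n    x +\n"),
]
for label, src in cases:
    try:
        compile(src, "<demo>", "exec")
        print(label, "-> compiled fine")
    except SyntaxError as e:
        print(label, "-> SyntaxError:", e.msg)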

> In https://bugs.python.org/issue37500 it was reported
> that after that, code coverage will decrease as coverage.py sees these
> new bytecodes (even if they are not executed). In general,
> the code object is a bit bigger and the optimization now it
> requires an JUMP instruction to be executed, but syntax errors
> are reported.

And I added other gripes to the report.

I have one file (temp.py) using "if 0:" over 400 times.  When I need
to write a small program or example (for, e.g. a StackOverflow
answer), I add a new "if 1:" block at the end, fiddle with it until
it's ready, then change the "1:" to "0:".  So the snippets stay around
forever, but are effectively commented out so have no effect
(unless/until they're needed again).

There are, of course, other ways to comment them out, but none so
convenient.  For example, under the "if 1:" block, the code is already
indented 4 spaces, so can be pasted exactly as-is into a StackOverflow
answer.

But it doesn't really matter what I happen to do:  I'm just one of
millions of users, who have come to rely on this stuff for way over a
decade.

> The discussion on issue 37500 is about if we should prioritize the 
> optimization
> or the correctness of reporting syntax errors. In my opinion,

Which doesn't really make sense unless it's logically _necessary_ to
"pick just one - you can't have both".

> SyntaxErrors should be reported with independence of the value of variables
> (__debug__) or constant as is a property of the code being written not of the
> code being executed. Also, as CPython is the reference implementation of 
> Python,
> the danger here is that it could be interpreted that this optimization is 
> part of the
> language and its behavior should be mirrored in every other Python 
> implementation.

It's been that way for at least 15 years (since Python 2.4), so it's
hard to be worried about that now.  Indeed, Armin Rigo posted the
original example, and he wasn't fooled a bit about intent ;-)

> Elsewhere we have always prioritize correctness over speed or optimizations.

When they're irreconcilably in conflict, and the cost of correctness
isn't "just too much", sure.


> I am writing this email to know what other people think. Should we revert the 
> change
> and not report Syntax Errors on optimized blocks? Someone sees a viable way of
> reporting the errors and not emitting the bytecode for these block?

We should revert the change until someone (not me ;-) ) thinks harder
about whether it's possible to get both.  There doesn't seem to be,
e.g., a conceptual problem with simply throwing away "if 0:" subtrees
from the AST before generating code, or from snipping the "if 1:"
guard off an "if 1:" block.

You started your msg by saying "we moved the optimization", but that's
not so:  the optimization was eliminated.  So just finish "moving" it
;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ULCEMMNQQ657LFEK6C5OX3S7ZL4YAV4E/


[Python-Dev] Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)

2019-07-08 Thread Tim Peters
[Inada Naoki , trying mimalloc]
>>> Hmm, it is not good.  mimalloc uses MADV_FREE so it may affect to some
>>> benchmarks.  I will look it later.

>> ...
>> $ ./python  -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x
>> slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x
>> slower (+127%)

> I think I understand why the mimalloc uses more than twice memory than the
> pymalloc + glibc malloc in logging_format and logging_simple benchmarks.
>
> These two benchmarks does like this:
>
> buf = [] # in StringIO
> for _ in range(10*1024):
> buf.append("important: some important information to be logged")
> s = "".join(buf)  # StringIO.getvalue()
> s.splitlines()
>
> mimalloc uses size segregated allocator for ~512KiB.  And size class
> is determined top three bits.
> On the other hand, list increases capacity by 9/8.  It means next size
> class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).

>  At last, all size classes has1~3 used/cached memory blocks.

No doubt part of it, but hard to believe it's most of it.  If the loop
count above really is 10240, then there's only about 80K worth of
pointers in the final `buf`.  To account for a difference of over 10M,
it would need to have left behind well over 100 _full_ size copies
from earlier reallocs.

In fact, the number of list elements across resizes goes like so:

0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so accounts for less than 1M of
8-byte pointers if none were ever released.  mimalloc will, of course,
add its own slop on top of that - but not a factor of ten's worth.
Unless maybe it's using a dozen such buffers at once?
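
Those numbers come straight from the growth rule listobject.c used at the
time (roughly "the new size, plus about an eighth, plus a small constant" -
stated from memory, so treat the formula as approximate); a few lines
reproduce the sequence and the total:

caps = [0]
cap = 0
for n in range(1, 10 * 1024 + 1):   # the benchmark appends 10*1024 items
    if n > cap:                     # this append would exceed capacity: grow
        cap = n + (n >> 3) + (3 if n < 9 else 6)
        caps.append(cap)
print(caps[:7])          # [0, 4, 8, 16, 25, 35, 46]
print(caps[-4:])         # [7671, 8637, 9723, 10945]
print(f"{sum(caps):,}")  # 96,113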

But does it really matter? ;-)  mimalloc "should have" done MADV_FREE
on the pages holding the older `buf` instances, so it's not like the
app is demanding to hold on to the RAM (albeit that it may well show
up in the app's RSS unless/until the OS takes the RAM away).


> This is almost worst case for mimalloc.  In more complex application,
> there may be more chance to reuse memory blocks.
>
> In complex or huge application, this overhead will become relatively small.
> It's speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
> common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of
that phrase) for smaller objects:  in 3.7 it splits objects of <= 512
bytes into 64 size classes.  mimalloc also has (close to) 64 "not
gigantic" size classes, but those cover a range of sizes over a
thousand times wider (up to about half a meg).  Everything obmalloc
handles fits in mimalloc's first 20 size classes.  So mimalloc
routinely needs more memory to satisfy a "small object" request than
obmalloc does.

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-)  Looks like spectral_norm recycles many
short-lived Python floats at a swift pace.  So memory management
should account for a large part of its runtime (the arithmetic it does
is cheap in comparison), and obmalloc and mimalloc should both excel
at recycling mountains of small objects.  Why is mimalloc
significantly faster?  This benchmark should stay in the "fastest
paths" of both allocators most often, and they both have very lean
fastest paths (they both use pool-local singly-linked sized-segregated
free lists, so malloc and free for both should usually amount to just
popping or pushing one block off/on the head of the appropriate list).

 obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which
thread is doing the freeing (looks cheaper than address_in_range, but
not free).  Perhaps the layers of indirection that have been wrapped
around obmalloc over the years are to blame?  Perhaps mimalloc's
larger (16x) pools and arenas let it stay in its fastest paths more
often?  I don't know why, but it would be interesting to find out :-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/554D4PU6LBBIKWJCQI4VKU2BVZD4Z3PM/


[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-09 Thread Tim Peters
[Inada Naoki ,
 looking into why mimalloc did so much better on spectral_norm]
> I compared "perf" output of mimalloc and pymalloc, and I succeeded to
> optimize pymalloc!
>
> $ ./python bm_spectral_norm.py --compare-to ./python-master
> python-master: . 199 ms +- 1 ms
> python: . 182 ms +- 4 ms
>
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)

Yay!!  Nice :-)


> mimalloc uses many small static (inline) functions.

Too many, if you ask me ;-)  Really, the enormous number of tiny
functions makes the code very hard to follow for me.  OTOH, the tiny
number of enormous functions in pymalloc makes that hard to follow too
:-(

> On the other hand, pymalloc_alloc and pymalloc_free is large function
> containing slow/rare path.

obmalloc.c was written when compiler inlining was barely a thing.
Some restructuring is a fine idea - overdue, in fact.


> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines 
> pymalloc_free.
> But compiler doesn't know which is the hot part in pymalloc_alloc and
> pymalloc_free.
> So gcc failed to chose code to inline.  Remaining part of
> pymalloc_alloc and pymalloc_free
> are called and many push/pop are executed because they contains complex logic.
>
> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
> But I need to use
> "static inline" for pymalloc_alloc and pymalloc_free yet [1].
> Generated assembly is optimized
> well, the hot code is in top of the PyObject_Malloc [2] and PyObject_Free [3].
> But there are many code duplication in PyObject_Malloc and
> PyObject_Calloc, etc...
>
> [1] https://github.com/python/cpython/pull/14674/files
> [2] 
> https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L232-L274
> [3] 
> https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L2-L32
>
> I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

Yes, splitting out the slower paths should be a win.

Note this is one reason I remain keen to increase pool size:  the
bigger the pools, the more objects can be handed out & reclaimed
_from_ the fastest paths.  You've now discovered that "the fastest
paths" are, for pragmatic compiler-is-overwhelmed reasons,
significantly slower than they "should be".


> Except above, there is one more important difference.
>
> pymalloc return free pool to freepool list soon when pool become empty.

Fine by me, & I think it should continue to do so.  Details below.

> On the other hand, mimalloc return "page" (it's similar to "pool" in pymalloc)
> not everytime when it's empty [4].  So they can avoid rebuilding linked list 
> of
> free blocks.

pymalloc never returns anything to the system at smaller than "arena"
granularity, so it's not a big deal at all _merely_ to link and unlink
a pool to/from an arena's freepool list.  There's no possibility that
any byte in the pool can change while the pool is sitting on a
freepools list.

If we freed the last pool of a given size class, and next need to
allocate another object of that size class (most common), it's cheap:
when the pool is unlinked from the pool free list, it sees that the
pool's last size class is the same as the size class being requested,
and skips almost all pool initialization (the pool is already fully
prepared for objects of this size class).

init_pool:
    /* Frontlink to used pools. */
    next = usedpools[size + size]; /* == prev */
    pool->nextpool = next;
    pool->prevpool = next;
    next->nextpool = pool;
    next->prevpool = pool;
    pool->nalloc = 1;
    if (pool->szidx == size) {              // <-- HERE
        /* Luckily, this pool last contained blocks
         * of the same size class, so its header
         * and free list are already initialized.
         */
        bp = pool->freeblock;
        assert(bp != NULL);
        pool->freeblock = *(block **)bp;
        goto success;                       // <-- FASTEST PATH
    }
    // and here there's a mound of code in case
    // the size classes don't match, including code
    // to restart the process of linking the pool's
    // blocks into a free list.

So in the common case, the pool's free list is reused exactly as-is at
the time the pool was linked to the freepool list.

> I think pymalloc should do same optimization.

As above, I believe pymalloc is already optimal in this case.

On the other end, something to consider:  when a block is freed from
an entirely used pool, the pool now contains one free block, and is
linked to the front of usedpools[sizeclass].  So the next request for
a block of that size will come from that pool, which will again cause
the pool to become entirely used.

That's a kind of possible thrashing, and acts against mimalloc's
(sane, to my eyes) desire to _entirely_ use up a given pool ("page")
once it starts pass

[Python-Dev] Re: Optimizing pymalloc (was obmalloc

2019-07-10 Thread Tim Peters
[Inada Naoki]
>> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
>> But I need to use
>> "static inline" for pymalloc_alloc and pymalloc_free yet [1].

[Neil Schemenauer]
> I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO
> enabled.

I like adding those regardless of whether compilers find them helpful:
 they help _people_ reading the code focus on what's important to
speed.  While not generally crucial, speed is important in these very
low-level, very heavily used functions.

Speaking of which, another possible teensy win:  pymalloc's allocation
has always started with:

    if (nbytes == 0) {
        return 0;
    }
    if (nbytes > SMALL_REQUEST_THRESHOLD) {
        return 0;
    }
    size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

    size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
    if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
        return 0;
    }
    size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class
then, although C may not guarantee that.  But it doesn't matter:  if
it maps to "a real" size class, that's fine.  We'll return a unique
pointer into a pymalloc pool then, and "unique pointer" is all that's
required.

An allocation requesting 0 bytes does happen at times, but it's very
rare.  It just doesn't merit its own dedicated test-&-branch.
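
A quick sanity check of the rewrite, in Python (assuming 3.8's 16-byte
alignment, hence 32 small size classes; the modular reduction just mimics
C's unsigned size_t wraparound):

ALIGNMENT_SHIFT = 4                  # 16-byte alignment
NB_SMALL_SIZE_CLASSES = 32           # 512 // 16
SMALL_REQUEST_THRESHOLD = 512

def size_class_old(nbytes):
    if nbytes == 0 or nbytes > SMALL_REQUEST_THRESHOLD:
        return None                  # punt to the raw allocator
    return (nbytes - 1) >> ALIGNMENT_SHIFT

def size_class_new(nbytes):
    fatsize = ((nbytes - 1) % 2**64) >> ALIGNMENT_SHIFT
    return None if fatsize >= NB_SMALL_SIZE_CLASSES else fatsize

assert all(size_class_old(n) == size_class_new(n) for n in range(1, 100_000))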

> Good work looking into this.  Should be some relatively easy
> performance win.

Ditto!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RE44X7IP464I4KDPJPG3LF5NV5P27DHU/


[Python-Dev] Can someone please finish merging GH-13482?

2019-08-06 Thread Tim Peters
https://github.com/python/cpython/pull/13482

is a simple doc change for difflib, which I approved some months ago.
But I don't know the current workflow well enough to finish it myself.
Like:

- Does something special need to be done for doc changes?

- Since this is a 1st-time contributor, does it need a change to the ACKS file?

- Anything else?

I'm sure this is all obvious after you've done it once, but I haven't.
So I'll watch and do it myself next time ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/TFTQCJ7IZNCE3XUWEC4IVHPXBDZIKVEI/


[Python-Dev] Re: Can someone please finish merging GH-13482?

2019-08-06 Thread Tim Peters
[Mariatta ]
>> - Since this is a 1st-time contributor, does it need a change to the ACKS
>> file?
>
> I think the change is trivial enough, the misc/acks is not necessary.

>> - Anything else?
>
> 1. Does it need to be backported? If so, please add the "needs backport to
> .." label.
>
> 2. Add the "🤖 automerge" label. The bot will take care of merging. It
> will use the PR title and description as commit message.
> If the PR title/description is not good enough as commit message, edit it
> first before adding the "🤖 automerge" label.

>> So I'll watch and do it myself next time ;-)
>
> Hopefully the above instructions are clear and simple enough, so I'll let
> you trigger the automerge this time.

So I tried ;-)  No idea what did and didn't work.  I got several of these
messages:

"""
Sorry, I can't merge this PR. Reason: Base branch was modified. Review and
try the merge again..
"""

I'm guessing (don't know) those are about failed auto-backports.  I don't
have time now to try to figure it all out - It seemed to have spawned a
number of new GH PRs for the backports, and I'm thoroughly lost now :-(

The change did show up when I pulled from upstream master, so I expect that
part did work.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/SHIDYDSL43D6ANHGBCGVPMHSGCU5XK2P/


[Python-Dev] Re: Can someone please finish merging GH-13482?

2019-08-07 Thread Tim Peters
[Brett Cannon ]
> We probably need to update https://devguide.python.org/committing/ to
> have a step-by-step list of how to make a merge works and how to
> handle backports instead of the wall of text that we have. (It's already
> outdated anyway, e.g. `Misc/ACKS` really isn't important as git itself
> records the author of the commit and so that can be used with
> `Misc/ACKS` for old commits to gather the list of folks who have"
> contributed.)

Don't put too much weight on my screwups ;-)  I was appalled to hear
that the OP's contribution was still sitting unmerged, and was in a
hurry to resolve that at a time I _had_ no significant time to give to
it.

Mariatta and Terry Reedy finished up what I left undone, so in the end
it's all good :-)

And there's a problem with the GitHub workflow docs that may be unique
to me:  we have helpful layers of automation, but they're shortening a
git workflow I don't understand even without the automation.  With the
automation, I'm just doubly clueless.

That's because I'm old.  My capacity to give a rip about source
control system quirks was apparently entirely used up when I spent
several weeks mastering the intricacies of Mercurial.  Try as I might,
I just haven't been able to force myself to become competent with git.
It's not that I disapprove of git!  It's apparently more that we're
each born with a finite capacity for being _able_ to learn Yet Another
New Source Control System, and mine was used up on YANSCS #8 ;-)

aging-isn't-for-the-optimistic-ly y'rs  - tim
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ACAJVCPHE2N5RNI4BNL2JK6INMJRPZ2N/


[Python-Dev] Re: PEP 572 TargetScopeError

2019-08-08 Thread Tim Peters
[Barry Warsaw ]
> bpo-37757: https://bugs.python.org/issue37757

Really couldn't care less whether it's TargetScopeError or
SyntaxError, but don't understand the only rationale given here for
preferring the latter:

> To me, “TargetScopeError” is pretty obscure and doesn’t give users an
> obvious clue as to what’s going wrong, without searching the interwebs.

Whereas SyntaxError would give no clue whatsoever, and nothing useful
to search for.  In contrast, a search for TargetScopeError would
presumably find a precisely relevant explanation as the top hit
(indeed, it already does today).

I don't care because it's unlikely an error most people will make at
all - or, for those too clever for their own good, make more than once
;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WBWCV7PBDYY6PVSXIX2QX7GKXY54PREU/


[Python-Dev] Re: PEP 572 TargetScopeError

2019-08-09 Thread Tim Peters
[Guido]
> I don't see how this debate can avoid a vote in the Steering Council.

FWIW, I found Nick's last post wholly persuasive:  back off to
SyntaxError for now, and think about adding a more specific exception
later for _all_ cases (not just walrus) in which a scope conflict
isn't allowed (e.g., his example of declaring a function's formal
argument `global` in the function).
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/7ILVNCFSYMJ4HP5INVA34N4DVEKSEYAU/


[Python-Dev] Re: Spammy stuff (was Re: congrats on 3.5! Alas, windows 7 users are having problems installing it)

2019-09-15 Thread Tim Peters
[This is about the mailing list, not about Python development]

That python-dev-owner has gotten two complaints about this message so
far suggests I should explain what's going on ;-)

New list members are automatically moderated.  Their posts sit in a
moderation queue waiting for moderator action.

While python-dev has several "official" moderators, best I can tell
I'm the only one who has reviewed these messages for years.

So - yup! - I approved this spam.  The Subject was legit, the email
address was from Google rather than, e.g., some obscure Russian host,
and the link was to a site with an obviously computer-related name.
"Good enough".  Usually, but not in this case.

So, sorry, but sometimes one of these just will slip through.  I've
since fiddled the list to auto-discard future posts from this address,
but have no ability to purge the spam from the list archives.

There are other not-spam but "low quality" messages I've let through
too, since the list migrated to Mailman 3.  That's because we lost
some very useful moderator functionality (which I've griped about on a
Mailman 3 issue tracker):

- It's no longer possible to forward a message from the queue to a
different address.  So, e.g., rather than throw them away, I've let
through some newbie messages that I would previously have forwarded to
the Python help list.

- It's no longer possible to write custom text for a "Reject" action.
So, again, rather than throw away messages that should obviously be
posted to a different list, with the author getting nothing but a
generic explanation-free "rejected" message, I've let them through.

Yes, I _could_ use my own email account and copy/paste/fiddle to
forward/reply in such cases, independent of Mailman functionality.
And sometimes I do.  But there's only so much time I can give to this,
and some days I settle for clearing the queue just as quickly as
Tim-ly possible.

Some of that will get better if/when Mailman 3 restores lost
functionality, but I doubt anything can be done that stops spam 100%
short of erring on the side of rejecting marginally legit messages
too.  But that's not how my brain is wired, so you would need a
different moderator if that's what you want.

In the meantime, I'll continue cashing  my lavish moderator checks and
just laugh at complaints ;-)

On Sun, Sep 15, 2019 at 12:21 AM  wrote:
>
> Check this article here which will help you out.
> [useless link]
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/3DPJTBS252J3XX5FMFRXTOFWYIUDEO6H/


[Python-Dev] Re: Spammy stuff (was Re: congrats on 3.5! Alas, windows 7 users are having problems installing it)

2019-09-15 Thread Tim Peters
[Tim]
> While python-dev has several "official" moderators, best I can tell
> I'm the only one who has reviewed these messages for years.

I should clarify that!  That's not meant to be a dig at the other
moderators.  I review everything because I'm retired and am near the
computer many hours almost every day.  That's been so for at least a
decade.  So the others rarely get a chance - the queue is very likely
empty if they ever look.  After a few years of that, they may well
stop looking ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/ZZXYE2PPVEURQ23CEOFYS5YPBJR5BDDT/


[Python-Dev] Re: Compacting the Uncompactable

2019-09-28 Thread Tim Peters
Short course:  a replacement for malloc for use in contexts that can't
"move memory" after an address is passed out, but want/need the
benefits of compactification anyway.

Key idea:  if the allocator dedicates each OS page to requests of a
specific class, then consider two pages devoted to the same class,
where `-` is free space, and X and Y are allocated chunks:

page1: X--XX-X---X
page2: --Y--Y--YY-

Because there's no overlap in the offsets of allocated blocks here, we
can copy the 4 Ys into page1 (or Xs into page2).  Then tell the OS to
change its page tables so that the virtual addresses of page1 amd
page2 _both_ map to the physical page page1 referred to.  page2's
physical page can be returned to the OS then.

No _virtual_ address malloc ever handed out needs to change.

The rest is making this all safe & "go fast".
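
In toy Python terms, with the offsets taken from the little diagram above:

# each page maps offset -> live object; the "-" slots are simply absent
page1 = {0: "X", 3: "X", 4: "X", 6: "X", 10: "X"}
page2 = {2: "Y", 5: "Y", 8: "Y", 9: "Y"}

if not page1.keys() & page2.keys():   # no offset is live in both pages
    page1.update(page2)               # copy the 4 Ys into page1's frame
    # ...then remap page2's virtual page onto page1's physical page and
    # return page2's physical page to the OS; no virtual address changes
print(sorted(page1))                  # [0, 2, 3, 4, 5, 6, 8, 9, 10]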

On Sat, Sep 28, 2019 at 2:06 PM MRAB  wrote:
>
> Here's a video about memory fragmentation and compaction that you might
> find interesting:
>
> "Compacting the Uncompactable" by Bobby Powers
> https://www.youtube.com/watch?v=c1UBJbfR-H0
> ___
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at 
> https://mail.python.org/archives/list/python-dev@python.org/message/G6OR45TETKIFZDVWAK5ZGLLFTIC422TG/
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/VMP5RXVQV574SXNJC5SNIJNAL3PYZZC6/


[Python-Dev] Re: Macros instead of inline functions?

2019-12-04 Thread Tim Peters
[Skip Montanaro ]
> ...
> I don't think stable code which uses macros should be changed (though
> I see the INCREF/DECREF macros just call private inline functions, so
> some conversion has clearly been done). Still, in new code, shouldn't
> the use of macros for more than trivial use cases (constant defs,
> simple one-liners) be discouraged at this point?

Within reason, I think so.  Like C's old "register" declaration,
compilers will eventually evolve to make better decisions about what
should be done than humans generally do.

But there are macros that exist just to reduce repetitive, error-prone
typing, and others that set up control flow (like the trashcan macros.
or listobject.c's IFLT).  Which is the "within reason" part above.
There are still fine uses for macros in C.

It's just that faking inline functions is no longer one of them ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/KGU63DELU4YJTVFALCP55SXQIJ2QN5WJ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-15 Thread Tim Peters
[Larry Hastings ]
> As of 3.7, dict objects are guaranteed to maintain insertion order.  But set
> objects make no such guarantee, and AFAIK in practice they don't maintain
> insertion order either.

If they ever appear to, it's an accident you shouldn't rely on.

>  Should they?

>From Raymond, 22 Dec 2017:

https://twitter.com/raymondh/status/944454031870607360
"""
Sets use a different algorithm that isn't as amendable to retaining
insertion order. Set-to-set operations lose their flexibility and
optimizations if order is required. Set mathematics are defined in
terms of unordered sets. In short, set ordering isn't in the immediate
future.
"""

Which is more an answer to "will they?" than "should they?" ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WMIZRFJZXI3CD5YMLOC5Z3LXVXD7HM4R/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Guido]
> ...
> the language should not disappoint them, optimization opportunities be damned.

I would like to distinguish between two kinds of "optimization
opportunities":  theoretical ones that may or may not be exploited
some day, and those that CPython has _already_ exploited.

That is, we don't have a blank slate here.  As Raymond said, the set
implementation already diverged from the dict implementation in
fundamental ways for "go fast" reasons.  How much are people willing
to see set operations slow down to get this new feature?

For me, "none" ;-)  Really.  I have no particular use for "ordered"
sets, but have set-slinging code that benefits _today_ from the "go
fast" work Raymond did for them.

Analogy:  it was always obvious that list.sort() is "better" stable
than not, but I ended up crafting a non-stable samplesort variant that
ran faster than any stable sort anyone tried to write.  For years.  So
we stuck with that, to avoid slowing down sorting across releases.

The stable "timsort" finally managed to be _as_ fast as the older
samplesort in almost all cases, and was very much faster in many
real-world cases.  "Goes faster" was the thing that really sold it,
and so much so that its increased memory use didn't count for much
against it in comparison.

Kinda similarly, "ordered dicts" were (as has already been pointed
out) originally introduced as a memory optimization ("compact" dicts),
where the "ordered" part fell out more-or-less for free.  The same
approach wouldn't save memory for sets (or so I'm told), so the
original motivation for compact dicts doesn't carry over.

So I'm +1 on ordered sets if and only if there's an implementation
that's at least as fast as what we already have.  If not now, then,
like ordered dicts evolved, offer a slower OrderedSet type in the
`collections` module for those who really want it, and wait for magic
;-)

BTW, what should

{1, 2} | {3, 4, 5, 6, 7}

return as ordered sets?  Beats me.  The imbalance between the
operands' cardinalities can be arbitrarily large, and "first make a
copy of the larger one, then loop over the smaller one" is the obvious
way to implement union to minimize the number of lookups needed.  The
speed penalty for, e.g., considering "the left" operand's elements'
insertion orders to precede all "the right" operand's elements'
insertion orders can be arbitrarily large.

The concept of "insertion order" just doesn't make well-defined sense
to me for any operation the produces a set from multiple input sets,
unless it means no more than "whatever order happens to be used
internally to add elements to the result set".  Dicts don't have
anything like that, although dict.update comes close, but in that case
the result is defined by mutating the dict via a one-at-a-time loop
over the argument dict.
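
Spelled out (dicts lack |, so using unpacking, which is defined the same
one-at-a-time way), that definition already yields results that compare
equal but display in different orders:

a = dict.fromkeys([1, 2])
b = dict.fromkeys([3, 4, 5, 6, 7])
print(list({**a, **b}))           # [1, 2, 3, 4, 5, 6, 7]
print(list({**b, **a}))           # [3, 4, 5, 6, 7, 1, 2]
print({**a, **b} == {**b, **a})   # True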
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YXSGUPZYTO7TKOVVU32276M54TMITVVQ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Tim]
> BTW, what should
>
> {1, 2} | {3, 4, 5, 6, 7}
>
> return as ordered sets?  Beats me.

[Larry]
> The obvious answer is {1, 2, 3, 4, 5, 6, 7}.

Why?  An obvious implementation that doesn't ignore performance entirely is:

def union(smaller, larger):
    if len(larger) < len(smaller):
        smaller, larger = larger, smaller
    result = larger.copy()
    for x in smaller:
        result.add(x)
    return result

In the example, that would first copy {3, 4, 5, 6, 7}, and then add 1
and 2 (in that order) to it, giving {3, 4, 5, 6, 7, 1, 2} as the
result.

If it's desired that "insertion order" be consistent across runs,
platforms, and releases, then what "insertion order" _means_ needs to
be rigorously defined & specified for all set operations.  This was
comparatively trivial for dicts, because there are, e.g., no
commutative binary operators defined on dicts.

If you want to insist that `a | b` first list all the elements of a,
and then all the elements of b that weren't already in a, regardless
of cost, then you create another kind of unintuitive surprise:  in
general the result of "a | b" will display differently than the result
of "b | a" (although the results will compare equal), and despite that
the _user_ didn't "insert" anything.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OU37ZU46BCHI6HLA7E3NEWCDOLQOHRNF/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Raymond]
> ...
> * The ordering we have for dicts uses a hash table that indexes into a 
> sequence.
> That works reasonably well for typical dict operations but is unsuitable for 
> set
> operations where some common use cases make interspersed additions
> and deletions (that is why the LRU cache still uses a cheaply updated
> doubly-linked list rather than deleting and reinserting dict entries).

I'm going to take a stab at fleshing that out a bit:  ideally, an
ordered dict stores hash+key+value records in a contiguous array with
no "holes".  That's ("no holes") where the memory savings comes from.
The "holes" are in the hash table, which only hold indices _into_ the
contiguous array (and the smallest width of C integer indices
sufficient to get to every slot in the array).

"The problem" is that deletions leave _two_ kinds of holes:  one in
the hash table, and another in the contiguous vector.  The latter
holes cannot be filled with subsequent new hash+key+value records
because that would break insertion order.

So in an app that mixes additions and deletions, the ordered dict
needs to be massively rearranged at times to squash out the holes left
behind by deletions, effectively moving all the holes to "the end",
where they can again be used to reflect insertion order.
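
A stripped-down sketch of that layout (collision probing, index-table
resizing, and the variable-width index trick all omitted; the class and
names are invented here):

class CompactDict:
    def __init__(self):
        self.indices = [None] * 8    # sparse hash table holding small ints
        self.entries = []            # dense array of (hash, key, value)

    def _slot(self, key):
        return hash(key) % len(self.indices)

    def __setitem__(self, key, value):
        i = self._slot(key)
        if self.indices[i] is None:
            self.indices[i] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:
            self.entries[self.indices[i]] = (hash(key), key, value)

    def __delitem__(self, key):
        i = self._slot(key)
        self.entries[self.indices[i]] = None   # the "hole" discussed above
        self.indices[i] = None

    def keys(self):                  # insertion order falls straight out of
        return [e[1] for e in self.entries if e is not None]  # the dense array

d = CompactDict()
d[0] = "a"; d[1] = "b"; d[2] = "c"
del d[1]
d[3] = "d"
print(d.keys())      # [0, 2, 3] - order preserved, but entries now contains
print(d.entries)     # a hole only a wholesale rearrangement can squash out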

Unordered dicts never had to be rearranged unless the total size
changed "a lot", and that remains true of the set implementation.  But
in apps mixing adds and deletes, ordered dicts can need massive
internal rearrangement at times even if the total size never changes
by more than 1.

Algorithms doing a lot of mixing of adds and deletes seem a lot more
common for sets than for dicts, and so the ordered dict's basic
implementation _approach_ is a lot less suitable for sets.  Or, at
least, that's my best attempt to flesh out Raymond's telegraphic
thinking there.

Note:  the holes left by deletions _wouldn't_ be "a problem" _except_
for maintaining insertion order.  If we were only after the memory
savings, then on deletion "the last" record in the contiguous array
could be moved into the hole at once, leaving the array hole-free
again.  But that also changes the order.  IIRC, that's what the
original "compact dict" _did_ do.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QQXSSJWHKTUNHSMSHVM7XLMDBMUV7BDX/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Petr Viktorin ]
> ...
> Originally, making dicts ordered was all about performance (or rather
> memory efficiency, which falls in the same bucket.) It wasn't added
> because it's better semantics-wise.

As I tried to flesh out a bit in a recent message, the original
"compact dict" idea got all the memory savings, but did _not_ maintain
insertion order.  Maintaining insertion order too complicated
deletions (see recent message), but was deliberately done because
people really did want "ordered dicts".

> Here's one (very simplified and maybe biased) view of the history of dicts:
>
> * 2.x: Dicts are unordered, please don't rely on the order.
> * 3.3: Dict iteration order is now randomized. We told you not to rely
> on it!
> * 3.6: We now use an optimized implementation of dict that saves memory!
> As a side effect it makes dicts ordered, but that's an implementation
> detail, please don't rely on it.
> * 3.7: Oops, people are now relying on dicts being ordered. Insisting on
> people not relying on it is battling windmills. Also, it's actually
> useful sometimes, and alternate implementations can support it pretty
> easily. Let's make it a language feature! (Later it turns out
> MicroPython can't support it easily. Tough luck.)

A very nice summary!  My only quibble is as above:  the "compact dict"
implementation doesn't maintain insertion order naturally, _unless_
there are no deletions (which is, e.g., true of dicts constructed to
pass keyword arguments).  The code got hairier to maintain insertion
order in the presence of mixing insertions and deletions.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/TRVOCOLQGSOM2OLLAI3UPRCTFIKIWWH6/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Tim]
>> If it's desired that "insertion order" be consistent across runs,
>> platforms, and releases, then what "insertion order" _means_ needs to
>> be rigorously defined & specified for all set operations.  This was
>> comparatively trivial for dicts, because there are, e.g., no
>> commutative binary operators defined on dicts.

[Larry]
> My intuition is that figuring out sensible semantics here is maybe not
> trivial, but hardly impossible.  If this proposal goes anywhere I'd be
> willing to contribute to figuring it out.

No, it _is_ easy.  It's just tedious, adding piles of words to the
docs, and every constraint also constrains possible implementations.
You snipped my example of implementing union, which you should really
think about instead ;-)


>> If you want to insist that `a | b` first list all the elements of a,
>> and then all the elements of b that weren't already in a, regardless
>> of cost, then you create another kind of unintuitive surprise:  in
>> general the result of "a | b" will display differently than the result
>> of "b | a" (although the results will compare equal), and despite that
>> the _user_ didn't "insert" anything.

> Call me weird--and I won't disagree--but I find nothing unintuitive about 
> that.
>  After all, that's already the world we live in: there are any number of sets
> that compare as equal but display differently.  In current Python:
>
> >>> a = {'a', 'b', 'c'}
> >>> d = {'d', 'e', 'f'}
> >>> a | d
> {'f', 'e', 'd', 'a', 'b', 'c'}
> >>> d | a
> {'f', 'b', 'd', 'a', 'e', 'c'}

Yup, it happens.  But under the sample union implementation I gave, it
would never happen when the sets had different cardinalities (the
different sizes are used to force a "standard" order then).  For
mathematical sets, | is commutative (it makes no difference to the
result if the arguments are swapped - but can make a _big_ difference
to implementation performance unless the implementation is free to
pick the better order).
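
To make that concrete, here's a toy union (my own sketch, not the
sample implementation referred to above) that uses a plain dict as a
stand-in for an ordered set and lets cardinality pick which operand
leads:

    def union(a, b):
        # iterate the smaller operand; the larger one leads, so different
        # cardinalities force a "standard" result order either way
        big, small = (a, b) if len(a) >= len(b) else (b, a)
        result = dict.fromkeys(big)          # dict as an ordered-set stand-in
        result.update(dict.fromkeys(small))
        return list(result)

    print(union("abcd", "ce"))   # ['a', 'b', 'c', 'd', 'e']
    print(union("ce", "abcd"))   # ['a', 'b', 'c', 'd', 'e'] - same display either way

Only when the cardinalities tie does the tie-break (here, whichever
operand comes first) leak into the display - which is the case noted
above.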

> ...
> This is also true for dicts, in current Python, which of course do maintain
> insertion order.  Dicts don't have the | operator, so I approximate the
> operation by duplicating the dict (which AFAIK preserves insertion order)

Ya, it does, but I don't believe that's documented (it should be).

> and using update.

Too different to be interesting here - update() isn't commutative.
For sets, union, intersection, and symmetric difference are
commutative.

> ...
> Since dicts already behave in exactly that way, I don't think it would be too
> surprising if sets behaved that way too.  In fact, I think it's a little 
> surprising
> that they behave differently, which I suppose was my whole thesis from
> the beginning.

I appreciate that dicts and sets behave differently in visible ways
now.  It just doesn't bother me enough that I'm cool with slowing set
operations to "fix that".
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/L2VGSHROPMW4BV6ORYMFKQ4ZJ5AXTZLE/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-16 Thread Tim Peters
[Larry]
> Didn't some paths also get slightly slower as a result of maintaining
> insertion order when mixing insertions and deletions?

I paid no attention at the time.  But in going from "compact dict" to
"ordered dict", deletion all by itself got marginally cheaper.  The
downside was the need to rearrange the whole dict when too many
"holes" built up.  "Compact (but unordered) dict" doesn't need that.

> My recollection is that that was part of the debate--not only "are we going to
> regret inflicting these semantics on posterity, and on other 
> implementations?",
> but also "are these semantics worth the admittedly-small performance hit, in
> Python's most important and most-used data structure?".

There's a layer of indirection in compact dicts - lookups are
distributed across two arrays.  In non-compact unordered dicts,
everything is in a single array.  Cache effects may or may not make a
measurable difference then, depending on all sorts of things.
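
For anyone who hasn't stared at dictobject.c, here's a toy picture of
that layout difference (illustration only - nothing like the real C
structures):

    # old-style dict: one big sparse table, each slot holds a whole record
    old_table = [None] * 8
    old_table[5] = ("hash_of_a", "a", 1)
    old_table[2] = ("hash_of_b", "b", 2)

    # compact dict: a sparse table of small ints pointing into a dense,
    # insertion-ordered entries array
    index = [None] * 8
    entries = []
    index[5] = len(entries); entries.append(("hash_of_a", "a", 1))
    index[2] = len(entries); entries.append(("hash_of_b", "b", 2))

    # Lookups take one extra hop (index -> entries), but the sparse part is
    # much smaller, and iterating `entries` gives insertion order for free -
    # until a deletion leaves a hole behind:
    entries[index[5]] = None
    index[5] = None
    print(entries)   # [None, ('hash_of_b', 'b', 2)]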

> Also, I don't recall anything about us resigning ourselves to explicitly
> maintain ordering on dicts because people were relying on it, "battling
> windmills", etc.  Dict ordering had never been guaranteed, a lack
> of guarantee Python had always taken particularly seriously.  Rather, we
> decided it was a desirable feature, and one worth pursuing even at the
> cost of a small loss of performance.

I'm not convinced there was a loss of performance.  The delay between
the implementation supplying ordered dicts, and the language
guaranteeing it, was, I suspect, more a matter of getting extensive
real-world experience about whether the possible new need to massively
rearrange dict internals to remove "holes" would bite someone too
savagely to live with.  But, again, I wasn't paying attention at the
time.

> One prominent Python core developer** wanted this feature for years, and I 
> recall
> them saying something like:
>
> Guido says, "When a programmer iterates over a dictionary and they see the 
> keys
> shift around when the dictionary changes, they learn something!"  To that I 
> say--"Yes!
> They learn that Python is unreliable and random!"

I never wanted ordered dicts, but never objected to them either.  All
in all, given that _I_ haven't seen a performance degradation, I like
that they're more predictable, and love the memory savings.

But as Raymond said (& I elaborated on), there's reason to believe
that the implementation of ordered dicts is less suited to sets, where
high rates of mixing adds and deletes is more common (thus triggering
high rates of massive internal dict rearranging).  Which is why I said
much earlier that I'd be +1 on ordered sets only when there's an
implementation that's as fast as what we have now.

Backing up:

> Python is the language where speed, correctness, and readability trump
> performance every time.

Speed trumping performance didn't make any sense ;-)

So assuming you didn't intend to type "speed", I think you must have,
say, Scheme or Haskell in mind there.  "Practicality beats purity" is
never seen on forums for those languages.  Implementation speed & pain
have played huge roles in many Python decisions.  As someone who has
wrestled with removing the GIL, you should know that better than
anyone ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QK3KERIPYD3Q3XNKZJBQBQ6NUUKT63WN/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-17 Thread Tim Peters
...

[Larry]
>> One prominent Python core developer** wanted this feature for years, and I 
>> recall
>> them saying something like:
>>
>> Guido says, "When a programmer iterates over a dictionary and they see the 
>> keys
>> shift around when the dictionary changes, they learn something!"  To that I 
>> say--"Yes!
>> They learn that Python is unreliable and random!"

[Tim]
> I never wanted ordered dicts, but never objected to them either.  All
> in all, given that _I_ haven't seen a performance degradation, I like
> that they're more predictable, and love the memory savings.

That reads like I was correcting Larry for misquoting _me_.  Sorry!
He wasn't quoting me, and I didn't think he was.  What he did quote
was delightfully snarky enough that I didn't want to snip it, but not
substantial enough to be worth the bother of addressing.  So I just
moved on to summarize what I said at the time (i.e., nothing).
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LXJU3EKB2KM3BT6MTGDLWVHTZUUIZGJK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-17 Thread Tim Peters
[Larry]
> "I don't care about performance" is not because I'm aching for Python to
> run my code slowly.  It's because I'm 100% confident that the Python
> community will lovingly optimize the implementation.

I'm not ;-)

>  So when I have my language designer hat on, I really don't concern myself
> with performance.  As I thought I said earlier in the thread, I think we 
> should
> figure out the semantics we want first, and then we figure out how to make it 
> fast.

Pretty much the opposite of how Python started.  The original lists
and dicts were deliberately designed to have dirt simple
implementations, in reaction against ABC's "theoretically optimal"
data structures that were a nightmare to maintain, and had so much
overhead to support "optimality" in all cases that they were, in fact,
much slower in common cases.

There isn't magic waiting to be uncovered here:  if you want O(1)
deletion at arbitrary positions in an ordered sequence, a doubly
linked list is _the_ way to do it.  That's why, e.g., Raymond said he
still uses a doubly linked list, instead of an ordered dict, in the
LRU cache implementation.  If that isn't clear, a cache can be
maintained in least-to-most recently accessed order with an ordered
dict like so:

    if key in cache:
        cached_value = cache.pop(key)  # remove key
    else:
        cached_value = ...  # compute cached_value
        assert key not in cache
    cache[key] = cached_value  # most recently used at the end now
    return cached_value

and deleting the least recently used is just "del
cache[next(iter(cache))]" (off topic, just noting this is a fine use
for the "first()" function that's the subject of a different thread).

We _could_ structure an ordered dict's hash+key+value records as a
doubly linked list instead (and avoid the need for O(N)
reorganizations).  But then we piss away much of the memory savings
(to store the new links) that was the _point_ of compact dicts to
begin with.

So there was a compromise.  No links, deletion on its own is O(1), but
can periodically require O(N) under-the-covers table reorganizations
to squash out the holes.  "Suitable enough" for ordered dicts, but
_thought_ to be unsuitable for ordered sets (which appear to have
higher rates of mixing deletions with insertions - the LRU cache
example being an exception).

But there are also other optimizations in the current set
implementation, so "fine, add the doubly linked list to sets but not
to dicts" is only part of it.

Which may or may not be possible to match, let alone beat, in an
ordered set implementation.  A practical barrier now is that Python is
too mature to bank on loving optimizations _after_ a change to a core
feature is released.  It's going to need a highly polished
implementation first.

I certainly don't object to anyone trying, but it won't be me ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/UXWGNLGC3BG2XSMYN5H73FVKP3O2XUUC/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-19 Thread Tim Peters
[Nick Coghlan ]
> Starting with "collections.OrderedSet" seems like a reasonable idea,
> though - that way "like a built-in set, but insertion order preserving" will
> have an obvious and readily available answer, and it should also
> make performance comparisons easier.

Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.

The problem is that the "use case" here isn't really a matter of
functionality, but of consistency:  "it would be nice if", like dicts
enjoy now, the iteration order of sets was defined in an
implementation-independent way.  That itch isn't scratched unless the
builtin set type defines it.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/PM5ENMLR665XG32AS2FEAEUVDG3AFWV6/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-19 Thread Tim Peters
[Nick]
> I took Larry's request a slightly different way:

Sorry, I was unclear:  by "use case" I had in mind what appeared to me
to be the overwhelming thrust of the _entirety_ of this thread so far,
not Larry's original request.

> he has a use case where he wants order preservation (so built in sets aren't
> good), but combined with low cost duplicate identification and elimination and
> removal of arbitrary elements (so lists and collections.deque aren't good).

Sure.

> Organising a work queue that way seems common enough to me to be
> worthy of a stdlib solution that doesn't force you to either separate a
> "job id" from the "job" object in an ordered dict, or else use an ordered
> dict with "None" values.
>
> So while it means answering "No" to the "Should builtin sets preserve
> order?" question (at least for now), adding collections.OrderedSet *would*
> address that "duplicate free pending task queue" use case.

Only Larry can answer whether that would meet his perceived need.  My
_guess_ is that he wouldn't know OrderedSet existed, and, even if he
did, he'd use a dict with None values anyway (because it's less hassle
and does everything he wanted).

But I have to add that I don't know enough about his original use case
to be sure it wasn't "an XY problem":

https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
"""
The XY problem is asking about your attempted solution rather than
your actual problem.

That is, you are trying to solve problem X, and you think solution Y
would work, but instead of asking about X when you run into trouble,
you ask about Y.
"""

Because I've never had a "job queue" problem where an OrderedSet would
have helped.  Can't guess what's different about Larry's problem.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/TV2XVX2NKUGJJZNFKO4TSMEPVQI6Y75Q/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-19 Thread Tim Peters
[David Mertz ]
> It's not obvious to me that insertion order is even the most obvious or
> most commonly relevant sort order.  I'm sure it is for Larry's program, but
> often a work queue might want some other order.  Very often queues
> might instead, for example, have a priority number assigned to them.

Indeed, and it makes me more cautious that claims for the _usefulness_
(rather than just consistency) of an ordered set are missing an XY
problem.  The only "natural" value of insertion-order ordering for a
dynamic ordered set is that it supports FIFO queues.  Well, that's one
thing that a deque already excels at, but much more obviously,
flexibly, and space- and time- efficiently.

For dicts the motivating use cases were much more "static", like
preserving textual order of keyword argument specifications, and
avoiding gratuitous changing of order when round-tripping key-value
encodings like much of JSON.

So what problem(s) would a dynamic ordered set really be aiming at?  I
don't know.  Consistency & predictability I understand and appreciate,
but little beyond that.  Even FIFO ordering is somewhat a PITA, since
`next(iter(set))` is an overly elaborate way to need to spell "get the
first" if a FIFO queue _is_ a prime dynamic use case.  And there's no
efficient way at all to access the other end (although I suppose
set.pop() would - like dict.popitem() - change to give a _destructive_
way to get at "the other end").
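
For instance, with a plain dict standing in for an ordered set:

    d = dict.fromkeys("abc")      # 'a' inserted first, 'c' last
    print(next(iter(d)))          # 'a' - non-destructive "get the first"
    print(d.popitem())            # ('c', None) - destructively pops the newest end
    print(list(d))                # ['a', 'b']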
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/J52ATOHFXNBSEJV6QQZBFXSC2TAHOWGW/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-19 Thread Tim Peters
[Nick]
> I must admit that I was assuming without stating that a full OrderedSet
> implementation would support the MutableSequence interface.

Efficient access via index position too would be an enormous new
requirement,  My bet:  basic operations would need to change from O(1)
to O(log(N)).

BTW, in previous msgs there are links to various implementations
calling themselves "ordered sets".  One of them supplies O(1)
indexing, but at the expense of making deletion O(N) (!):

https://pypi.org/project/ordered-set/

If efficient indexing is really wanted, then the original "use case"
Larry gave was definitely obscuring an XY problem ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/IBRSGUTHIMOZ6JGIYJBQJFXEANFZI4V5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-20 Thread Tim Peters
[Wes Turner ]
>> How slow and space-inefficient would it be to just implement the set methods
>> on top of dict?

[Inada Naoki ]
> Speed:  Dict doesn't cache the position of the first item.  Calling
> next(iter(D)) repeatedly is O(N) in worst case.
> ...

See also Raymond's (only) message in this thread.  We would also lose
low-level speed optimizations specific to the current set
implementation.  And we would need to define what "insertion order"
_means_ for operators (union, intersection, symmetric difference) that
build a new set out of other sets without a user-level "insert" in
sight.  However that's defined, it may constrain possible
implementations in ways that are inherently slower (for example, may
require the implementation to iterate over the larger operand where
they currently iterate over the smaller).

And the current set implementation (like the older dict
implementation) never needs to "rebuild the table from scratch" unless
the cardinality of the set keeps growing.  As Raymond telegraphically
hinted, the current dict implementation can, at times, in the presence
of deletions, require rebuilding the table from scratch even if the
dict's maximum size remains bounded.

That last can't be seen directly from Python code (rebuilding the
table is invisible at the Python level).  Here's a short example:

d = dict.fromkeys(range(3))
while True:
    del d[0]
    d[0] = None

Run that with a breakpoint set on dictobject.c's `dictresize()`
function.  You'll see that it rebuilds the table from scratch every
3rd time through the loop.  In effect, for the purpose of deciding
whether it needs to rebuild, the current dict implementation sees no
difference between adding a new element and deleting an existing
element.  Deleting leaves behind "holes" that periodically have to be
squashed out of existence so that insertion order can be maintained in
a dead simple way upon future additions.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/BVCDKT5ULY324RZATMCEFGAMYOBJFHIZ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-20 Thread Tim Peters
[Inada Naoki ]
> I just meant the performance of the next(iter(D)) is the most critical part
> when you implement orderdset on top of the current dict and use it as a queue.

Which is a good point.  I added a lot more, though, because Wes didn't
even mention queues in his question:

    [Wes Turner ]
    How slow and space-inefficient would it be to just implement the set
    methods on top of dict?

I can't guess what he was _really_ asking about, so now he's got lots
of answers ;-)

> This code should be O(N), but it's O(N^2) if q is implemented on top
> of the dict.
>
>    while q:
>        item = q.popleft()
>
> Sorry for the confusion.

To flesh this out, the FIFO queue entries are all together in a
contiguous vector, with no "holes" in the middle.  Pushing a new item
appends "to the right" (higher index in the vector), while popping an
item leaves "a hole" at the left.  But there are no holes "in the
middle" in this case.

So the first real entry is at a higher index with every pop, but
iteration starts at index 0 every time.  The more pops there are, the
more holes iteration has to skip over, one at a time, to find the
first real entry remaining.

Until pushing a new item would fall off the right end of the vector.
Then the table is rebuilt from scratch, moving all the real entries to
form a contiguous block starting at index 0 again.
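
A crude way to watch the effect (my sketch; the timings are only
illustrative and will vary by machine):

    from collections import deque
    from time import perf_counter

    def drain_dict(n):
        d = dict.fromkeys(range(n))
        while d:
            del d[next(iter(d))]   # "popleft": re-scans the growing run of holes

    def drain_deque(n):
        q = deque(range(n))
        while q:
            q.popleft()            # O(1): no holes, no scanning

    for f in drain_dict, drain_deque:
        start = perf_counter()
        f(50_000)
        print(f.__name__, round(perf_counter() - start, 3))

    # drain_dict grows roughly quadratically with n; drain_deque stays linear.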


>> And the current set implementation (like the older dict
>> implementation) never needs to "rebuild the table from scratch" unless
>> the cardinality of the set keeps growing.

> It is a bit misleading.  If "the cardinality of the set" means len(S),

Yes.

> set requires the rebuild in low frequency if the its items are random.
>
> Anyway, it is a smaller problem than next(iter(D)) because of the
> amortized cost is O(1).

For the specific case of FIFO queues, sure.  In general, though, "O(1)
period". is more desirable than "amortized O(1)".  This is especially
true when sets get very large.


> Current dict is not optimized for queue usage.

Nor should it be :-)  That's what collections.deque is for: pushes and
pops, at both ends, always time O(1) period.  No holes, no internal
rearrangements, and O(1) "wasted" space.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5VDU52JX2GFM6546W6FCFKXIODDAQKP4/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters
.
>>>
>>> Also, this is mostly speculation since I haven't ran any benchmarks for an 
>>> OrderedSet implementation, but I strongly suspect that OrderedSet will end 
>>> up having better average performance for add, pop, and update than trying 
>>> to use a dictionary with None values (assuming it's implemented in C or at 
>>> least with a C accelerator).
>>>
>>> Not to mention allowing the ability to update (for both addition and 
>>> removal) based on intersections and unions across ordered sets, which 
>>> currently isn't possible to do with dictionaries (at least not directly 
>>> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case 
>>> this would have for ordered sets specifically, but it's a nice bonus.
>>>
>>> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings  wrote:
>>>>
>>>>
>>>> (mixing together messages in the thread, sorry threaded-view readers)
>>>>
>>>> On 12/19/19 3:15 PM, Tim Peters wrote:
>>>>
>>>> [Nick]
>>>>
>>>> I took Larry's request a slightly different way:
>>>>
>>>> [...]
>>>>
>>>> Only Larry can answer whether that would meet his perceived need.  My
>>>> _guess_ is that he wouldn't know OrderedSet existed, and, even if he
>>>> did, he'd use a dict with None values anyway (because it's less hassle
>>>> and does everything he wanted).
>>>>
>>>>
>>>> At last, something I can speak knowledgeably about: Larry's use case!  
>>>> Regarding Larry, I'd say
>>>>
>>>> his use case was small enough that almost anything maintaining order would 
>>>> have worked, even a list,
>>>> he often does have a pretty good idea what goodies are salted away in the 
>>>> Python standard library, and
>>>> a hypothetical collections.OrderedSet would probably work just fine.  And 
>>>> he'd probably use it too, as that makes the code clearer / easier to read. 
>>>>  If you read code using an OrderedSet, you know what the intent was and 
>>>> what the semantics are.  If you see code using a dict but setting the 
>>>> values to None, you think "that's crazy" and now you have a puzzle to 
>>>> solve.  Let's hope those comments are accurate!
>>>>
>>>>
>>>> Also, speaking personally, at least once (maybe twice?) in this thread 
>>>> folks have asked "would YOU, Mr Larry, really want ordered sets if it 
>>>> meant sets were slightly slower?"
>>>>
>>>> The first thing I'd say is, I'm not sure why people should care about 
>>>> what's best for me.  That's sweet of you!  But you really shouldn't.
>>>>
>>>> The second thing is, my needs are modest, so the speed hit we're talking 
>>>> about would likely be statistically insignificant, for me.
>>>>
>>>> And the third thing is, I don't really put the set() API through much of a 
>>>> workout anyway.  I build sets, I add and remove items, I iterate over 
>>>> them, I do the occasional union, and only very rarely do I use anything 
>>>> else.  Even then, when I write code where I reach for a fancy operation 
>>>> like intersection or symmetric_difference--what tends to happen is, I have 
>>>> a rethink and a refactor, and after I simplify my code I don't need the 
>>>> fancy operations anymore.  I can't remember the last time I used one of 
>>>> these fancy operations where the code really stuck around for a long time.
>>>>
>>>> So, personally?  Sure, I'd take that tradeoff.  As already established, I 
>>>> like that dict objects maintain their order, and I think it'd be swell if 
>>>> sets maintained their order too.  I suspect the performance hit wouldn't 
>>>> affect me in any meaningful way, and I could benefit from the 
>>>> order-maintaining semantics.  I bet it'd have other benefits too, like 
>>>> making regression tests more stable.  And my (admittedly uninformed!) 
>>>> assumption is, the loss of performance would mostly be in these 
>>>> sophisticated operations that I never seem to use anyway.
>>>>
>>>> But I don't mistake my needs for the needs of the Python community at 
>>>> large.  I'd be mighty interested

[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters
Sorry!  A previous attempt to reply got sent before I typed anything :-(

Very briefly:

> >>> timeit.timeit("set(i for i in range(1000))", number=100_000)

[and other examples using a range of integers]

The collision resolution strategy for sets evolved to be fancier than
for dicts, to reduce cache misses.  This is important for sets because
the _only_ interesting thing about an element wrt a set is whether or
not the set contains it.   Lookup speed is everything for sets.

If you use a contiguous range of "reasonable" integers for keys, the
integer hash function is perfect:  there's never a collision.  So any
such test misses all the work Raymond did to speed set lookups.
String keys have sufficiently "random" hashes to reliably create
collisions, though.  Cache misses can, of course, have massive effects
on timing.
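
A quick illustration of the difference:

    # small non-negative ints hash to themselves, so a contiguous range can't collide
    print(all(hash(i) == i for i in range(1000)))        # True

    # string hashes look random (and vary across runs unless PYTHONHASHSEED is
    # fixed), so masking them down to a tiny table reliably produces collisions
    MASK = 7
    print(sorted(hash(s) & MASK for s in "abcdefgh"))    # duplicates are likely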

> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088

In the former case you're primarily measuring the time to look up the
"add" method of sets (that's more expensive than adding 0 to the set).
A better comparison would, e.g., move `add = s.add` to a setup line,
and use plain "add(0)" in the loop.
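
Something like this, that is (a hypothetical re-run; the absolute
numbers will differ by machine and Python version):

    import timeit

    # hoist the attribute lookup into setup so the loop measures the add itself
    t_set  = timeit.timeit("add(0)", setup="s = set(); add = s.add", number=10_000_000)
    t_dict = timeit.timeit("d[0] = None", setup="d = {}", number=10_000_000)
    print(t_set, t_dict)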

That's it!
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5UKYDB36WDRRIYOLRD52QS4I43SCAAJ6/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters
[Kyle]
> ...
> For some reason, I had assumed in the back of my head (without
> giving it much thought) that the average collision rate would be the
> same for set items and dict keys. Thanks for the useful information.

I know the theoretical number of probes for dicts, but not for sets
anymore.  The latter use a mix of probe strategies now, "randomish"
jumps (same as for dicts) but also purely linear ("up by 1") probing
to try to exploit L1 cache.

It's not _apparent_ to me that the mix actually helps ;-)  I just
trust that Raymond timed stuff until he was sure it does.

> ...
> Ah, I forgot to consider how the hash function actually works for continuous
> integers. A better comparison to demonstrate the collision differences would
> likely use random strings.

And, at an extreme, a class with a __hash__ that always returns the
same integer.

> Also, I believe that max "reasonable" integer range of no collision
> is (-2305843009213693951, 2305843009213693951), ...

Any range  that does _not_ contain both -2 and -1 (-1 is an annoying
special case, with hash(-1) == hash(-2) == -2), and spans no more than
sys.hash_info.modulus integers.  Apart from that, the sign and
magnitude of the start of the range don't matter; e.g.,

>>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100)))
100
>>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100)))
100
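
And the special case, for the record:

>>> hash(-1), hash(-2)
(-2, -2)
>>> import sys
>>> sys.hash_info.modulus   # 2**61 - 1 on typical 64-bit builds
2305843009213693951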
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FXTRZLOHSWYV5KEGJ4DYNS7HOMUOPVE5/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-23 Thread Tim Peters
>> Also, I believe that max "reasonable" integer range of no collision
>> is (-2305843009213693951, 2305843009213693951), ...

> Any range  that does _not_ contain both -2 and -1 (-1 is an annoying
> special case, with hash(-1) == hash(-2) == -2), and spans no more than
> sys.hash_info.modulus integers.  Apart from that, the sign and
> magnitude of the start of the range don't matter; e.g.,
>
> >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100)))
> 100
> >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100)))
> 100

Sorry again!  That only shows that the hash codes are unique.  Dicts
and sets use only the last N bits to determine the start of the probe
sequence, for some value of N >= 3 (depending on the table size).  So
for a table of size a million, the last 20 bits (10**6 ~= 2**20) are
interesting.  But same thing:

>>> MASK = (1 << 20) - 1
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100)))
100
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100)))
100
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/BXHOS73YMJDSVZUM74VYXXYN5WW63BZ4/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-25 Thread Tim Peters
[Tim]
> I know the theoretical number of probes for dicts, but not for sets
> anymore.  The latter use a mix of probe strategies now, "randomish"
> jumps (same as for dicts) but also purely linear ("up by 1") probing
> to try to exploit L1 cache.
>
> It's not _apparent_ to me that the mix actually helps ;-)  I just
> trust that Raymond timed stuff until he was sure it does.

To illustrate, I contrived a case where the set optimizations clearly
pay off.  Output (3.8.1, Win 10 Pro, quad core box with plenty of
RAM):

11184810 nice keys
dict build            0.54
dict iter             0.07
dict lookup           0.81
set build             0.54
set iter              0.08
set lookup            0.82

11184810 nasty keys
dict build           23.32
dict iter             0.09
dict lookup          13.26
set build             9.79
set iter              0.28
set lookup            8.46

I didn't make heroic efforts to keep the machine quiet, and there's
substantial variation across runs.  For example, there's no real
difference between "0.07" and "0.08".

Some things to note:

- The "nice" keys act much the same regardless of whether dict or set.
Dicts have an advantage for iteration "in theory" in the absence of
deletions because there are no "holes" in the area storing the
hash+key+value records.  But for these specific keys, the same is true
of sets:  the initial hash probes are at indices 0, 1, 2, ...,
len(the_set)-1, and there are no holes in its hash+key records either
(all the empty slots are together "at the right end").

- Sets get a lot of value out of their fancier probe sequence for the
nasty keys.  There are lots of collisions, and sets can much more
often resolve them from L1 cache.  Better than a factor of 2 for build
time is nothing to sneeze at.

- Iteration for sets is substantially slower in this case, for two
reasons:  (1) sets _do_ have holes for this collection of keys; and,
(2) while I don't think it's documented, the maximum load factor for
sets is lower than for dicts, and 11184810 was picked to give the
maximum load factor for a dict with 2**24 slots.  The set in this case
has twice as many slots as the dict, and all the extras are empty
slots that iteration has to skip over.

- I don't have a theory for why dict build time is _so_ much higher
than dict lookup time for the nasty keys.

- If you don't know a lot about how these things are implemented, just
about everything is baffling ;-)  Integer keys are important in
practice, but things about how modern dicts are implemented were
_driven_ by opportunities to exploit properties of the mostly-trivial
int hash function.  For "general" timing, it's better to use string
keys, whose hash codes are "pretty randomish" (but can vary across
runs).  When using int keys, it's really easy to _end up_ measuring
happy behaviors _specific_ to int keys.

Code:

from time import perf_counter as now
import sys
from collections import deque

def disp(text, f):
    print(f"{text:20s} {f:5.2f}")

M = sys.hash_info.modulus
targetlen = 2**24 * 2 // 3  # highest load factor for dict
hc = 1
badkeys = []
while len(badkeys) < targetlen:
    # add no more than 100 ints with hash code `hc`
    toadd = min(100, targetlen - len(badkeys))
    badkeys.extend(hc + i * M for i in range(toadd))
    hc += 713  # more than less arbitrary
goodkeys = list(range(targetlen))

for tag, keys in ("nice", goodkeys), ("nasty", badkeys):
    print()
    print(len(keys), tag, "keys")

    for thetype, builder in ("dict", dict.fromkeys), ("set", set):
        s = now()
        d = builder(keys)
        f = now()
        disp(thetype + " build", f-s)

        s = now()
        deque(d, maxlen=0)
        f = now()
        disp(thetype + " iter", f-s)

        s = now()
        for k in keys:
            assert k in d
        f = now()
        disp(thetype + " lookup", f-s)

        del d
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/S3UB5TCSUBLYSESG7WO2HNCFBZB5N4BG/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-26 Thread Tim Peters
[Tim]
>> - I don't have a theory for why dict build time is _so_ much higher
>> than dict lookup time for the nasty keys.

To be clearer, in context this was meant to be _compared to_ the
situation for sets.  These were the numbers:

11184810 nasty keys
dict build           23.32
dict lookup          13.26
set build             9.79
set lookup            8.46

Build time is higher than lookup time for both, which isn't
surprising.  But it's _much_ higher (whether in absolute or relative
terms) for dicts than for sets.

[Serhiy Storchaka ]
> 1/2 of items were copied to a new hashtable when the dict grew up. 1/4
> of items were copied twice. 1/8 of items were copied three times, etc.
> In average every item is copied 1 time, so the build time is twice the
> lookup time when the cost of lookup is dominated.

I agree the average total number of table insertions per element
approaches 2 either way (counting insertions due to internal table
growth as well as insertions visible from Python code).  That's why
it's expected that build time exceeds lookup time (the latter does
exactly one lookup per element).
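
As a quick sanity check on that counting argument (an idealized model
where the table doubles and copies everything at each growth - not the
real growth policy):

    # half the final items are copied at the last resize, a quarter at the one
    # before, and so on; the geometric series gives ~1 extra copy per item,
    # hence ~2 insertions total counting the original insert
    print(sum(2.0 ** -k for k in range(1, 60)))   # 0.9999999999999998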


> But the lookup benchmarking includes the overhead of iterating Python
> loop, which is more significant for "good" keys, thus the lookup time is
> larger than the build time.

Iteration time is too minor to care much about.  Even in the worst
case (sets with the nasty keys), the Python:

for k in d:
    pass

takes under a second here.  Take a full second off the "set lookup"
time, and dict build time still exceeds dict lookup time (whether in
absolute or relative terms) much more so than for sets.

But I'm not going to pursue it.  The high-order bit was just to
demonstrate that the set object's more complicated (than dicts) probe
sequence does buy highly significant speedups - in at least one highly
contrived case.  Since the probe sequence changes were aimed at
reducing cache misses, a full account requires comparing cache misses
between the dict and set implementations.  That's a mess, and slogging
through it appears unlikely to result in insight.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/IYPHUKW7DGYNAFAC4AOCDYYUWANHN5YE/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-27 Thread Tim Peters
[Nick Coghlan ]
> I took Larry's request a slightly different way: he has a use case where
> he wants order preservation (so built in sets aren't good), but combined
> with low cost duplicate identification and elimination and removal of
> arbitrary elements (so lists and collections.deque aren't good). Organising
> a work queue that way seems common enough ...

Is it?  I confess I haven't thought of a plausible use case.  Larry
didn't really explain his problem, just suggested that an ordered set
would be "a solution" to it.

The problem:  whether there are duplicates "in the queue" is a
question about an implementation detail, hard for me to translate to a
question about the _task_ to be solved.

For example, suppose "the task" is to make a deep copy of a web site.
A "job" consists of following a URL, sucking down the page text, and
adding new jobs for contained URLs on the page.

We probably don't want to suck down the page text multiple times for a
given URL, but checking whether a URL is currently already in the job
queue is asking a question about an implementation detail that misses
the point:  we want to know whether that URL has already been chased,
period.  Whether the URL is currently in the queue is irrelevant to
that.  The only logical relationship is that a job that has finished
_was_ in the queue at some time before the job finished.

So, ya, I've seen and implemented lots of work queues along these
lines - but an OrderedSet would be an "attractive nuisance"
(offhandedly appearing to solve a problem it doesn't actually
address):

jobs = some_kind_of_queue()
finished_jobs = set()
...
while jobs:
    job = jobs.get()
    if job in finished_jobs:
        continue
    try:
        work on the job
        possibly adding (sub)jobs to the `jobs` queue
    except TransientExceptions:
        jobs.put(job)  # try again later
    else:
        finished_jobs.add(job)

Clear?  There is a queue here, and a set, but they can't be combined
in a useful way:  the set contents have nothing to do with what's
currently in the queue - the set consists of jobs that have been
successfully completed; the queue doesn't care whether it contains
duplicates, and merely skips over jobs that have been completed by the
time the queue spits them out (which strikes me as more robust design
than duplicating logic everywhere a job is added to a queue to try to
prevent adding already-done jobs to begin with).

Similarly, in other places I've used a "finished" flag on the object
instead of a set, or a dict mapping jobs to info about job status and
results.  But in all these cases, the vital info associated with a job
really has little to do with whether the job is currently sitting on a
queue.

If "is it currently on the queue?" _is_ a common question to ask,
perhaps someone could give a semi-realistic example?
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/2XDEVSO5S5ZLVZTLX43UHBOJWMXYJIIB/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-28 Thread Tim Peters
[Larry]
> Here is the original description of my problem, from the original email in
> this thread.  I considered this an adequate explanation of my problem
> at the time.

>> I do have a use case for this. In one project I maintain a "ready" list of
>> jobs; I need to iterate over it, but I also want fast lookup because I
>> soemtimes remove jobs when they're subsequently marked "not ready".

Which is a description not of "the problem", but of the operations you
want to perform.  It's the first time in my life I've heard of a
"ready list of jobs" where the programmer sometimes decides later "oh,
no - this one and that one aren't ready after all" ;-)

Not that it matters much - you want the operations you want.  The lack
of "oh - of course - that's a problem we all face at times"
motivation, though, works against it being _compelling_ on its own.

> ...
> In subsequent emails I've clarified that my workloads are small enough--and
> computers are fast enough--that almost any data structure would work fine
> for me here, e.g. a list.
>
>  I don't think my needs should drive the decision making process regardless.
>  I only described my problem to get the conversation rolling.

Which it did :-)  Others went on to, explicitly or implicitly, suggest
that an ordered set must also, e.g., support the entire
MutableSequence interface, and even define what happens if you mutate
the ordered set _while_ iterating over it (lists define the results in
such cases, but ordered dicts raise an exception).


> I opine that anybody who iterates over set objects and has bugs in their
> code would appreciate set objects maintaining insertion order.  I suspect
> it would make their debugging easier, because given identical inputs their
> set iteration would behave identically, thus making their bugs that much
> more stable.  That's as much use case as I have for the feature.

That's much stronger to me than some weird FIFO-only queue supporting
fast deletion of interior elements (which - sorry - I've never had a
use for).

And fine by me if such a thing is added.  All else being equal, I'd
prefer that the builtin set type be ordered, simply because it's less
surprising, and dicts are ordered now (although, as Inada Naoki said,
their current implementation isn't suitable for a FIFO queue, due
primarily to O(N) time needed to find "the first" key+value record).

I believe we agree that adding _some_ flavor of OrderedSet to
`collections` would be a fine start.  I'm just not cool with replacing
the builtin set before there's an implementation fast and
memory-efficient enough that _current_ heavy set users won't feel
betrayed by major slowdowns or memory bloat when the switch is made.

Toward that end, my opinion - which I won't defend - is that
OrderedSet should not promise better than O(N)-time indexing (if it
supports indexing at all), and should - like current sets and dicts -
try to raise an exception if it can detect that the container has been
mutated during iteration.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/AH4LQB3OCMCE22FEWUALRND5P6AHUWE6/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Should set objects maintain insertion order too?

2019-12-29 Thread Tim Peters
[Larry]
> It's a lightweight abstract dependency graph.  Its nodes are opaque,
> only required to be hashable.  And it doesn't require that you give it
> all the nodes in strict dependency order.
>
> When you add a node, you can also optionally specify
> dependencies, and those dependencies aren't required
> to be nodes that the graph has previously seen.  So you can add
> a node A that depends on B and C, without showing it B or C
> first.  When that happens it creates placeholder nodes for B
> and C, and naturally these nodes have no dependencies.  You
> can then later inform the graph that node B has dependencies
> on X Y and Z.
>
> (My specific use case is a package manager.  Packages are identified
> with unique strings.  So the nodes I give the give the graph are simply
> those package names.  It's pretty common for me to get a package
> with dependencies on packages I haven't seen yet.  Designing the
> graph to create placeholders let me skip making two passes over
> the package list, pre-creating the package objects, etc etc.  This
> made the code simpler and presumably faster.)
>
> The graph API maintains an externally-visible "ready" iterable of
> nodes.  And since B can go from ready to not-ready, it can get added
> to "ready" and subsequently removed.
>
> Also, when a node is satisfied, you simply remove it from the graph.
>  If A depends on B and C, and they all represent jobs, and B and C
> have no dependencies, they'll be in the "ready" iterable.  You iterate
> over "ready", and execute those jobs, and assuming they are
> successful you then remove them from the graph.  When A's
> dependencies are all satisfied, it'll get added to the "ready" iterable.
>  And naturally when B and C were removed from the graph, they were
> removed from the "ready" iterable too.

OK!  You're doing a topological sort.  There are natural & simple ways
to do that right now that scale efficiently to very large graphs (time
& space linear in the sum of the number of nodes and the number of
dependencies).  Curiously, the difficulties are mostly driven by the
quality of _error_ messages you want (in case of, e.g., cycles in the
dependency graph).

Lots of those implementations became deterministic "by magic" when
ordered dicts were introduced.  This is why:  a bare bones
implementation needs to associate two pieces of info with each node:
a list of its immediate successors, and an integer count of the number
of its immediate predecessors.  A dict is the natural way to implement
that association.

When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0.  Iterating
over the association dict once is the natural way to find them, and
that order is defined now.

- Then, as each emitted node N is marked done:
      for child in N.successors:
          assert child.predcount > 0
          child.predcount -= 1
          if child.predcount == 0:
              emit(child)

That's about all there is to it.  But there's no end to the cruft that
_can_ be added to, e.g., verify that a node is marked done no more
than once, or to compute & display strongly connected components if
not all nodes are emitted, etc.
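
For concreteness, here's a bare-bones sketch along those lines (names
and details are mine, not from any stdlib module; real code would add
the error-reporting cruft):

    def toposort(deps):
        # deps maps each node to the nodes it depends on
        info = {}                    # node -> [predecessor_count, successors]
        for node, preds in deps.items():
            info.setdefault(node, [0, []])
            for p in preds:
                info.setdefault(p, [0, []])
                info[p][1].append(node)
                info[node][0] += 1

        order = [n for n, (count, _) in info.items() if count == 0]
        i = 0
        while i < len(order):        # each emitted node is "marked done" in turn
            for child in info[order[i]][1]:
                info[child][0] -= 1
                if info[child][0] == 0:
                    order.append(child)
            i += 1
        if len(order) != len(info):
            raise ValueError("cycle in dependencies")
        return order

    print(toposort({"A": ["B", "C"], "B": ["X", "Y", "Z"]}))
    # ['C', 'X', 'Y', 'Z', 'B', 'A'] - deterministic because dicts are ordered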

Ordered sets could improve this "natural" implementation if the
successor lists were successor sets instead, simply because then -
like lists - their iteration order would be defined, which can in turn
affect the order in which nodes are emitted in the loop just above.
But lists are adequate to the task, are cheaper in both time and
space, and can't _not_ be ordered ;-)


> Thus it's this "ready" collection that I want to a) be iterable, b) be 
> stable, and
> c) support fast removal.  I add and remove nodes from that set a lot, which I
> realized would perturb the iteration order from run to run, etc etc etc, and 
> here
> we are.

The sketch above (which is a bog-common way to implement topsorts)
doesn't build a data structure recording the final order, and, in
fact, never deletes anything.  You might protest that the initial
iteration step (scanning the association dict to find nodes with no
predecessors) is an expense you skip because nodes with predecessors
are deleted from your "ready" structure all along.  But nodes with
predecessors are _touched_ either way, and merely skipping over them
is bound to be cheaper than deleting them.

After that initial step, no search of any kind is needed again.

> I grant you maybe it's a weird approach.  But after two false starts, where I
> was starting to boil the oceans with ABCs and "node is satisfied" APis and
> separate iterator objects, I had a re-think and hit on this super lightweight
> design.  I'm actually quite pleased with it--it's short, it has a minimal API,
> it's readable, it was easy to get right, and it solves my problem.

Whereas I would have started with a topsort and finished it while I
was sleeping ;-)  Seriously, I've written such a thing many times, but
never reuse the code because it's so easy to redo

[Python-Dev] Re: Object deallocation during the finalization of Python program

2020-01-09 Thread Tim Peters
[Pau Freixes ]
> Recently I've been facing a really weird bug where a Python program
> was randomly segfaulting during the finalization, the program was
> using some C extensions via Cython.

There's nothing general that can be said that would help.  These
things require excruciating details to resolve.

It's generally the case that these things are caused by code missing
some aspect of correct use of Python's C API.  Which can be subtle!
It's very rare that such misuse is found in the core, but not so rare
in extensions.

Here's the most recent.  Be aware that it's a very long report, and -
indeed - is stuffed to the gills with excruciating details ;-)

https://bugs.python.org/issue38006

In that case a popular C extension wasn't really "playing by the
rules", but we changed CPython's cyclic garbage collector to prevent
it from segfaulting anyway.

So, one thing you didn't tell us:  which version of Python were you
using?  If not the most recent (3.8.1), try again with that (which
contains the patches discussed in the issue report above).

> Debugging the issue I realized that during the deallocation of one of
> the Python objects the deallocation function was trying to release a
> pointer that was surprisingly assigned to  NULL. The pointer was at
> the same time held by another Python object that was an attribute of
> the Python object that had the deallocation function, something like
> this:
>
> ...
>
> So my question would be, could CPython deallocate the objects during
> the finalization step without considering the dependencies between
> objects?

No, it could not.  BUT.  But but but.  If C code isn't playing by all
the rules, it's possible that CPython doesn't _know_ what the
dependencies actually are.  CPython can only know about pointers that
the user's C API calls tell CPython about.  In the report above, the C
extension didn't tell CPython about some pointers in cycles at all.

Curiously, for most kinds of garbage collectors, failing to inform the
collector about pointers is massively & widely & very reliably
catastrophic.  If it doesn't see a pointer, it can conclude that a
live object is actually trash.  But CPython's cyclic collector follows
pointers to prove objects _are_ trash, not to prove that they're
alive.  So when CPython isn't told about a pointer, it usually "fails
soft", believing a trash object is actually alive instead.  That can
cause a memory leak, but usually not a crash.

But not always.  In the presence of weakrefs too, believing a trash
object is actually alive can cause some kinds of finalization to be
done "too late", which can cause a crash when refcounting discovers
that the object actually is trash.  No, you're not going to get a
simple example of that ;-)  Dig through the issue report above.

> If this is not the right list to make this kind of questions, just let
> me know what would be the best place for making this kind of questions

Open an issue report instead?  Discussions won't solve it, alas.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WTH2LH3S52CN435B6I273BJ3QJW5GTES/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Documenting Python's float.__str__()

2020-01-21 Thread Tim Peters
[Serhiy Storchaka]
> This is not the only difference between '.17g' and repr().
>
> >>> '%.17g' % 1.23456789
> '1.23456788'
> >>> format(1.23456789, '.17g')
> '1.23456788'
> >>> repr(1.23456789)
> '1.23456789'

More amazingly ;-), repr() isn't even always the same as a %g format
specifying exactly the same number of digits as repr() produces.

That's because repr(x) currently returns the shortest string such that
eval(repr(x)) == x, but %g rounds correctly to the given number of
digits.  Not always the same thing!

>>> x = 2.0 ** 89
>>> print(repr(x))
6.189700196426902e+26
>>> print("%.16g" % x) # repr produced 16 digits
6.189700196426901e+26

The repr() output is NOT correctly rounded.  To see which one is
correctly rounded, here's an easy way:

>>> import decimal
>>> decimal.Decimal(x)
Decimal('618970019642690137449562112')

The "37449562112" is rounded off, and is less than half a unit in the
last place, so correct rounding truncates the last digit to 1.

But there is no string with 16 digits other than the incorrectly
rounded one repr() returns that gives x back.  In particular, the
correctly rounded 16 digit string does not:

>>> 6.189700196426901e+26 # 16-digit correctly rounded fails
6.189700196426901e+26
>>> x == _
False

To my mind it's idiotic(*) that "shortest string" requires incorrect
rounding in some cases.

In Python's history, eval(repr(x)) == x is something that was always
intended, so long as the writing and reading was done by the same
Python instance on the same machine.  Maybe it's time to document that
;-)

But CPython goes far beyond that now, also supplying correct rounding,
_except_ for repr's output, where - for reasons already illustrated -
"correct rounding" and "shortest" can't always both be satisfied.

(*) What wouldn't be idiotic?  For repr(x) to return the shortest
_correctly rounded_ string such that eval(repr(x)) == x.  In the
example, that would require repr(x) to produce a 17-digit output (and
17 is the most that's ever needed for a Python float on boxes with
IEEE doubles).  But "shortest string" was taken ultra literally by the
people who first worked out routines capable of doing that, so has
become a de facto standard now.
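
For the example above, the correctly rounded 17-digit string does
round-trip:

>>> x = 2.0 ** 89
>>> "%.17g" % x          # 17 digits, correctly rounded
'6.1897001964269014e+26'
>>> float("%.17g" % x) == x
True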
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/F3BOGGTGJPZS3RR7FKG7YE6GYADHYI76/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-01-23 Thread Tim Peters
PyObject_RichCompareBool(x, y, op) has a (valuable!) shortcut:  if x
and y are the same object, then equality comparison returns True and
inequality False.  No attempt is made to execute __eq__ or __ne__
methods in those cases.

This has visible consequences all over the place, but they don't
appear to be documented.  For example,

>>> import math
>>> ([math.nan] * 5).count(math.nan)
5

despite that `math.nan == math.nan` is False.

It's usually clear which methods will be called, and when, but not
really here.  Any _context_ that calls PyObject_RichCompareBool()
under the covers, for an equality or inequality test, may or may not
invoke __eq__ or __ne__, depending on whether the comparands are the
same object.  Also any context that inlines these special cases to
avoid the overhead of calling PyObject_RichCompareBool() at all.
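
A small self-contained illustration (my example, not from the docs):
a class whose __eq__ always raises, yet containment and .count()
succeed because the identity shortcut means __eq__ is never called -
while the == operator itself takes no such shortcut.

    class Loud:
        def __eq__(self, other):
            raise RuntimeError("__eq__ called")

    x = Loud()
    print(x in [x])        # True - list.__contains__ takes the identity shortcut
    print([x].count(x))    # 1    - so does list.count()
    try:
        x == x             # the == operator itself does not
    except RuntimeError as e:
        print(e)           # __eq__ called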

If it's intended that Python-the-language requires this, that needs to
be documented.

Or if it's implementation-defined, then _that_ needs to be documented.

Which isn't straightforward in either case, in part because
PyObject_RichCompareBool isn't a language-level concept.

This came up recently when someone _noticed_ the list.count(NaN)
behavior, and Victor made a PR to document it:

https://github.com/python/cpython/pull/18130

I'm pushing back, because documenting it _only_ for .count() makes
.count() seem unique in a way it isn't, and doesn't resolve the
fundamental issue:  is this language behavior, or implementation
behavior?

Which I don't want to argue about.  But you certainly should ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/3ZAMS473HGHSI64XB3UV4XBICTG2DKVF/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-01-24 Thread Tim Peters
[Inada Naoki ]
> FWIW, (list|tuple).__eq__ and (list|tuple).__contains__ uses it too.
> It is very important to compare recursive sequences.
>
> >>> x = []
> >>> x.append(x)
> >>> y = [x]
> >>> z = [x]
> >>> y == z
> True

That's a visible consequence, but I'm afraid this too must be
considered an implementation-defined quirk.  If doing a fine job of
comparing recursive containers was the actual goal, we would have
needed to do a much fancier thing, akin to testing graph isomorphism:

>>> a = []
>>> b = []
>>> for x in a, b:
...     x.append(x)
>>> a
[[...]]
>>> b
[[...]]
>>> a == a
True
>>> b == b
True
>>> a == b
Traceback (most recent call last):
...
RecursionError: maximum recursion depth exceeded in comparison

Instead the happy behavior in your example (and in my first two
examples) just follows as a no-effort consequence of the
special-casing for identity equality in PyObject_RichCompareBool().

(Note that the _top_ level comparisons in my "a == a" and "b == b"
examples do _not_ exploit that the comparands are the same object -
it's list_richcompare() calling PyObject_RichCompareBool() where the
latter tells the former that a[0] and a[0] are equal.)
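
A more direct way to watch list_richcompare() lean on the shortcut
(CPython behavior, not a language promise):

>>> class Never:
...     def __eq__(self, other):
...         return False
>>> x = Never()
>>> x == x          # __eq__ does run at the top level
False
>>> [x] == [x]      # but the elements hit the identity shortcut
True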
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/L4RPMHX7USTWLACBC5URKB3WXNBH2DKA/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-01-24 Thread Tim Peters
[Terry Reedy ]
> ...
> It is, in the section on how to understand and use value comparison
> *operators* ('==', etc.).
> https://docs.python.org/3/reference/expressions.html#value-comparisons
>
> First "The default behavior for equality comparison (== and !=) is based
> on the identity of the objects."
>
> Then in particular, "The built-in containers typically assume identical
> objects are equal to themselves. That lets them bypass equality tests
> for identical objects to improve performance and to maintain their
> internal invariants."

Cool!  I didn't find that, and I assume Victor didn't either.  It's good!

I think it needs more words, though, to flesh out what about this is
allowed by the language (as opposed to what CPython happens to do),
and to get closer to what Guido is trying to get at with his
"*implicit* calls".  For example, it's at work here, but there's not a
built-in container in sight:

>>> import math
>>> def f():
...     while True:
...         yield math.nan
>>> math.nan in f()
True


>> For example, in __eq__() method documentation:
>> https://docs.python.org/dev/reference/datamodel.html#object.__eq

> The text that follows discusses rich comparison *special methods* and
> how to write them.  It should refer people to the value-comparisons
> section for information on specific classes, as in the second quote
> above.  It would not hurt if the operator section referred people back
> to special method discussion.  I think you should go ahead and add one
> or both links.

Agreed.  If I want to know why my __eq__ (or __ne__) isn't being
called, it's __eq__/__ne__ docs I'm going to look at first.  And
probably last, unless I'm nudged in the right direction.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/T4BAR4DVTMTR3GOJ2Y2AL2ZBCL6VSPUK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-01-24 Thread Tim Peters
[Tim]
>> I think it needs more words, though, to flesh out what about this is
>> allowed by the language (as opposed to what CPython happens to do),
>> and to get closer to what Guido is trying to get at with his
>> "*implicit* calls".  For example, it's at work here, but there's not a
>> built-in container in sight:
>>
>> >>> import math
>> >>> def f():
>> ...     while True:
>> ...         yield math.nan
>> >>> math.nan in f()
>> True

[Serhiy Storchaka ]
> It is documented in:
> https://docs.python.org/3/reference/expressions.html#membership-test-operations
>
> """
> For user-defined classes which do not define :meth:`__contains__` but do
> define
> :meth:`__iter__`, ``x in y`` is ``True`` if some value ``z``, for which the
> expression ``x is z or x == z`` is true, is produced while iterating
> over ``y``.
> ...

But _should_ it be?  There are two parts to that:

1. Consensus seems to be that whether - and where - implicit equality
comparisons exploit object identity should be implementation-defined.
If so, the quoted docs are over-specified, elevating a CPython
implementation choice to a language requirement.

2. As I said in the first message, consequences of this choice are
"all over the place".  Repeating equivalents of "x is z or x == z" all
over the docs is ugly and error-prone, and especially if it also needs
to spell out that the "x is z" part is something CPython does but that
other implementations may not do.
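
To see the quoted rule in action (CPython behavior - an implementation
that skipped the "x is z" part could legitimately print False here):

>>> import math
>>> class Bag:
...     def __init__(self, items):
...         self.items = list(items)
...     def __iter__(self):
...         return iter(self.items)
>>> math.nan in Bag([math.nan])   # no __contains__: iteration + "x is z or x == z"
True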
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QLBATZTQQU5H4BGJAFRO3DNXN23NB7UU/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-01-24 Thread Tim Peters
[Terry Reedy ]

[& skipping all the parts I agree with]

> ...
> Covered by "For user-defined classes which do not define __contains__()
> but do define __iter__(), x in y is True if some value z, for which the
> expression x is z or x == z is true, is produced while iterating over y.
> " in
>
> https://docs.python.org/3/reference/expressions.html#membership-test-operations

Just went through that with Serhiy.  Very briefly, two problems:

1. Those docs are over-specified if we intend that the "x is z"  is an
implementation choice rather than a language requirement.

2. Repeating all that stuff all over the docs is a PITA, error-prone,
and ugly, since contexts that call PyObject_RichCompareBool() under
the covers are all over the place too.

Document it carefully in _one_ place (the Reference Manual already has
your good start), and just plant links to that in high-value locations
elsewhere in the docs.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/4EVEQZZPEPAGSJ25SSG2EQWQ3MUP2ZHW/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Python library for the ServiceNow REST API

2020-01-28 Thread Tim Peters
[Guido]
> Honestly that looked like a spammer.

I approved the message, and it looked like "probably spam" to me too.
But it may have just been a low-quality message, and the new moderator
UI still doesn't support adding custom text to a rejection message.
Under the old system, I _would_ have rejected it, but added a message
explaining why.

As I've said before, I give broad benefit-of-the-doubt to marginal
messages, and for that reason alone some spam is certain to get
approved from time to time.

A vast majority of spam directed to this list does get discarded,
because there's usually no plausible doubt at all.  I don't believe a
legitimate message has ever been rejected, and _that's_ more important
to me.

> Note that at the bottom they wrote
>
>   servicenow training
>
> A typical strategy for such spammers is to send a message to a popular
> mailing list with some text that looks vaguely related (to the spammer,
> who is usually not well versed in the subject of the list) and add some
> link that they want to promote (presumably a hacked or scam site).

I went to the site first, and my up-to-date Malwarebytes installation
(both the full commercial product and the browser extension) found
nothing to complain about except that the page linked to one ad
network (common as mud).  It got a cleaner report from that than
_most_ sites I visit from day to day.  For example, Malwarebytes also
points out that www.python.org links to one "ad network or tracker"
(ssl.google-analytics.com).


> The list moderator ought to ban the user.

I will not, but fine by me if some other python-dev moderator does.
That poster is still subject to moderation.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/EX2QEOZV3BRH5DH3JZOJPULSSZDQZDG3/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?

2020-02-03 Thread Tim Peters
[Tim]
>> PyObject_RichCompareBool(x, y, op) has a (valuable!) shortcut: if x
>> and y are the same object, then equality comparison returns True
>> and inequality False. No attempt is made to execute __eq__ or
>> __ne__ methods in those cases.
>> ...
>> If it's intended that Python-the-language requires this, that needs to
>> be documented.

[Raymond]
> This has been slowly, but perhaps incompletely documented over the
> years and has become baked into some of the collections ABCs as well.
>  For example, Sequence.__contains__() is defined as:
>
> def __contains__(self, value):
> for v in self:
> if v is value or v == value:  # note the identity test
> return True
> return False

But it's unclear to me whether that's intended to constrain all
implementations, or is just mimicking CPython's list.__contains__.
That's always a problem with operational definitions.  For example,
does it also constrain all implementations to check in iteration
order?  The order can be visible, e.g., in the number of times v.__eq__
is called.
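
For example (an illustrative sketch - the exact count is a CPython
detail, which is the point):

>>> class Loud:
...     calls = 0
...     def __init__(self, v):
...         self.v = v
...     def __eq__(self, other):
...         Loud.calls += 1
...         return isinstance(other, Loud) and self.v == other.v
>>> seq = [Loud(0), Loud(1), Loud(2)]
>>> Loud(1) in seq
True
>>> Loud.calls   # elements were tried in list order until one matched
2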


> Various collections need to assume reflexivity, not just for speed, but
> so that we can reason about them and so that they can maintain internal
> consistency. For example, MutableSet defines pop() as:
>
> def pop(self):
> """Return the popped value.  Raise KeyError if empty."""
> it = iter(self)
> try:
> value = next(it)
> except StopIteration:
> raise KeyError from None
> self.discard(value)
> return value

As above, except CPython's own set implementation doesn't faithfully
conform to that:

>>> x = set(range(0, 10, 2))
>>> next(iter(x))
0
>>> x.pop() # returns first in iteration order
0
>>> x.add(1)
>>> next(iter(x))
1
>>> x.pop()  # ditto
1
>>> x.add(1)  # but try it again!
>>> next(iter(x))
1
>>> x.pop() # oops! didn't pop the first in iteration order
2

Not that I care ;-)  Just emphasizing that it's tricky to say no more
(or less) than what's intended.

> That pop() logic implicitly assumes an invariant between membership and 
> iteration:
>
>assert(x in collection for x in collection)

Missing an "all".

> We really don't want to pop() a value *x* and then find that *x* is still
> in the container.   This would happen if iter() found the *x*, but discard()
> couldn't find the object because the object can't or won't recognize itself:

Speaking of which, why is "discard()" called instead of "remove()"?
It's sending a mixed message:  discard() is appropriate when you're
_not_ sure the object being removed is present.


>  s = {float('NaN')}
>  s.pop()
>  assert not s  # Do we want the language to guarantee that
>   # s is now empty?  I think we must.

I can't imagine an actual container implementation that wouldn't, but
no actual container implements pop() in the odd way MutableSet.pop()
is written.  CPython's set.pop does nothing of the sort - doesn't even
have a pointer equality test (except against C's NULL and `dummy`,
used merely to find "the first (starting at the search finger)" slot
actually in use).

In a world where we decided that the identity shortcut is _not_
guaranteed by the language, the real consequence would be that the
MutableSet.pop() implementation would need to be changed (or made
NotImplemented, or documented as being specific to CPython).

> The code for clear() depends on pop() working:
>
> def clear(self):
> """This is slow (creates N new iterators!) but effective."""
> try:
> while True:
> self.pop()
> except KeyError:
> pass
>
> It would unfortunate if clear() could not guarantee a post-condition that the
> container is empty:

That's again a consequence of how MutableSet.pop was written.  No
actual container has any problem implementing clear() without needing
any kind of object comparison.

>  s = {float('NaN')}
>  s.clear()
>  assert not s   # Can this be allowed to fail?

No, but as above it's a very far stretch to say that clear() emptying
a container _relies_ on the object identity shortcut.  That's a just a
consequence of an odd specific clear() implementation, relying in turn
on an odd specific pop() implementation that assumes the shortcut is
in place.


> The case of count() is less clear-cut, but even there
> identity-implies-equality improves our ability to reason about code:

Absolutely!  That "x is x implies equality" is very useful.  But
that's not the question ;-)

>  Given some list, *s*, possibly already populated, would you want the
> following code to always work:
>
>  c = s.count(x)
>  s.append(x)
>  assert s.count(x) == c + 1  # To me, this is fundamental
>                              # to what the word "count" means.

I would, yes.  But it's also possible to define s.

[Python-Dev] Re: How to enable tracemalloc for the test suite?

2020-04-04 Thread Tim Peters
[Skip Montanaro ]
> I've got a memory issue in my modified Python interpreter I'm trying
> to debug. Output at the end of the problematic unit test looks like this:

To my eyes, you left out the most important part ;-)  A traceback
showing who made the fatal free() call to begin with.

In debug mode, allocations are padded on both ends with various stuff:
FORBIDDENBYTEs (0xfd), a count of the number of bytes originally
requested, and a serial number incremented by 1 each time
alloc/realloc is called.  The memory you requested is entirely filled
with CLEANBYTEs (0xcd).

On deallocation, that stuff is checked (that's where your run is
dying), and the memory is filled with DEADBYTEs (0xdd) (a weak but
nevertheless often effective way to detect "double free"s).

Now in your case, absolutely every byte shown is 0x00.  There are no
useful clues at all remaining.  It's not just a short buffer overrun,
_everything_ has been NULLed out, including before the requested
memory.

So something has gone nuts spraying 0 all over memory ... or the
pointer passed to free() was nonsense to begin with.  If it were a
case of double free, we'd usually expect to see some DEADBYTEs in the
debug output.  But there weren't any.

However, you did not get any output from tracemalloc, despite that it
sounds like you set up the right magic to activate it.  Which leans
toward the "complete nonsense" hypothesis.  If tracemalloc doesn't
find the address in its dict of currently-allocated addresses, looks
like it fails silently, simply not producing any output.  Which is
what you're experiencing.
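
For reference, when the address _is_ one tracemalloc knows about, the
Python-level API does cough up the allocation traceback - a quick
sketch, nothing specific to your build:

    import tracemalloc

    tracemalloc.start(5)   # record up to 5 frames per allocation
    data = [bytearray(1000) for _ in range(3)]
    tb = tracemalloc.get_object_traceback(data[0])
    print("\n".join(tb.format()))   # where data[0]'s memory was allocated

No output at all is what you get when that lookup fails.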

So where was free() called and what was passed to it?  All the
evidence is consistent with the idea that what was passed to free()
was never obtained from a malloc or realloc call to begin with.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/A32E5ZVJ4IZBNDHMLOXH2OOAB2CKAYJN/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: How to enable tracemalloc for the test suite?

2020-04-04 Thread Tim Peters
[Skip Montanaro ]
> ...
> I thought setting PYTHONTRACEMALLOC should provoke some useful output,
> but I was confused into thinking I was (am?) still missed something
> because it continued to produce this message:
>
> Enable tracemalloc to get the memory block allocation traceback

Ah, I overlooked that.


> which suggests to me tracemalloc still isn't enabled. That's emitted
> from Modules/_tracemalloc.c and seems to be properly protected:
>
> if (!_Py_tracemalloc_config.tracing) {
> PUTS(fd, "Enable tracemalloc to get the memory block "
>  "allocation traceback\n\n");
> return;
> }
>
> so I think there is still more to do.

Agreed - for whatever reason, tracemalloc is not enabled for you at
the time your run dies, despite your

> % PYTHONTRACEMALLOC=5 ./python ./Tools/scripts/run_tests.py -R 5:50:reflog.txt test_rattlesnake

command line.  Have to leave that mystery for Victor!

So between the "obvious" possibilities:

- something is going nuts spraying 0 all over memory

- the pointer passed to free() wasn't actually obtained from a
malloc/realloc function

there's nothing in the output to favor one over the other.  Except
that dereferencing addresses near the pointer didn't blow up (null
bytes were returned), so the pointer is at least in the process's
address space.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/74XUF5UNOID5YZ2EOUJD5YXOSNHIBEGT/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: How to enable tracemalloc for the test suite?

2020-04-05 Thread Tim Peters
For posterity, just recording best guesses for the other mysteries left hanging:

- PYTHONTRACEMALLOC didn't work for you because Victor's traceback
showed that Py_FinalizeEx was executing _PyImport_Fini, one statement
_after_ it disabled tracemalloc via _PyTraceMalloc_Fini.

- The address passed to free() was for the struct representing the
False singleton, which is static.  Since it wasn't obtained from a
malloc/realloc call to begin with, the free() debug code didn't find
anything it expected to find.  As noted earlier, if tracemalloc had
still been active, it wouldn't have found that address in its
database, and so would have produced no output at all.

So - that was easy ;-)
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/BRPNFTDWNJKVZS3KUKWYPN46KGER6LDU/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-24 Thread Tim Peters
You got everything right the first time ;-)  The PEP is an extended
illustration of "although that way may not be obvious at first unless
you're Dutch".

I too thought "why not else:?" at first. But "case _:" covers it in
the one obvious way after grasping how general wildcard matches are.
Introducing "else:" too would be adding a wart (redundancy) just to
stop shallow-first-impression whining.

"|" is also a fine way to express alternatives. "case" has its own
sub-language with its own rules, and "|" is widely used to express
alternatives (whether in regexps, formal grammars, ...). Spell it,
e.g., "or", and then I wonder "what does short-circuiting have to do
with it?". All reuse of symbols carries baggage.

".NAME" grated at first, but extends the idea that dotted names are
always constant value patterns to "if and only if". So it has mnemonic
value. When context alone can't distinguish whether a name is meant as
(in effect) an lvalue or an rvalue, no syntax decorations can prevent
coding errors. Names in destructuring constructs are overwhelmingly
intended as lvalues, so adding extra cruft to say "no, I meant rvalue"
is the pragmatic choice.
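
For concreteness, roughly how those pieces fit together (a sketch in the
PEP's proposed spelling - illustrative only, not taken from the PEP):

    import enum

    class Color(enum.Enum):
        RED = 1
        BLUE = 2

    def describe(command, color):
        match (command, color):
            case ("quit" | "exit", _):   # "|" joins alternatives; "_" is the wildcard
                return "bye"
            case (_, Color.RED):         # dotted name: a constant value pattern, compared by ==
                return "seeing red"
            case (cmd, _):               # bare name: a capture, so this case always matches
                return f"unknown command {cmd!r}"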
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YDEHQRHB5S4O6KP5ROUZGOKSR3H54T34/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-24 Thread Tim Peters
[Tim]
>> ".NAME" grated at first, but extends the idea that dotted names are
>> always constant value patterns to "if and only if". So it has mnemonic
>> value. When context alone can't distinguish whether a name is meant as
>> (in effect) an lvalue or an rvalue, no syntax decorations can prevent
>> coding errors. Names in destructuring constructs are overwhelmingly
>> intended as lvalues, so adding extra cruft to say "no, I meant rvalue"
>> is the pragmatic choice.


[Glenn Linderman ]
> This is just a bunch of words to me, without meaning.
>
> I'd like to understand it.

Did you read the PEP?

> What do you mean by "the idea that dotted names are always constant
> value patterns"?

Under the PEP's "Constant Value Pattern"  section:

Every dotted name in a pattern is looked up using normal Python
name resolution rules, and the value is used for comparison by
equality with the matching expression (same as for literals).

That's what I meant.  "Contains a dot" implies "constant value pattern".

> What do you mean by 'extends (the above) to "if and only if" '?

Because the next sentence from the PEP:

 As a special case to avoid ambiguity with name patterns, simple
 names must be prefixed with a dot to be considered a reference:

completes turning "contains a dot" into a necessary and sufficient
("if and only if") condition for distinguishing a constant value
pattern from a name pattern.  Where "constant value pattern" and "name
pattern" are again used with the PEP's meanings.


> As a result of not understanding the above, I see no mnemonic value.

While I do.  "If I want `xyz` to be interpreted as a constant value
pattern, it needs to contain a dot: `.xyz` should do it.  If I want
`enums.HI` to be interpreted as a constant value, it already contains
a dot, so it will be."

> My understanding of the "." as proposed is that it is optional, except
> in cases where it would be ambiguous... seems like it would be better if
> it were required for one case or the other, so that there would be no
> need to determine whether or not it is ambiguous by the surrounding
> state/code/declarations.

A dot is required, when and only when you want the chunk of syntax to
be interpreted as a constant value pattern.  I said nothing at all
about "_leading_" dots, which appear to be all you have in mind there.
_Some_ dot is mandatory to make it a constant value pattern; a
_leading_ dot may or may not be required.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YEUQNW5NUAW2OGMRJN22G3JMSIB4PA4J/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-24 Thread Tim Peters
[Ethan Furman ]
> "case _:" is easy to miss -- I missed it several times reading through the 
> PEP.

As I said, I don't care about "shallow first impressions".  I care
about how a thing hangs together _after_ climbing its learning curve -
which in this case is about a nanometer tall ;-)

You're not seriously going to maintain that you're struggling to grasp
the meaning of "case _:" now, right?

> Huh.  I would consider "case _:" to be the wart, especially since
> "case default:" or "case anything:" or "case i_dont_care:" all do
> basically the same thing (although they bind to the given name,

Having climbed the trivial learning curve, only a deliberate wise ass
would code garbage like "case i_dont_care:". I don't care about them
either. The one obvious way to do it has already been made clear to
them. You may as well, e.g., complain that there's nothing to stop a
wise ass from writing "-5+6" where everyone else writes "+1".

> while _ does not bind to anything, but of what practical importance is
> that?).

One obvious way to do it is of major practical importance.

> ...
> Besides which, if we use "|" instead of "or" then we can't later allow more
> general expressions.

Good!  The PEP is quite complicated enough already.  But if you want
to pursue this seriously, you're going to have your work cut for you
to explain why "|" is more sacred than "or" with respect to "more
general expressions".  If you don't want to preclude anything, then
you need to invent syntax that has no meaning at all now.

>> ".NAME" grated at first, but extends the idea that dotted names are
>> always constant value patterns to "if and only if". So it has mnemonic
>> value.

> How do you get from "." to "iff" ?

See reply to Glenn. Can you give an example of a dotted name that is
not a constant value pattern? An example of a non-dotted name that is?
If you can't do either (and I cannot), then that's simply what "if
and only if" means.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YGTDPCASYUQYCMU5PS2S5YO7DBNDYYWP/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-24 Thread Tim Peters
[Taine Zhao ]
> "or" brings an intuition of the execution order of pattern matching, just
> like how people already know about "short-circuiting".
>
> "or" 's operator precedence also suggests the syntax of OR patterns.
>
> As we have "|"  as an existing operator, it seems that there might be
> cases that the precedence of "|" is not consistent with it in an
> expression. This will mislead users.
>
> You said "All reuse of symbols carries baggage", I'd say,
>
> All **inconsistent** reuse of symbols carries baggage,  but the
> consistent reuse builds good intuitive sense and shows the good
> taste of designers.

We're not talking about abstract computation here:  this is a specific
feature, and "|" is the _only_ infix operator.  The PEP considered and
rejected "&" and a unary "not", so that's the universe we're left
with.  With only one "operator", it's really hard to "mislead" ;-)

In any case, the model here is far more regular expressions than
Python int arithmetic or set unions.  "|" means essentially the same
thing in the PEP as it does in Python regexps:  try subpatterns one at
a time, left to right.
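
For instance, plain re behavior, just to make the analogy concrete:

>>> import re
>>> re.match(r"cat|category", "category").group()
'cat'

The leftmost alternative that succeeds wins; the rest aren't tried.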
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/LH6WUBPBAEPSOAMSB37ZWA2Z2G7X47AB/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-25 Thread Tim Peters
[Rhodri James ]
> I'm seriously going to maintain that I will forget the meaning of "case
> _:" quickly and regularly,

Actually,  you won't - trust me ;-)

> just as I quickly and regularly forget to use
> "|" instead of "+" for set union.  More accurately, I will quickly and
> regularly forget that in this one place, "_" is special.

Because that's the opposite of "accurate". There's nothing special
about "_" "in this one place". It's but a single application of that
"_" is used as a wildcard in _all_ matching contexts throughout the
PEP.

And it's not even new with this PEP.  "_" is routinely used already in
lots of code to mean "the syntax requires a binding target here, but I
don't care about the binding", from

lists = [[] for _ in range(100)]

to

first, _, third = triple

The last is especially relevant, because that's already a form of destructuring.

The only thing new about this use of "_" in the PEP is that it
specifies no binding will occur. Binding does occur in the examples
above (because there's nothing AT ALL special about "_" now - it's
just a one-character identifier, and all the rest is convention,
including that the REPL uses it to store the value of the
last-displayed object).


>> See reply to Glenn. Can you give an example of a dotted name that is
>> not a constant value pattern? An example of a non-dotted name that is?
>> If you can't do either (and I cannot), then that's simply what "if

>case long.chain.of.attributes:

That's a dotted name and so is a constant value pattern - read the PEP.

Every dotted name in a pattern is looked up using normal Python
name resolution rules, and the value is used for comparison by
equality with the matching expression (same as for literals).


> or more likely
>
>case (foo.x, foo.y)

Ditto.

> for the first.  For the second, it's a no-brainer that you can't have a
> non-dotted name as a constant value pattern, since the current constant
> value pattern mandates a leading dot.

Not so.  _Some_ dot is necessary and sufficient to identify a
constant value pattern now.  A leading dot is only _required_ in case
an intended constant value pattern would have no dots otherwise.
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/VDDYNQO7JOEZ2ENSHWIJAYBGXAHLBVLI/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: PEP 622: Structural Pattern Matching

2020-06-25 Thread Tim Peters
[Tim]
>>>> See reply to Glenn. Can you give an example of a dotted name that is
>>>> not a constant value pattern? An example of a non-dotted name that is?
>>>> If you can't do either (and I cannot), then that's simply what "if

[Rhodri James ]
>>> case long.chain.of.attributes:

[Tim]
>> That's a dotted name and so is a constant value pattern - read the PEP.
>>
>>  Every dotted name in a pattern is looked up using normal Python
>>  name resolution rules, and the value is used for comparison by
>>  equality with the matching expression (same as for literals).

[Rhodri]
> Then I am surprised, which is worse.  "long.chain.of.attributes" looks
> like an assignment target, and I would have expected the case to have
> been a name pattern.

As always, I don't care whether something is obvious at first glance.
I care whether something can be learned with reasonable effort, and
"sticks" _after_ it's learned.  There's essentially nothing truly
obvious about programming.

This, from the PEP, is the entire grammar for a "name pattern'"

name_pattern: NAME !('.' | '(' | '=')

That's it.  A plain name not followed by a dot, left paren, or equals sign.

While it may or may not surprise any given person at first glance,
it's very simple.  Put a fraction of the effort into learning it as
you're willing to expend on articulating surprise, and it would
already be far behind you ;-)
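
Concretely (a tiny sketch in the PEP's spelling):

    import math

    def kind(x):
        match x:
            case math.pi:   # NAME followed by '.': a constant value pattern, compared by ==
                return "that's pi"
            case y:         # plain NAME, not followed by '.', '(' or '=': a name pattern
                return f"captured {y!r}"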
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HQHVWUA7TSDOOGJTK4CO32CS3TRHEMW6/
Code of Conduct: http://python.org/psf/codeofconduct/

