[issue43605] Eval/exec and comprehension scopes unclear in documentation

2021-03-27 Thread Cong Ma


Change by Cong Ma :


--
keywords: +patch
pull_requests: +23787
stage: needs patch -> patch review
pull_request: https://github.com/python/cpython/pull/25039


[issue43605] Eval/exec and comprehension scopes unclear in documentation

2021-03-27 Thread Cong Ma


Cong Ma  added the comment:

Some more context: Issue 37646. The demo in that one was "eval inside 
list-comprehension-scope", while this one is the other way around.

Perhaps another example may better illustrate the interplay between eval and 
the execution environment:

```
def f():
    x = 1
    def g():
        return eval("x")
    return g

enc = f()
enc()
```

We get ``NameError: name 'x' is not defined``.

The reason is that, during compilation, the compiler does not (and cannot) care 
what the string "x" means as an argument to eval(). To the compiler it is just a 
string constant passed to a function, not much different from
```
return print("x")
```
The compiler decides that the nested function g() has no locals in its block. 
And since there's no global named ``x`` either, when the dynamic expression is 
evaluated by eval() in that environment, the name doesn't resolve: "eval() 
doesn't have access to the enclosing scope".
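(As an aside, a minimal sketch of a workaround, not part of the original demo: 
hand eval() an explicit namespace in which "x" is bound, instead of relying on 
the caller's scope.)

```
def f():
    x = 1
    def g():
        # Referencing x here makes it a closure variable of g(), so we can
        # pass it to eval() explicitly in a globals mapping:
        return eval("x", {"x": x})
    return g

print(f()())  # 1
```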

But the following is different:

```
def f():
    x = 1
    def g():
        x  # <-
        return eval("x")
    return g

enc = f()
enc()  # return value: 1
```

The marked line makes the name ``x`` visible by merely referencing it in an 
expression statement: the compiler now records ``x`` as a free (closure) 
variable of g(). Inside the function block of g(), we can imagine that name 
resolution "goes up one level" into the enclosing block of f(), where the name 
is bound to the int object. When eval() is called there, the name does resolve.
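One way to see the compiler's decision is to inspect the compiled code object; 
a small sketch based on the example above:

```
def f():
    x = 1
    def g():
        x  # the bare reference makes the compiler record x as a free variable
        return eval("x")
    return g

enc = f()
print(enc.__code__.co_freevars)  # ('x',): a closure cell was created
print(enc())                     # 1
```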

I'm trying to think up a mental model but I'm afraid I can't find a simple one, 
except "once compiled, it's compiled, and eval() must learn to work with the 
already-compiled code". A much more in-depth description of name binding and 
execution in CPython is given here:

https://tenthousandmeters.com/blog/python-behind-the-scenes-5-how-variables-are-implemented-in-cpython/

especially in the section "``LOAD_DEREF`` and ``STORE_DEREF``".
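For the curious, ``dis`` shows the closure-cell load directly; a quick sketch:

```
import dis

def f():
    x = 1
    def g():
        x
        return eval("x")
    return g

# The bare x compiles to LOAD_DEREF, i.e. a load from a closure cell:
dis.dis(f())
```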


[issue43605] Eval/exec and comprehension scopes unclear in documentation

2021-03-27 Thread Cong Ma


Cong Ma  added the comment:

> sum(get(i) for i in range(len(l)))

This expression, inside the body of ``func()``, references the names "get" and 
"l" (ell), both of which are local to the scope introduced by ``func()``. More 
specifically, the two names are referenced from the unnamed inner scope 
introduced by the generator expression ``(get(i) for i in range(len(l)))``. 
It's as if the locals already introduced in func() were passed into that inner 
scope by argument passing, e.g. (a runnable version follows the sketch):

```
def func(...):
    get = ...
    ell = ...
    def genexpr(a, b):
        return ...  # a generator built from the passed-in names a and b
    sum(genexpr(get, ell))
```
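A concrete, runnable version of that sketch (the name genexpr and the explicit 
argument passing are my illustration, not what the compiler literally emits):

```
def func(l):
    def get(i):
        return l[i]
    def genexpr(a, b):
        # stands in for the anonymous scope of (get(i) for i in range(len(l)))
        return (a(i) for i in range(len(b)))
    return sum(genexpr(get, l))

print(func([1, 2, 3]))  # 6
```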

> eval("get(0) + get(1) + get(2) + get(3)")

The expression in the string doesn't introduce its own scope. The name "get" is 
resolved because, without additional arguments, eval() picks up the locals of 
the calling scope, which is where the name "get" comes from.
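For example (a sketch assuming a func() along the lines discussed above):

```
def func(l):
    def get(i):
        return l[i]
    # No inner scope in the evaluated string: get resolves via the
    # caller's locals, which eval() picks up by default.
    return eval("get(0) + get(1)")

print(func([10, 20]))  # 30
```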

> eval("sum(get(i) for i in range(len(l)))", locals())

This tells eval() to use the calling scope's locals (the value returned by the 
call ``locals()``) as the globals for the evaluation of the expression in the 
string. When eval() executes the compiled code, it's as if that piece of code 
lives in an environment where names like "get" and "l" (ell) are top-level. 
Therefore these names resolve.
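A sketch of the same idea:

```
def func(l):
    def get(i):
        return l[i]
    # locals() becomes the *globals* of the evaluated code, so inside the
    # genexpr both get and l resolve as top-level names.
    return eval("sum(get(i) for i in range(len(l)))", locals())

print(func([1, 2, 3]))  # 6
```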

> eval("sum(get(i) for i in range(len(l)))")

Without explicitly telling eval() which globals/locals namespaces to use, 
eval() uses the current calling scope's. This is as if it were called like

eval("sum(get(i) for i in range(len(l)))", globals(), locals())

A problem arises. The generator expression in the string introduces an 
anonymous inner scope (let's call that scope "the box"). Inside the box the 
name "i" is a local; no problem there. But the name "get" is neither local to 
the box nor a global. Unlike other kinds of nested scopes (for example, one 
introduced by an inner ``def`` block), "the box" has no way to look up names in 
enclosing scopes. This is the limitation referred to by the Language 
Reference's section on dynamic execution (see the sketch below).
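A sketch of the failure, assuming the same hypothetical func():

```
def func(l):
    def get(i):
        return l[i]
    # The genexpr's free name get can only be looked up in the globals:
    return eval("sum(get(i) for i in range(len(l)))")

func([1, 2, 3])  # NameError: name 'get' is not defined
```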

These are my attempts to explain why some of these work while others don't, 
based on my own understanding. I hope this helps somewhat; if I've made a 
mistake anywhere, please correct me.


[issue43605] Eval/exec and comprehension scopes unclear in documentation

2021-03-27 Thread Cong Ma


Cong Ma  added the comment:

I'm preparing an update to the documentation of eval/exec. There are several 
issues, but chiefly I'll link to the appropriate sections in the Language 
Reference, and clean up some other inaccuracies. When it's ready I'll submit a 
PR for core devs to review.


[issue43605] Eval/exec and comprehension scopes unclear in documentation

2021-03-27 Thread Cong Ma


Cong Ma  added the comment:

> I think it is *very* reasonable to expect that calling eval() on a string 
> should have the exact same effect as if the code that is inside the eval had 
> been written as part of the source code.

I don't think Python's execution model is defined this way. The documentation 
on execution model says:

> The eval() and exec() functions do not have access to the full environment 
> for resolving names. Names may be resolved in the local and global namespaces 
> of the caller. Free variables are not resolved in the nearest enclosing 
> namespace, but in the global namespace.
> footnote: This limitation occurs because the code that is executed by these 
> operations is not available at the time the module is compiled.

https://docs.python.org/3/reference/executionmodel.html#interaction-with-dynamic-features
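A minimal sketch of the quoted behaviour (using a generator expression, which 
always gets its own scope):

```
g = "global"

def outer():
    g = "enclosing"
    def inner():
        g  # reference g so it becomes a free (closure) variable of inner()
        # eval() itself sees the enclosing g through the caller's locals,
        # but the free name inside the genexpr falls back to the globals:
        return eval("g"), eval("next(g for _ in [0])")
    return inner()

print(outer())  # ('enclosing', 'global')
```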


[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

> The extra macros are provided by optional packages. On Fedora and 
> Debian/Ubuntu the package is called autoconf-archive.

Thank you very much. This (and the patch) is clearing things up for me a lot. 
At first I thought I was supposed to copy the m4 files over from the autoconf 
archive into the m4 directory under the repo.


[issue43605] Issue of scopes unclear in documentation, or wrongly implemented

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

I think this is in the same class of behaviours as

```
def func(l):
    def get(i):
        return l[i]
    # Call the anonymous lambda with the constant 0 as argument:
    print(eval("(lambda x: get(x))(0)"))
```
 
Calls like ``func(["spam"])`` will not "work", and ``NameError`` is raised.

In this case, inside the lambda's body the name "get" can't be resolved. To the 
lambda body, "get" is a free (nonlocal) name, but there's no way for the 
dynamically compiled code to access a nonlocal of the enclosing function.

The comprehensions, like lambdas, are in their own nested scope.
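A hypothetical workaround (not a documented idiom): pass the caller's locals as 
the globals of the evaluated code, so the lambda body can resolve "get" as a 
global.

```
def func(l):
    def get(i):
        return l[i]
    print(eval("(lambda x: get(x))(0)", locals()))

func(["spam"])  # prints 'spam'
```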

--
nosy: +congma


[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

From the configure.ac file:

> dnl ***
> dnl * Please run autoreconf to test your changes! *
> dnl ***

I take it to mean "if configure.ac is changed, run autoreconf first", and 
that's what I did. Could you let me know the intended way to modify 
configure.ac and have the changes take effect?


[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

> If you consider that there is a bug, please open a new issue since you closed 
> this one.

Please see the follow up in Issue 43617.

Many thanks for bearing with me.


[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script

2021-03-24 Thread Cong Ma


Change by Cong Ma :


--
type:  -> compile error


[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script

2021-03-24 Thread Cong Ma


New submission from Cong Ma :

The problem
---

In the repository, the definition for ``AX_CHECK_COMPILE_FLAG`` in Python's 
``configure.ac`` file is missing. If ``autoreconf`` is run, an invalid 
``configure`` script is generated. The following is the behaviour of running 
``autoreconf`` followed by ``configure``:

```
# In cpython repository top-level directory
$ autoreconf
$ mkdir build
$ cd build
$ ../configure  # <- using newly generated configure script
[... omitted ...]
checking for --enable-optimizations... no
../configure: line 6498: syntax error near unexpected token `-fno-semantic-interposition,'
../configure: line 6498: `  AX_CHECK_COMPILE_FLAG(-fno-semantic-interposition,'
```


The solution
---

It appears a file was missing in the m4/ directory. The file matches this one 
from the Autoconf Archive:

https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html

Simply adding the correct m4 file to m4/ should make ``autoreconf`` work.

--
components: Build
messages: 389463
nosy: congma
priority: normal
severity: normal
status: open
title: Missing definition in configure.ac causing autoreconf to create damaged configure script


[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

BTW, do we need to fix the missing definition of the AX_CHECK_COMPILE_FLAG 
macro in configure.ac? This is a separate problem, if a problem at all.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed


[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.

2021-03-24 Thread Cong Ma


Cong Ma  added the comment:

Hello Victor,

I think you're right. This is bogus on my part. TL;DR: The Python version is 
3.8 but I was trying to understand what's going on using the latest source.

Full version: I was trying to understand why the following C file, when 
compiled with -shared using Clang 11, generates a call to Py_FatalError():

```
#define PY_SSIZE_T_CLEAN
#include "Python.h"


void
unreach(void)
{
    Py_UNREACHABLE();
}
```

The headers Python.h and also the ones pulled in by it were actually from 
Python 3.8 release, which unconditionally defines the macro as a call to 
Py_FatalError.


[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.

2021-03-24 Thread Cong Ma


New submission from Cong Ma :

(This is a summarized form of the commit message in the attached patch. I'm 
submitting a patch instead of a PR over GitHub, because it seems that the 
``autoreconf`` output files are part of the repository. In order for the 
changes to take effect in the repo, I may have to run ``autoreconf`` and add 
the clobbered output files to the repo, which I don't think is a good idea. 
Also on my system the ``autoreconf`` can only work correctly if I add a missing 
M4 file "ax_check_compile_flag.m4" from the Autoconf Archive 
<https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html> for 
the ``AX_CHECK_COMPILE_FLAG`` macro used in the existing ``configure.ac``. I 
don't think it's wise for me to introduce so many changes at once if most 
developers don't need to run ``autoreconf`` often.)

The problem
---

Definition of the ``Py_UNREACHABLE()`` macro relied on testing compiler 
versions in preprocessor directives. This is unreliable chiefly because 
compilers masquerade as each other.

The current implementation tests the ``__GNUC__`` and ``__GNUC_MINOR__`` macros 
(requiring GCC version >= 4.5) to determine whether the compiler intrinsic 
``__builtin_unreachable()`` is present (see commits eebaa9bf, 24ba3b0d). 
However, Clang defines these macros too, which can cause confusion: Clang 11 
pretends to be GCC 4.2.1 in its predefined macros. As a result, Clang won't use 
the intrinsic even when it's supported. This doesn't seem to match the intent 
behind the original implementation.

The solution
---

Test the presence of the compiler-builtin ``__builtin_unreachable()`` at 
configure-time using Autoconf, and conditionally define the 
``Py_UNREACHABLE()`` macro depending on the configuration.

The idea is based on the ``ax_gcc_builtin.m4`` code [0] by Gabriele Svelto.

Alternative ideas
---

Recent versions of Clang and GCC support the ``__has_builtin()`` macro.
However, this may be unreliable before Clang 10 [1], while GCC support is only 
available as of GCC 10 and its semantics may not be the same as Clang's [2]. 
Therefore ``__has_builtin()`` may not be as useful as it seems.

We may attempt to improve the accuracy of version checking in ``#if`` 
directives, but this could be brittle and difficult to explain, verify, or 
maintain.

Links
---

[0] https://www.gnu.org/software/autoconf-archive/ax_gcc_builtin.html
[1] https://clang.llvm.org/docs/LanguageExtensions.html#has-builtin
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66970#c24

--
components: Build
files: 0001-Properly-implement-Py_UNREACHABLE-macro-using-autoco.patch
keywords: patch
messages: 389454
nosy: congma
priority: normal
severity: normal
status: open
title: [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.
type: enhancement
Added file: 
https://bugs.python.org/file49910/0001-Properly-implement-Py_UNREACHABLE-macro-using-autoco.patch


[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Cong Ma


Cong Ma  added the comment:

> Idea:  We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in 
the previous message by @mark.dickinson), currently each builtin singleton 
(None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints.

It feels natural and unsurprising to have this, if not officially promised. 
Requiring NaN to be a singleton would violate this implicit understanding about 
singletons, and cause another surprise on top of the other surprises with NaNs 
(or with rich comparison in general).

Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is 
already nested inside two layers of conditionals in `_Py_HashDouble()` in 
Python/pyhash.c:

```
if (!Py_IS_FINITE(v)) {
    if (Py_IS_INFINITY(v))
        return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
    else
        return _PyHASH_NAN;
}
```
(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`).

I don't feel performance would be hurt by rewriting `float_hash()` in 
Objects/floatobject.c to the effect of

```
if (!Py_IS_NAN(v->ob_fval)) {
    return _Py_HashDouble(v->ob_fval);
}
else {
    return _Py_HashPointer(v);
}
```
and simplify the conditionals in `_Py_HashDouble()`. But of course, only real 
benchmarking could tell. (Also, other float-like types would have to be 
adjusted, too).


[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


Cong Ma  added the comment:

Sorry, please ignore my rambling about "float() returning an aliased object" -- 
in that case the problem with hashing doesn't arise.


[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


Cong Ma  added the comment:

Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand 
the implications for user code.

Judging by issues people open on GitHub, such as 
https://github.com/pandas-dev/pandas/issues/28363, people apparently do run 
into the "NaN as key" problem every now and then. (I'm not familiar enough with 
the poster's real problem in the Pandas setting to comment on that one, but it 
does suggest that people run into "real world" problems that share some common 
features with this one.) There are also StackOverflow threads like this one 
(https://stackoverflow.com/a/17062985) where people discuss hashing a data 
table that explicitly uses NaN for missing values. The reason seems to be that 
"[e]mpirical evidence suggests that hashing is an effective strategy for 
dimensionality reduction and practical nonparametric estimation" 
(https://arxiv.org/pdf/0902.2206.pdf).

I cannot say whether the proposed change would make life easier for people who 
really want NaN keys to begin with. However, I think it might reduce exposure 
to worst-case performance in dict (and maybe set/frozenset) while not hurting 
Python's own self-consistency.

Maybe there's one thing to consider about future compatibility, though. The 
proposed fix depends on the current behaviour that floats (and by extension, 
built-in float-like objects such as Decimal and complex) are not cached, unlike 
small ints and interned Unicode objects. So when you explicitly construct a NaN 
by calling float(), you always get a new Python object. Is this guaranteed now 
on different platforms with different compilers? I'm trying to parse the magic 
in Include/pymath.h around the definition of the macro `Py_NAN`, where it seems 
that for certain configurations it could be a `static const union` 
translation-unit-level constant. Does this affect the code that actually builds 
a Python object from an underlying NaN? (I apologize if this is a bad 
question.) But if it doesn't, and we're guaranteed to really get Python NaN 
objects that don't alias when explicitly constructed, is this something 
unlikely to change in the future?

I also wonder if there are security implications for servers that process 
user-submitted input, e.g. running the float() constructor on a list of 
strings, checking for exceptions but silently succeeding on "nan". Arguably 
this is not going to be as common a concern as the string-key collision DoS now 
foiled by hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" 
if one teaches CS theory in real world using Python, see 
https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, 
maybe-theoretical, but seemingly annoying but easy-to-fix issue..


[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


New submission from Cong Ma :

Summary: CPython hashes all NaN values to 0. This guarantees worst-case 
behaviour for dict if numerous existing keys are NaN. I think that by hashing 
NaN using the generic object (or "pointer") hash instead, the worst-case 
situation can be alleviated without changing the semantics of either dict or 
float. However, this also requires changes to how complex and Decimal objects 
hash, and moreover an incompatible change to sys.hash_info. I would like to 
hear how Python developers think about this matter.



Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in 
Include/pyhash.h) as the hash value of all floating point NaNs. The value is 
part of the sys.hash_info API and is re-used by complex and Decimal in 
computing their hashes, in accordance with the Python built-in type 
documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash 
value.")

This is normally not a great concern, except for the worst-case performance. 
The problem is that, since all NaNs hash to the same value and are guaranteed 
to compare unequal to any compatible numeric value -- not even equal to 
themselves -- they are guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all 
hashable NaNs with the same hash value.

There has been some discussion about this over the Web for some time (see 
[1]). In [1], the demo script times the insertion of k distinct NaN keys (as 
different objects) into a freshly created dict. Since the keys are distinct and 
guaranteed to collide with each other, the run time of a single 
lookup/insertion is roughly linear in the existing number of NaN keys. I've 
recreated the same script with a more modern Python (attached); a minimal 
sketch of the idea follows.
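```
import time

def time_nan_inserts(n):
    d = {}
    t0 = time.perf_counter()
    for _ in range(n):
        d[float("nan")] = None  # every float("nan") is a distinct object
    return time.perf_counter() - t0

for n in (1000, 2000, 4000):
    # all keys hash to 0, so total time grows roughly quadratically
    print(n, round(time_nan_inserts(n), 3))
```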

I'd suggest a fix for this worst-case behaviour: instead of returning the hash 
value 0 for all NaNs, use the generic object (pointer) hash for these objects. 
As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__()
```
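A quick sanity check of the PoC (assuming the class above):

```
d = {}
keys = [myfloat("nan") for _ in range(100)]
for k in keys:
    d[k] = id(k)

assert all(d[k] == id(k) for k in keys)  # retrieval via the same object works
assert len({hash(k) for k in keys}) > 1  # hashes no longer all collide
```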

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. the existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requiring the removal of the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such 
as math.nan and how they behave. The "NaN constant" has never been the problem 
to begin with; it's only *distinct* NaN objects that may cause the worst-case 
behaviour.



Just for the record, I'd also like to clear up some outdated info or 
misconceptions about NaN keys in Python dicts. It's not true that NaN keys, 
once inserted, cannot be retrieved (as claimed, e.g., in [1][2]). In Python 
they can be, if you keep the original key *object* around by holding a 
reference to it (or obtain a new one from the dict by iterating over it). This, 
I think, is because a Python dict compares for object identity before 
rich-comparing for equality in `lookdict()` in Objects/dictobject.c, so this 
works for `d = dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard a NaN float object behaves exactly like an object() instance as 
a key -- except for the collisions. That's why I think, at least for floats, 
the "object" hash is likely to work. The PRNG-based solution in [1] 
(implemented in the Go language) is not necessary for CPython, because the live 
objects are already distinct.



Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

--
components: Interpreter Core, Library (Lib)
files: nan_key.py
messages: 388508
nosy: congma
priority: normal
severity: normal
status: open
title: Worst-case behaviour of hash collision with float NaN
type: performance
Added file: https://bugs.python.org/file49869/nan_key.py
