[issue43605] Eval/exec and comprehension scopes unclear in documentation
Change by Cong Ma : -- keywords: +patch pull_requests: +23787 stage: needs patch -> patch review pull_request: https://github.com/python/cpython/pull/25039 ___ Python tracker <https://bugs.python.org/issue43605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43605] Eval/exec and comprehension scopes unclear in documentation
Cong Ma added the comment: Some more context: Issue 37646. The demo in that one was "eval inside list-comprehension scope", while this one is the other way around. Perhaps another example better illustrates the interplay between eval() and the execution environment:

```
def f():
    x = 1
    def g():
        return eval("x")
    return g

enc = f()
enc()
```

We get ``NameError: name 'x' is not defined``. The reason is that during compilation the compiler doesn't, and cannot, care about what the string "x" means as an argument to eval(). To the compiler it's just a string constant passed to a function, not much different from

```
return print("x")
```

The compiler decides that the enclosed function g() has no locals in its block. And since there's no global named ``x`` either, when the dynamic expression is evaluated by eval() in that environment, the name doesn't resolve, because "eval() doesn't have access to the enclosing scope". But the following is different:

```
def f():
    x = 1
    def g():
        x  # <-
        return eval("x")
    return g

enc = f()
enc()  # return value: 1
```

The marked line makes ``x`` a free (closure) variable of g() by virtue of merely appearing in an expression statement. Inside the function block of g(), we can imagine that the name resolution "goes up one level" into the enclosing block of f(), where it is bound to the int object. When eval() is called there, the name does resolve. I'm trying to think up a mental model, but I'm afraid I can't find a simple one, except "once compiled, it's compiled, and eval() must work with the already-compiled code". A much more in-depth description of name binding and execution in CPython is given here: https://tenthousandmeters.com/blog/python-behind-the-scenes-5-how-variables-are-implemented-in-cpython/ especially in the section "``LOAD_DEREF`` and ``STORE_DEREF``".
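The compiler's decision described above can be observed directly on the code objects. A minimal sketch (my illustration, not part of the original message; the function names are mine): ``co_freevars`` shows whether ``x`` was compiled as a closure variable of g().

```python
def f_without():
    x = 1
    def g():
        return eval("x")  # "x" is only a string constant here
    return g

def f_with():
    x = 1
    def g():
        x  # bare reference: the compiler makes x a free variable of g
        return eval("x")
    return g

# g compiled without the bare reference has no free variables...
print(f_without().__code__.co_freevars)  # ()
# ...while the marked line turns x into a closure cell, so eval() sees it
print(f_with().__code__.co_freevars)     # ('x',)
print(f_with()())                        # 1
```

Calling ``f_without()()`` raises the NameError discussed above, while ``f_with()()`` returns 1.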
-- ___ Python tracker <https://bugs.python.org/issue43605> ___
[issue43605] Eval/exec and comprehension scopes unclear in documentation
Cong Ma added the comment: > sum(get(i) for i in range(len(l))) This expression inside the body of ``func()`` references the names "get" and "l" (ell); both are local to the scope introduced by ``func()``. More specifically, these two names are referenced in the unnamed inner scope introduced by the generator expression ``(get(i) for i in range(len(l)))``. It's as if the locals already introduced in func() were passed into that inner scope by argument passing, e.g.

```
def func(...):
    get = ...
    l = ...
    def genexpr(get, l):
        ...
    return sum(genexpr(get, l))
```

> eval("get(0) + get(1) + get(2) + get(3)") The expression in the string doesn't introduce its own scope. The name "get" is resolved because, without additional arguments, eval() gets the locals from the calling scope, which is where the name "get" came from. > eval("sum(get(i) for i in range(len(l)))", locals()) This tells eval() to use the calling scope's locals (the value returned by the call ``locals()``) as the globals for the evaluation of the expression in the string. When eval() executes the compiled code, it's as if that piece of code lives in an environment where names like "get" and "l" (ell) are top-level. Therefore these names are resolved. > eval("sum(get(i) for i in range(len(l)))") Without being told explicitly which globals/locals namespaces to use, eval() uses the current calling scope's. This is as if it were called like eval("sum(get(i) for i in range(len(l)))", globals(), locals()). A problem arises. The generator expression in the string introduces an anonymous inner scope (let's call that scope "the box"). Inside the box, the name "i" is a local; there's no problem. But the name "get" is not local to the box, and it's not a global. Unlike other kinds of enclosed scope (for example, one introduced by an inner ``def`` block), "the box" has no way to look up names in enclosing scopes. This is the limitation referred to by the Language Reference's section on dynamic execution.
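The four cases above can be exercised together in one sketch (my own, reusing the names from the discussion):

```python
def func(l):
    def get(i):
        return l[i]

    # compiled as part of the source: the genexpr's hidden scope can
    # reach get and l as closure variables
    a = sum(get(i) for i in range(len(l)))

    # plain expression in eval(): no new scope is introduced, so "get"
    # is found in the caller's locals
    b = eval("get(0) + get(1)")

    # genexpr inside eval(): "get" is neither local to the hidden scope
    # nor global, and the hidden scope cannot reach enclosing locals
    try:
        eval("sum(get(i) for i in range(len(l)))")
        failed = False
    except NameError:
        failed = True

    # passing the caller's locals as the *globals* makes the names
    # top-level for the evaluated code, so they resolve
    c = eval("sum(get(i) for i in range(len(l)))", locals())

    return a, b, failed, c

print(func([10, 20]))  # (30, 30, True, 30)
```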
These are my attempts to explain why some things work while others don't, based on my own understanding. I hope this helps somewhat, and if I made a mistake anywhere, please correct me. -- ___ Python tracker <https://bugs.python.org/issue43605> ___
[issue43605] Eval/exec and comprehension scopes unclear in documentation
Cong Ma added the comment: I'm preparing an update to the documentation of eval/exec. There are several issues; chiefly, I'll link to the appropriate sections in the Language Reference, and clean up some other inaccuracies. When it's ready I'll submit a PR for core devs to review. -- ___ Python tracker <https://bugs.python.org/issue43605> ___
[issue43605] Eval/exec and comprehension scopes unclear in documentation
Cong Ma added the comment:

> I think it is *very* reasonable to expect that calling eval() on a string
> should have the exact same effect as if the code that is inside the eval had
> been written as part of the source code.

I don't think Python's execution model is defined this way. The documentation on the execution model says:

> The eval() and exec() functions do not have access to the full environment
> for resolving names. Names may be resolved in the local and global namespaces
> of the caller. Free variables are not resolved in the nearest enclosing
> namespace, but in the global namespace.

> footnote: This limitation occurs because the code that is executed by these
> operations is not available at the time the module is compiled.

https://docs.python.org/3/reference/executionmodel.html#interaction-with-dynamic-features

-- ___ Python tracker <https://bugs.python.org/issue43605> ___
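The rule quoted in the message above can be demonstrated with a short sketch (mine, not from the tracker thread; the names are illustrative): a name that would have been resolved in the enclosing scope by ordinary compiled code is instead looked up in the global namespace by eval().

```python
y = "global"

def outer():
    y = "enclosing"
    def inner():
        # "y" is not referenced in inner's own code, so it is neither a
        # local nor a free variable here; eval() falls back to the
        # global namespace, skipping the enclosing scope entirely
        return eval("y")
    return inner()

print(outer())  # global
```

Had the body of inner() been written as plain source, ``y`` would have resolved to "enclosing" via the closure; through eval() it resolves to "global".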
[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script
Cong Ma added the comment: > The extra macros are provided by optional packages. On Fedora and > Debian/Ubuntu the package is called autoconf-archive. Thank you very much. This (and the patch) clears things up for me a lot. At first I thought I was supposed to copy the m4 files over from the Autoconf Archive into the m4 directory under the repo. -- ___ Python tracker <https://bugs.python.org/issue43617> ___
[issue43605] Issue of scopes unclear in documentation, or wrongly implemented
Cong Ma added the comment: I think this is in the same class of behaviours as

```
def func(l):
    def get(i):
        return l[i]
    # Call the anonymous lambda with the constant 0 as argument
    print(eval("(lambda x: get(x))(0)"))
```

Calls like ``func(["spam"])`` will not "work", and ``NameError`` is raised. In this case, the name "get" can't be resolved inside the lambda's body. For the lambda body, "get" is a nonlocal name, but code compiled by eval() has no way to access the enclosing scope's names. The comprehensions, like lambdas, are in their own nested scope. -- nosy: +congma ___ Python tracker <https://bugs.python.org/issue43605> ___
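A runnable sketch of the failure mode just described, next to a variant that works because the eval'd string introduces no new scope (my illustration; ``func_flat`` is a name I made up):

```python
def func(l):
    def get(i):
        return l[i]
    # the lambda introduces a new scope inside the eval'd code; "get"
    # is not a local there, not a global, and the eval'd code has no
    # closure over func's locals, so NameError is raised
    return eval("(lambda x: get(x))(0)")

def func_flat(l):
    def get(i):
        return l[i]
    # no inner scope introduced: "get" resolves in the caller's locals
    return eval("get(0)")

try:
    func(["spam"])
except NameError:
    print("NameError, as described")

print(func_flat(["spam"]))  # spam
```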
[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script
Cong Ma added the comment: From the configure.ac file:

> dnl ***
> dnl * Please run autoreconf to test your changes! *
> dnl ***

I take it to mean "if configure.ac is changed, run autoreconf first", and that's what I did. Could you let me know what the intended way is to modify configure.ac and have it take effect? -- ___ Python tracker <https://bugs.python.org/issue43617> ___
[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.
Cong Ma added the comment: > If you consider that there is a bug, please open a new issue since you closed > this one. Please see the follow-up in Issue 43617. Many thanks for bearing with me. -- ___ Python tracker <https://bugs.python.org/issue43615> ___
[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script
Change by Cong Ma : -- type: -> compile error ___ Python tracker <https://bugs.python.org/issue43617> ___
[issue43617] Missing definition in configure.ac causing autoreconf to create damaged configure script
New submission from Cong Ma :

The problem
---

The definition of the ``AX_CHECK_COMPILE_FLAG`` macro used in Python's ``configure.ac`` file is missing from the repository. If ``autoreconf`` is run, an invalid ``configure`` script is generated. The following is the behaviour of running ``autoreconf`` followed by ``configure``:

```
# In cpython repository top-level directory
$ autoreconf
$ mkdir build
$ cd build
$ ../configure  # <- using newly generated configure script
[... omitted ...]
checking for --enable-optimizations... no
../configure: line 6498: syntax error near unexpected token `-fno-semantic-interposition,'
../configure: line 6498: ` AX_CHECK_COMPILE_FLAG(-fno-semantic-interposition,'
```

The solution
---

It appears a file is missing from the m4/ directory. The file matches this one from the Autoconf Archive: https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html Simply adding the correct m4 file to m4/ should make ``autoreconf`` work.

-- components: Build messages: 389463 nosy: congma priority: normal severity: normal status: open title: Missing definition in configure.ac causing autoreconf to create damaged configure script ___ Python tracker <https://bugs.python.org/issue43617> ___
[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.
Cong Ma added the comment: BTW, do we need to fix the missing definition of the AX_CHECK_COMPILE_FLAG macro in configure.ac? This is a separate problem, if a problem at all. -- resolution: -> not a bug stage: -> resolved status: open -> closed ___ Python tracker <https://bugs.python.org/issue43615> ___
[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.
Cong Ma added the comment: Hello Victor, I think you're right. This is bogus on my part.

TL;DR: The Python version is 3.8, but I was trying to understand what's going on using the latest source.

Full version: I was trying to understand why the following C file, when compiled with -shared using Clang 11, generates a call to Py_FatalError():

```
#define PY_SSIZE_T_CLEAN
#include "Python.h"

void
unreach(void)
{
    Py_UNREACHABLE();
}
```

The header Python.h, and the ones pulled in by it, were actually from the Python 3.8 release, which unconditionally defines the macro as a call to Py_FatalError.

-- ___ Python tracker <https://bugs.python.org/issue43615> ___
[issue43615] [PATCH] Properly implement Py_UNREACHABLE macro using autoconf.
New submission from Cong Ma :

(This is a summarized form of the commit message in the attached patch. I'm submitting a patch instead of a PR over GitHub, because it seems that the ``autoreconf`` output files are part of the repository. In order for the changes to take effect in the repo, I may have to run ``autoreconf`` and add the clobbered output files to the repo, which I don't think is a good idea. Also, on my system ``autoreconf`` can only work correctly if I add a missing M4 file "ax_check_compile_flag.m4" from the Autoconf Archive <https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html> for the ``AX_CHECK_COMPILE_FLAG`` macro used in the existing ``configure.ac``. I don't think it's wise for me to introduce so many changes at once if most developers don't need to run ``autoreconf`` often.)

The problem
---

The definition of the ``Py_UNREACHABLE()`` macro relies on testing compiler versions in preprocessor directives. This is unreliable, chiefly because compilers masquerade as each other. The current implementation tests the ``__GNUC__`` and ``__GNUC_MINOR__`` macros with the logic "GCC version >= 4.5" to determine whether the compiler intrinsic ``__builtin_unreachable()`` is present (see commits eebaa9bf, 24ba3b0d). However, Clang defines these macros too, and this causes confusion: Clang 11 pretends to be GCC 4.2.1 in its predefined macros. As a result, Clang won't use the intrinsic even if it's supported. This doesn't seem to match the intent behind the original implementation.

The solution
---

Test for the presence of the compiler builtin ``__builtin_unreachable()`` at configure-time using Autoconf, and conditionally define the ``Py_UNREACHABLE()`` macro depending on the configuration. The idea is based on the ``ax_gcc_builtin.m4`` code [0] by Gabriele Svelto.

Alternative ideas
---

Recent versions of Clang and GCC support the ``__has_builtin()`` macro.
However, this may be unreliable before Clang 10 [1], while GCC support is only available as of GCC 10, and its semantics may not be the same as Clang's [2]. Therefore ``__has_builtin()`` may not be as useful as it seems. We could instead attempt to improve the accuracy of the version checking in ``#if`` directives, but this would be brittle and difficult to explain, verify, or maintain.

Links
---

[0] https://www.gnu.org/software/autoconf-archive/ax_gcc_builtin.html
[1] https://clang.llvm.org/docs/LanguageExtensions.html#has-builtin
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66970#c24

-- components: Build files: 0001-Properly-implement-Py_UNREACHABLE-macro-using-autoco.patch keywords: patch messages: 389454 nosy: congma priority: normal severity: normal status: open title: [PATCH] Properly implement Py_UNREACHABLE macro using autoconf. type: enhancement Added file: https://bugs.python.org/file49910/0001-Properly-implement-Py_UNREACHABLE-macro-using-autoco.patch ___ Python tracker <https://bugs.python.org/issue43615> ___
[issue43475] Worst-case behaviour of hash collision with float NaN
Cong Ma added the comment:

> Idea: We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in the previous message by @mark.dickinson), currently each builtin singleton (None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints. It feels natural and unsurprising to have this, even if not officially promised. Requiring NaN to be a singleton would violate this implicit understanding about singletons, and cause another surprise on top of the other surprises with NaNs (or with rich comparison in general).

Performance-wise, the current behaviour (returning _PyHASH_NAN == 0) is already nested inside two layers of conditionals in `_Py_HashDouble()` in Python/pyhash.c:

```
if (!Py_IS_FINITE(v)) {
    if (Py_IS_INFINITY(v))
        return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
    else
        return _PyHASH_NAN;
}
```

(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`). I don't feel performance would be hurt by rewriting `float_hash()` in Objects/floatobject.c to the effect of

```
if (!Py_IS_NAN(v->ob_fval)) {
    return _Py_HashDouble(v->ob_fval);
}
else {
    return _Py_HashPointer(v);
}
```

and simplifying the conditionals in `_Py_HashDouble()`. But of course, only real benchmarking could tell. (Also, other float-like types would have to be adjusted too.)

-- ___ Python tracker <https://bugs.python.org/issue43475> ___
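The singleton predicate mentioned above, and how a hypothetical NaN singleton would break it, can be checked in a few lines (my sketch, not part of the original message):

```python
import math

# every existing builtin singleton compares equal to itself
for s in (None, True, False, Ellipsis):
    assert s == s

# a NaN object is identical to itself but never equal to itself, so a
# NaN singleton would be the one singleton violating the predicate
nan = float("nan")
assert nan is nan
assert nan != nan
assert math.isnan(nan)
```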
[issue43475] Worst-case behaviour of hash collision with float NaN
Cong Ma added the comment: Sorry, please ignore my rambling about "float() returning aliased object" -- in that case the problem with hashing doesn't arise. -- ___ Python tracker <https://bugs.python.org/issue43475> ___
[issue43475] Worst-case behaviour of hash collision with float NaN
Cong Ma added the comment: Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand the implications for user code. Judging by issues people open on GitHub, like this one: https://github.com/pandas-dev/pandas/issues/28363 apparently people do run into the "NaN as key" problem every now and then. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment on that one, but it seems to suggest people do run into "real world" problems that share some common features with this one.) There are also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly uses NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf)

I cannot say whether the proposed change would make life easier for people who really want NaN keys to begin with. However, I think it might reduce the exposure to worst-case performance in dict (and maybe set/frozenset), while not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility, though: the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of the macro `Py_NAN`, where it seems to me that for certain configs it could be a `static const union` translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN?
(I apologize if this is a bad question.) But if it doesn't, and we're guaranteed to really have Python NaN objects that don't alias when explicitly constructed, is this something unlikely to change in the future?

I also wonder if there are security implications for servers that process user-submitted input, say running the float() constructor on some list of strings, checking for exceptions but silently succeeding on "nan": arguably this is not going to be as common a concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure.

On "theoretical interest": theoretical interests can also be "real world" if one teaches CS theory in the real world using Python; see https://bugs.python.org/issue43198#msg386849 So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, but seemingly annoying yet easy-to-fix issue.

-- ___ Python tracker <https://bugs.python.org/issue43475> ___
[issue43475] Worst-case behaviour of hash collision with float NaN
New submission from Cong Ma :

Summary: CPython hashes all NaN values to 0. This guarantees worst-case behaviour for dict if numerous existing keys are NaN. I think that by hashing NaN using the generic object (or "pointer") hash instead, the worst-case situation can be alleviated without changing the semantics of either dict or float. However, this also requires changes to how complex and Decimal objects hash, and moreover an incompatible change to sys.hash_info. I would like to hear how Python developers think about this matter.

Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in Include/pyhash.h) as the hash value of all floating point NaNs. The value is part of the sys.hash_info API and is re-used by complex and Decimal in computing their hashes, in accordance with the Python built-in-type documentation. [0] (The doc [0] specifically says that "[a]ll hashable nans have the same hash value.")

This is normally not a great concern, except for the worst-case performance. The problem is that, since NaNs hash to the same value and are guaranteed to compare unequal to any compatible numeric value -- not even equal to themselves -- they are guaranteed to collide. For this reason I'd like to question whether it is a good idea to have all hashable NaNs share the same hash value. There have been some discussions about this over the Web for some time (see [1]). In [1] the demo Python script times the insertion of k distinct NaN keys (as different objects) into a freshly created dict. Since the keys are distinct and guaranteed to collide with each other, the run time of a single lookup/insertion is roughly linear in the existing number of NaN keys. I've recreated the same script using a more modern Python (attached).

I'd suggest a fix for this worst-case behaviour: instead of returning the hash value 0 for all NaNs, use the generic object (pointer) hash for these objects.
As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__()
```

This will

- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. the existing number of NaN keys).

However, it will

- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (it requires the removal of the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such as math.nan and how they behave. The "NaN constant" has never been a problem to begin with. It's only the *distinct* NaN objects that may cause the worst-case behaviour.

Just for the record, I'd also like to clear up some outdated info or misconceptions about NaN keys in Python dicts. It's not true that NaN keys, once inserted, cannot be retrieved (as claimed, e.g., in [1][2]). In Python they can be, if you keep the original key *object* around by keeping a reference to it (or obtaining a new one from the dict by iterating over it). This, I think, is because Python dict compares for object identity before rich-comparing for equality in `lookdict()` in Objects/dictobject.c. So this works for `d = dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard a NaN float object behaves exactly like an object() instance as a key -- except for the collisions. That's why I think, at least for floats, the "object" hash is likely to work.
The solution using a PRNG [1] (implemented in the Go language) is not necessary for CPython, because the live objects are already distinct.

Links:
[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

-- components: Interpreter Core, Library (Lib) files: nan_key.py messages: 388508 nosy: congma priority: normal severity: normal status: open title: Worst-case behaviour of hash collision with float NaN type: performance Added file: https://bugs.python.org/file49869/nan_key.py ___ Python tracker <https://bugs.python.org/issue43475> ___
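The identity-before-equality behaviour of dict lookup described in the report is easy to verify (my sketch; the attached nan_key.py is not reproduced here, and no assumption is made about whether NaN hashes are equal, so this holds both before and after any change to NaN hashing):

```python
f = float("nan")
d = {}
d[f] = "value"

# lookup by the very same object succeeds: lookdict() checks identity
# before falling back to the (always-false) equality test on NaN
assert d[f] == "value"

# a distinct NaN object never compares equal to the stored key, so
# the lookup misses
try:
    d[float("nan")]
    found = True
except KeyError:
    found = False
assert not found
```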