[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...
Github user superbobry closed the pull request at: https://github.com/apache/spark/pull/21180 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184889165

--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
         for k, v in _old_namedtuple_kwdefaults.items():
             kwargs[k] = kwargs.get(k, v)
         cls = _old_namedtuple(*args, **kwargs)
-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)
--- End diff --

Another way to fix the module name is to get rid of the extra stack frame. This could be done by, e.g., modifying the bytecode of `collections.namedtuple` so that it redirects to `_hack_namedtuple` when needed. However, I think the current implementation (despite being magical) is better than bytecode hacking, since it is no more magical than `collections.namedtuple` itself. I can change the PR to do the same as `collections.namedtuple` for cross-interpreter compatibility, wdyt?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184870575

--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
         for k, v in _old_namedtuple_kwdefaults.items():
             kwargs[k] = kwargs.get(k, v)
         cls = _old_namedtuple(*args, **kwargs)
-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)
--- End diff --

Yea, but the thing is, the doc says this is not guaranteed, although most Python implementations appear to have it - there is a risk we would be taking here (it could break in a specific implementation of Python, although that sounds unlikely). Is there any other way to avoid this?
Github user superbobry commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184860643

--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
         for k, v in _old_namedtuple_kwdefaults.items():
             kwargs[k] = kwargs.get(k, v)
         cls = _old_namedtuple(*args, **kwargs)
-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)
--- End diff --

Good point. [`collections.namedtuple`](https://github.com/python/cpython/blob/master/Lib/collections/__init__.py#L466) has a fix for Jython and IronPython. I can backport it for completeness, but realistically, the probability of someone running PySpark on these implementations is not very high.
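For context, the compatibility fix in CPython's `collections.namedtuple` (linked above) simply wraps the frame lookup in a try/except and falls back to a default module name. A minimal sketch of that pattern (the helper name here is illustrative, not part of the PR):

```python
import sys

def caller_module_name(default="__main__"):
    # Mirrors the guard used by collections.namedtuple: sys._getframe may
    # be missing entirely (AttributeError) on interpreters such as Jython
    # or IronPython, and an invalid frame depth raises ValueError.
    try:
        return sys._getframe(1).f_globals.get("__name__", default)
    except (AttributeError, ValueError):
        return default
```

With this guard, the module-name lookup degrades gracefully instead of crashing on interpreters that do not expose `sys._getframe`.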
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21180#discussion_r184854019

--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
         for k, v in _old_namedtuple_kwdefaults.items():
             kwargs[k] = kwargs.get(k, v)
         cls = _old_namedtuple(*args, **kwargs)
-        return _hack_namedtuple(cls)
+
+        import sys
+        f = sys._getframe(1)
--- End diff --

Hm .. https://docs.python.org/2/library/sys.html#sys._getframe

> CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.

Is it safe to use it?
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/21180

[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples, regardless of whether they were defined at the top level of some module (and are therefore picklable using the default ``__reduce__`` implementation) or not. The PR ensures that only the non-picklable namedtuples are hacked, i.e. the ones defined in the REPL or locally in a function or method. Note that a namedtuple might be defined locally but still be picklable without the hack applied:

    def define():
        global Foo
        Foo = namedtuple("Foo", [])

The current implementation does not cover such cases and will apply the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples. Due to an extra layer of indirection added by ``_hijack_namedtuple``, all hacked namedtuples had "collections" as ``__module__``. This is no longer the case.

Tested with SerializationTestCase and RDDTests.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/criteo-forks/spark hijack-non-importable-namedtuple

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21180.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21180

commit 37b0f6d14fcd48b9bd05b6f43b5cfb6284200367
Author: Sergei Lebedev
Date: 2018-04-27T14:34:12Z

    [SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

    Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples, regardless of whether they were defined at the top level of some module (and are therefore picklable using the default ``__reduce__`` implementation) or not. The PR ensures that only the non-picklable namedtuples are hacked, i.e. the ones defined in the REPL or locally in a function or method. Note that a namedtuple might be defined locally but still be picklable without the hack applied:

        def define():
            global Foo
            Foo = namedtuple("Foo", [])

    The current implementation does not cover such cases and will apply the hack anyway.

    Sidenote: the PR also fixes the module name of the hacked namedtuples. Due to an extra layer of indirection added by ``_hijack_namedtuple``, all hacked namedtuples had "collections" as ``__module__``. This is no longer the case.

    Tested with SerializationTestCase and RDDTests.
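The picklability distinction the PR relies on can be demonstrated with plain `pickle`, independent of PySpark. A hedged sketch (class and function names below are illustrative, not from the patch):

```python
import pickle
from collections import namedtuple

# Top-level namedtuple: picklable out of the box, because the default
# __reduce__ lets pickle re-import the class by module and name.
Point = namedtuple("Point", ["x", "y"])

def define_global():
    # Defined inside a function but promoted to module scope via `global`,
    # so instances still pickle without any _hack_namedtuple-style rewrite.
    global Foo
    Foo = namedtuple("Foo", [])

def define_local():
    # Stays function-local: pickle cannot look the class up by name,
    # so dumping an instance raises PicklingError.
    return namedtuple("Bar", [])

define_global()
point_roundtrip = pickle.loads(pickle.dumps(Point(1, 2)))
foo_roundtrip = pickle.loads(pickle.dumps(Foo()))
```

The `define_global` case is exactly the corner the PR description calls out: the class is created inside a function, yet it is importable from the module and therefore picklable, so applying the hack to it is unnecessary.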