[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-09-27 Thread superbobry
Github user superbobry closed the pull request at:

https://github.com/apache/spark/pull/21180


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-29 Thread superbobry
Github user superbobry commented on a diff in the pull request:

https://github.com/apache/spark/pull/21180#discussion_r184889165
  
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
 for k, v in _old_namedtuple_kwdefaults.items():
 kwargs[k] = kwargs.get(k, v)
 cls = _old_namedtuple(*args, **kwargs)
-return _hack_namedtuple(cls)
+
+import sys
+f = sys._getframe(1)
--- End diff --

Another way to fix the module name is to get rid of the extra stack frame. 
This can be done by e.g. modifying the bytecode of `collections.namedtuple` so 
that it redirected to `_hack_namedtuple` when needed. However, I think that the 
current implementation (despite being magical) is better than bytecode hacking 
since it is as magical as `collections.namedtuple` itself.

I can change the PR to do the same as `collections.namedtuple` for 
cross-interpreter compatibility, wdyt?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21180#discussion_r184870575
  
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
 for k, v in _old_namedtuple_kwdefaults.items():
 kwargs[k] = kwargs.get(k, v)
 cls = _old_namedtuple(*args, **kwargs)
-return _hack_namedtuple(cls)
+
+import sys
+f = sys._getframe(1)
--- End diff --

Yea but thing is, that the doc says this not guaranteed although most of 
Python implementations look having it - there's a risk here we should take (it 
could be broken in a specific implementation of Python although it sounds 
unlikely). Is there any other way to avoid this? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-28 Thread superbobry
Github user superbobry commented on a diff in the pull request:

https://github.com/apache/spark/pull/21180#discussion_r184860643
  
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
 for k, v in _old_namedtuple_kwdefaults.items():
 kwargs[k] = kwargs.get(k, v)
 cls = _old_namedtuple(*args, **kwargs)
-return _hack_namedtuple(cls)
+
+import sys
+f = sys._getframe(1)
--- End diff --

Good point. 
[`collections.nametuple`](https://github.com/python/cpython/blob/master/Lib/collections/__init__.py#L466)
 has a fix for Jython and IronPython. I can backport it for completeness, but 
realistically, the probability of someone running PySpark on these 
implementations is not very high.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21180#discussion_r184854019
  
--- Diff: python/pyspark/serializers.py ---
@@ -523,7 +523,21 @@ def namedtuple(*args, **kwargs):
 for k, v in _old_namedtuple_kwdefaults.items():
 kwargs[k] = kwargs.get(k, v)
 cls = _old_namedtuple(*args, **kwargs)
-return _hack_namedtuple(cls)
+
+import sys
+f = sys._getframe(1)
--- End diff --

Hm .. https://docs.python.org/2/library/sys.html#sys._getframe

> CPython implementation detail: This function should be used for internal 
and specialized purposes only. It is not guaranteed to exist in all 
implementations of Python.

Is it safe to use it? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple f...

2018-04-27 Thread superbobry
GitHub user superbobry opened a pull request:

https://github.com/apache/spark/pull/21180

[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default ``__reduce__`` implementation,
or not.

The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.

Note that the namedtuple might be defined locally but still be picklable
without the hack applied.

def define():
global Foo
Foo = namedtuple("Foo", [])

The current implementation does not cover such cases and will apply
the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by ``_hijack_namedtuple``,
all hacked namedtuples had "collections" as ``__module__``. This
behaviour is no longer the case.

SerializationTestCase and RDDTests.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark 
hijack-non-importable-namedtuple

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21180.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21180


commit 37b0f6d14fcd48b9bd05b6f43b5cfb6284200367
Author: Sergei Lebedev 
Date:   2018-04-27T14:34:12Z

[SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

Prior to this PR ``_hack_namedtuple`` was applied to all namedtuples,
regardless if they were defined on the top level of some module, and
are therefore picklable using the default ``__reduce__`` implementation,
or not.

The PR ensures that only the non-picklable namedtuples are hacked, i.e.
the ones defined in the REPL or locally in a function or method.

Note that the namedtuple might be defined locally but still be picklable
without the hack applied.

def define():
global Foo
Foo = namedtuple("Foo", [])

The current implementation do not cover such cases, and will apply
the hack anyway.

Sidenote: the PR also fixes the module name of the hacked namedtuples.
Due to an extra layer of indirection added by ``_hijack_namedtuple``,
all hacked namedtuples had "collections" as ``__module__``. This
behaviour is no longer the case.

SerializationTestCase and RDDTests.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org