[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16536


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-31 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98804548
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+# It is possible that concurrent access to a newly created UDF
+# will initialize multiple UserDefinedPythonFunctions.
+# This is unlikely, doesn't affect correctness,
+# and should have a minimal performance impact.
+if self._judf_placeholder is None:
+self._judf_placeholder = self._create_judf()
+return self._judf_placeholder
+
+def _create_judf(self):
 from pyspark.sql import SparkSession
+
 sc = SparkContext.getOrCreate()
-wrapped_func = _wrap_function(sc, self.func, self.returnType)
 spark = SparkSession.builder.getOrCreate()
+
+wrapped_func = _wrap_function(sc, self.func, self.returnType)
 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
-if name is None:
-f = self.func
-name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
-name, wrapped_func, jdt)
+self._name, wrapped_func, jdt)
 return judf
 
 def __call__(self, *cols):
-sc = SparkContext._active_spark_context
+sc = SparkContext.getOrCreate()
--- End diff --

Though I am not sure if it really matters here. If `_instantiatedContext` 
is not None we'll do the same thing, otherwise we fall back to initialization 
in `_judf`.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-31 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98802783
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+# It is possible that concurrent access to a newly created UDF
+# will initialize multiple UserDefinedPythonFunctions.
+# This is unlikely, doesn't affect correctness,
+# and should have a minimal performance impact.
+if self._judf_placeholder is None:
+self._judf_placeholder = self._create_judf()
+return self._judf_placeholder
+
+def _create_judf(self):
 from pyspark.sql import SparkSession
+
 sc = SparkContext.getOrCreate()
-wrapped_func = _wrap_function(sc, self.func, self.returnType)
 spark = SparkSession.builder.getOrCreate()
+
+wrapped_func = _wrap_function(sc, self.func, self.returnType)
 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
-if name is None:
-f = self.func
-name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
-name, wrapped_func, jdt)
+self._name, wrapped_func, jdt)
 return judf
 
 def __call__(self, *cols):
-sc = SparkContext._active_spark_context
+sc = SparkContext.getOrCreate()
--- End diff --

@holdenk Sounds reasonable.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-31 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98801781
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+# It is possible that concurrent access to a newly created UDF
+# will initialize multiple UserDefinedPythonFunctions.
+# This is unlikely, doesn't affect correctness,
+# and should have a minimal performance impact.
+if self._judf_placeholder is None:
+self._judf_placeholder = self._create_judf()
+return self._judf_placeholder
+
+def _create_judf(self):
 from pyspark.sql import SparkSession
+
 sc = SparkContext.getOrCreate()
-wrapped_func = _wrap_function(sc, self.func, self.returnType)
 spark = SparkSession.builder.getOrCreate()
+
+wrapped_func = _wrap_function(sc, self.func, self.returnType)
 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
-if name is None:
-f = self.func
-name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
-name, wrapped_func, jdt)
+self._name, wrapped_func, jdt)
 return judf
 
 def __call__(self, *cols):
-sc = SparkContext._active_spark_context
+sc = SparkContext.getOrCreate()
--- End diff --

So by switching this to `getOrCreate` we put a lock acquisition in the path 
of `__call__`, which is maybe not ideal. We could fix this by reading `_judf` 
into a local first (e.g. `judf = self._judf`). (It should be a mostly 
uncontended lock, so it shouldn't be that bad, but if we ended up with a 
multi-threaded PySpark DataFrame UDF application this could degrade a 
little bit).
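A minimal sketch of the pattern being suggested (the `Context`/`UDF` classes below are stand-ins for illustration, not the real PySpark `SparkContext`/`UserDefinedFunction`): read the lazily initialized attribute into a local once, so the lock inside `getOrCreate` is taken at most on the first call rather than on every `__call__`.

```python
import threading

class Context:
    """Stand-in for a getOrCreate-style singleton guarded by a lock."""
    _active = None
    _lock = threading.RLock()

    @classmethod
    def get_or_create(cls):
        with cls._lock:          # lock acquired on every call
            if cls._active is None:
                cls._active = cls()
            return cls._active

class UDF:
    """Stand-in for a UDF with a lazily created Java-side counterpart."""
    def __init__(self, func):
        self.func = func
        self._judf_placeholder = None

    @property
    def _judf(self):
        # No lock here: worst case, two threads both build a value,
        # which is wasteful but harmless (last write wins).
        if self._judf_placeholder is None:
            self._judf_placeholder = ("judf", Context.get_or_create())
        return self._judf_placeholder

    def __call__(self, x):
        judf = self._judf        # single read; lock path hit at most once
        return (judf, self.func(x))

u = UDF(lambda x: x * 2)
first = u(3)
second = u(4)
assert first[0] is second[0]     # placeholder reused after the first call
```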





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-30 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98543788
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -1947,6 +1968,29 @@ def test_sparksession_with_stopped_sparkcontext(self):
 df.collect()
 
 
+class UDFInitializationTests(unittest.TestCase):
--- End diff --

And added a separate test case checking `SparkContext` and `SparkSession` 
state.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-30 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98543649
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
--- End diff --

I'll keep these tests, to make sure that `_judf` is initialized when necessary.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-30 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98543154
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
--- End diff --

@holdenk Separate test case it is. As long as the implementation is correct, 
the overhead is negligible.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-30 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98542899
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

@holdenk Done.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98093311
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
 "judf should be initialized after UDF has been called."
 )
 
+@unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
--- End diff --

```python
class UDFInitializationTestCase(unittest.TestCase):
def tearDown(self):
if SparkSession._instantiatedSession is not None:
SparkSession._instantiatedSession.stop()

if SparkContext._active_spark_context is not None:
SparkContext._active_spark_context.stop()

def test_udf_context_access(self):
from pyspark.sql.functions import UserDefinedFunction
f = UserDefinedFunction(lambda x: x, StringType())
self.assertIsNone(SparkContext._active_spark_context)
self.assertIsNone(SparkSession._instantiatedSession)

```






[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98091656
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
 "judf should be initialized after UDF has been called."
 )
 
+@unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
--- End diff --

@holdenk Do you mean something like `SparkContext._active_spark_context is 
None`? Then we need to make sure it is torn down if it was initialized after 
all, right? Isn't mocking cleaner?





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98090756
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
 "judf should be initialized after UDF has been called."
 )
 
+@unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
--- End diff --

You can create a test case without the Spark base class and verify that 
creating a UDF doesn't create a SparkContext. This does not require making a 
SparkContext.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98084776
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
 "judf should be initialized after UDF has been called."
 )
 
+@unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
--- End diff --

@holdenk I believe that for a full test `udf` would have to create a 
`SparkContext`. But a mock is cheap.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98077397
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
--- End diff --

I think a separate test case would be able to be pretty lightweight, since it 
doesn't need to create a SparkContext or anything that traditionally takes 
longer to set up. What do you think?





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98077102
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

Ok, I'd maybe just leave a comment saying that we've left out the lock 
since double creation is both unlikely and OK.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98055966
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

I stress tested this a bit and haven't found any abnormalities, but I found a 
small problem with `__call__` on the way. Fixed now.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98049505
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

Yeah, I don't think it should require a lock. I think concurrent calls are 
very unlikely and safe.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98048118
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

@rdblue I get this part, and this is a possible scenario. The question is 
whether this justifies a preventive lock. As far as I am aware there should be 
no correctness issues here. `SparkSession` already locks during 
initialization, so we are safe there.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r98043331
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

I think @holdenk's concern is that this would allow concurrent calls to 
`_create_judf`. That would create two `UserDefinedPythonFunction` objects, but 
I don't see anything on the Scala side that is concerning about that.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97951125
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
--- End diff --

I thought about it, but I have this impression, maybe incorrect, that we 
avoid creating new contexts to keep total execution time manageable. If you 
think this justifies a separate `TestCase` I am more than fine with that 
([SPARK-19224](https://issues.apache.org/jira/browse/SPARK-19224) and 
[[PYSPARK] Python tests organization 
](http://apache-spark-developers-list.1001551.n3.nabble.com/PYSPARK-Python-tests-organization-td20542.html),
 right?).

If not, we could mock this and put an assert on the number of calls.
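A hedged sketch of that mocking approach: patch the expensive factory and assert it is called exactly once however many times the lazy property is read. `LazyUDF` below is a stand-in for illustration, not the real `UserDefinedFunction`, and `_create_judf` here just returns a plain object in place of the Py4J-backed one.

```python
from unittest import mock

class LazyUDF:
    """Stand-in for a UDF whose Java counterpart is created lazily."""
    def __init__(self, func):
        self.func = func
        self._judf_placeholder = None

    def _create_judf(self):
        # Stands in for building the Py4J-backed UserDefinedPythonFunction.
        return object()

    @property
    def _judf(self):
        if self._judf_placeholder is None:
            self._judf_placeholder = self._create_judf()
        return self._judf_placeholder

udf = LazyUDF(lambda x: x)
with mock.patch.object(udf, "_create_judf", wraps=udf._create_judf) as m:
    assert m.call_count == 0     # nothing created at construction time
    a = udf._judf
    b = udf._judf
    assert m.call_count == 1     # factory ran once despite two accesses
    assert a is b                # second access returned the cached value
```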





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-26 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97951062
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

Could you elaborate a bit? I am not sure I understand the issue. 

Assignment is atomic (so we don't have to worry about corruption), for any 
practical purpose the operation is idempotent (we can return expressions using 
different Java objects, but as far as I am concerned this is just an 
implementation detail), access to Py4J is thread safe, and as far as I 
remember function registries are synchronized. Is there any issue I missed here?

Thanks for looking into this.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-25 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97910321
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
+self._name = name or (
+func.__name__ if hasattr(func, '__name__')
+else func.__class__.__name__)
+
+@property
+def _judf(self):
+if self._judf_placeholder is None:
--- End diff --

So there isn't any lock around this - I suspect we aren't too likely to 
have concurrent calls to this - but just to be safe we should maybe think 
through what would happen if this does happen (and then leave a comment 
about it)?
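One way to think it through: a quick stress sketch of the unlocked lazy-init pattern under discussion (`Lazy` is a hypothetical stand-in, not PySpark code). Many threads race to read the property; duplicate objects may be constructed, but every reader gets a usable value and the cached attribute settles on exactly one of the constructed objects.

```python
import threading

created = []   # records every construction, so duplicates are observable

class Lazy:
    def __init__(self):
        self._placeholder = None

    @property
    def value(self):
        if self._placeholder is None:
            obj = object()
            created.append(obj)        # count this construction
            self._placeholder = obj    # plain attribute assignment is atomic in CPython
        return self._placeholder

lazy = Lazy()
results = []
threads = [threading.Thread(target=lambda: results.append(lazy.value))
           for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(results) == 32                    # every thread got a value
assert all(r in created for r in results)    # each value was really constructed
assert lazy._placeholder in created          # cache settled on one of them
```

The worst outcome is a few throwaway objects, which matches the "unlikely, doesn't affect correctness, minimal performance impact" comment in the diff.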





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-25 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97908047
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
+from pyspark.sql.functions import UserDefinedFunction
+
+f = UserDefinedFunction(lambda x: x, StringType())
+
+self.assertIsNone(
+f._judf_placeholder,
+"judf should not be initialized before the first call."
+)
+
+self.assertTrue(isinstance(f("foo"), Column), "UDF call should return a Column.")
--- End diff --

There is an `assertIsInstance` function that could simplify this.





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-25 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97907785
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -459,6 +459,23 @@ def filename(path):
 row2 = df2.select(sameText(df2['file'])).first()
 self.assertTrue(row2[0].find("people.json") != -1)
 
+def test_udf_defers_judf_initalization(self):
--- End diff --

This seems like a good test but maybe a bit too focused on testing the 
implementation specifics?

Maybe it might make more sense to also have a test which verifies that 
creating a UDF doesn't create a SparkSession, since that is the intended 
purpose (we don't really care about delaying the initialization of `_judf` 
that much per se, but we do care about verifying that we don't eagerly create 
the SparkSession on import). What do you think?





[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

2017-01-25 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16536#discussion_r97907507
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
 def __init__(self, func, returnType, name=None):
 self.func = func
 self.returnType = returnType
-self._judf = self._create_judf(name)
-
-def _create_judf(self, name):
+self._judf_placeholder = None
--- End diff --

Maybe add a comment explaining the purpose of this, just for future 
readers of the code.

