[GitHub] spark pull request #20355: SPARK-23148: [SQL] Allow pathnames with special c...

2018-01-23 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20355#discussion_r163422934
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -68,13 +68,16 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
   }
 
   allFileBasedDataSources.foreach { format =>
-    test(s"SPARK-22146 read files containing special characters using $format") {
-      val nameWithSpecialChars = s"sp%chars"
-      withTempDir { dir =>
-        val tmpFile = s"$dir/$nameWithSpecialChars"
-        spark.createDataset(Seq("a", "b")).write.format(format).save(tmpFile)
-        val fileContent = spark.read.format(format).load(tmpFile)
-        checkAnswer(fileContent, Seq(Row("a"), Row("b")))
+    test(s"SPARK-22146 / SPARK-23148 read files containing special characters using $format") {
+      val nameWithSpecialChars = s"sp%c hars"
+      Seq(true, false).foreach { multiline =>
--- End diff --

Sounds good to me.
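The distinction matters because a `%` in a file name can be misread as a URI escape when paths round-trip through URI handling. A minimal, Spark-independent illustration of that failure mode in Python (not the Scala code under test above):

```python
from urllib.parse import unquote

# A literal "%63" in a file name decodes to "c" if the path is naively
# treated as URI-encoded, silently changing which file gets accessed.
name = "sp%63hars"
decoded = unquote(name)

# A malformed escape such as "%c " is left untouched, so the corruption
# only strikes for some names, which makes these bugs easy to miss.
odd_name = "sp%c hars"
odd_decoded = unquote(odd_name)
```

This is why the test exercises both a `%` and a space in the same name.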


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20365
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86546/
Test PASSed.


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20365
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20365
  
**[Test build #86546 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86546/testReport)** for PR 20365 at commit [`7209792`](https://github.com/apache/spark/commit/72097921f33492160a2784e108d2eb61fa543672).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20368


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20368
  
Thanks! Merged to master/2.3


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421661
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -1087,13 +1038,6 @@ def _find_module(mod_name):
     file.close()
     return path, description
 
-def _load_namedtuple(name, fields):
-    """
-    Loads a class generated by namedtuple
-    """
-    from collections import namedtuple
-    return namedtuple(name, fields)
-
--- End diff --

This didn't seem necessary anymore after the fix for namedtuples


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421434
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -1019,18 +948,40 @@ def __reduce__(cls):
 return cls.__name__
 
 
-def _fill_function(func, globals, defaults, dict, module, closure_values):
-    """ Fills in the rest of function data into the skeleton function object
-    that were created via _make_skel_func().
+def _fill_function(*args):
+    """Fills in the rest of function data into the skeleton function object
+
+    The skeleton itself is created by _make_skel_func().
--- End diff --

Restore compatibility with functions pickled with 0.4.0 (#128) 

https://github.com/cloudpipe/cloudpickle/commit/7d8c670b703a683d6fd7e642c6bec8a487594d20


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421517
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -1019,18 +948,40 @@ def __reduce__(cls):
         return cls.__name__
 
 
-def _fill_function(func, globals, defaults, dict, module, closure_values):
-    """ Fills in the rest of function data into the skeleton function object
-    that were created via _make_skel_func().
+def _fill_function(*args):
+    """Fills in the rest of function data into the skeleton function object
+
+    The skeleton itself is created by _make_skel_func().
     """
-    func.__globals__.update(globals)
-    func.__defaults__ = defaults
-    func.__dict__ = dict
-    func.__module__ = module
+    if len(args) == 2:
+        func = args[0]
+        state = args[1]
+    elif len(args) == 5:
+        # Backwards compat for cloudpickle v0.4.0, after which the `module`
+        # argument was introduced
+        func = args[0]
+        keys = ['globals', 'defaults', 'dict', 'closure_values']
+        state = dict(zip(keys, args[1:]))
+    elif len(args) == 6:
+        # Backwards compat for cloudpickle v0.4.1, after which the function
+        # state was passed as a dict to the _fill_function itself.
+        func = args[0]
+        keys = ['globals', 'defaults', 'dict', 'module', 'closure_values']
+        state = dict(zip(keys, args[1:]))
+    else:
+        raise ValueError('Unexpected _fill_value arguments: %r' % (args,))
+
+    func.__globals__.update(state['globals'])
+    func.__defaults__ = state['defaults']
+    func.__dict__ = state['dict']
+    if 'module' in state:
+        func.__module__ = state['module']
+    if 'qualname' in state:
+        func.__qualname__ = state['qualname']
--- End diff --

Preserve func.__qualname__ when defined 
https://github.com/cloudpipe/cloudpickle/commit/14b38a3ab5970d96cce1492c790494932285f845
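The arity-based dispatch reviewed here can be exercised in isolation. The sketch below (a hypothetical `fill_function`, not the PySpark code) shows how pre-0.4.2 positional payloads normalize into the single `state` dict before being applied:

```python
def fill_function(*args):
    # Normalize old-style positional payloads into the new single-dict
    # `state` form, keyed on how many arguments arrived on the wire.
    if len(args) == 2:
        func, state = args
    elif len(args) == 5:
        # v0.4.0-style payload: no `module` entry yet
        func = args[0]
        keys = ['globals', 'defaults', 'dict', 'closure_values']
        state = dict(zip(keys, args[1:]))
    elif len(args) == 6:
        # v0.4.1-style payload: `module` passed positionally
        func = args[0]
        keys = ['globals', 'defaults', 'dict', 'module', 'closure_values']
        state = dict(zip(keys, args[1:]))
    else:
        raise ValueError('Unexpected arguments: %r' % (args,))
    return func, state
```

Because the dispatch keys on `len(args)` alone, a new pickler can still unpickle function payloads produced by two older wire formats.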


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421285
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -913,11 +841,12 @@ def dump(obj, file, protocol=2):
 
 def dumps(obj, protocol=2):
     file = StringIO()
-
-    cp = CloudPickler(file,protocol)
-    cp.dump(obj)
-
-    return file.getvalue()
+    try:
+        cp = CloudPickler(file,protocol)
+        cp.dump(obj)
+        return file.getvalue()
+    finally:
+        file.close()
--- End diff --

Close StringIO timely on exception 
https://github.com/cloudpipe/cloudpickle/commit/ca4661b3a20b635f4c240ef763f5759267d74cb9
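The pattern in this hunk — release the buffer even when pickling fails — can be sketched against the stdlib pickler (illustrative only; `dumps_closing` is not part of cloudpickle):

```python
import pickle
from io import BytesIO

def dumps_closing(obj, protocol=2):
    # Close the buffer even when dump() raises, mirroring the reviewed
    # change; otherwise the buffer lingers until garbage collection.
    buf = BytesIO()
    try:
        pickle.Pickler(buf, protocol).dump(obj)
        return buf.getvalue()
    finally:
        buf.close()
```

The `finally` matters on runtimes without prompt refcounting (e.g. PyPy), where an unclosed buffer for a failed, very large dump could otherwise hold memory for a long time.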


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421163
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -867,23 +797,21 @@ def save_not_implemented(self, obj):
     dispatch[type(Ellipsis)] = save_ellipsis
     dispatch[type(NotImplemented)] = save_not_implemented
 
-    # WeakSet was added in 2.7.
-    if hasattr(weakref, 'WeakSet'):
-        def save_weakset(self, obj):
-            self.save_reduce(weakref.WeakSet, (list(obj),))
-
-        dispatch[weakref.WeakSet] = save_weakset
+    def save_weakset(self, obj):
+        self.save_reduce(weakref.WeakSet, (list(obj),))
 
-    """Special functions for Add-on libraries"""
-    def inject_addons(self):
-        """Plug in system. Register additional pickling functions if modules already loaded"""
-        pass
+    dispatch[weakref.WeakSet] = save_weakset
 
     def save_logger(self, obj):
         self.save_reduce(logging.getLogger, (obj.name,), obj=obj)
 
     dispatch[logging.Logger] = save_logger
 
+    """Special functions for Add-on libraries"""
+    def inject_addons(self):
+        """Plug in system. Register additional pickling functions if modules already loaded"""
+        pass
+
--- End diff --

Further cleanups 
https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163421005
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -754,64 +742,6 @@ def __getattribute__(self, item):
 if type(operator.attrgetter) is type:
 dispatch[operator.attrgetter] = save_attrgetter
 
-    def save_reduce(self, func, args, state=None,
-                    listitems=None, dictitems=None, obj=None):
-        # Assert that args is a tuple or None
-        if not isinstance(args, tuple):
-            raise pickle.PicklingError("args from reduce() should be a tuple")
-
-        # Assert that func is callable
-        if not hasattr(func, '__call__'):
-            raise pickle.PicklingError("func from reduce should be callable")
-
-        save = self.save
-        write = self.write
-
-        # Protocol 2 special case: if func's name is __newobj__, use NEWOBJ
-        if self.proto >= 2 and getattr(func, "__name__", "") == "__newobj__":
-            cls = args[0]
-            if not hasattr(cls, "__new__"):
-                raise pickle.PicklingError(
-                    "args[0] from __newobj__ args has no __new__")
-            if obj is not None and cls is not obj.__class__:
-                raise pickle.PicklingError(
-                    "args[0] from __newobj__ args has the wrong class")
-            args = args[1:]
-            save(cls)
-
-            save(args)
-            write(pickle.NEWOBJ)
-        else:
-            save(func)
-            save(args)
-            write(pickle.REDUCE)
-
-        if obj is not None:
-            self.memoize(obj)
-
-        # More new special cases (that work with older protocols as
-        # well): when __reduce__ returns a tuple with 4 or 5 items,
-        # the 4th and 5th item should be iterators that provide list
-        # items and dict items (as (key, value) tuples), or None.
-
-        if listitems is not None:
-            self._batch_appends(listitems)
-
-        if dictitems is not None:
-            self._batch_setitems(dictitems)
-
-        if state is not None:
-            save(state)
-            write(pickle.BUILD)
-
-    def save_partial(self, obj):
-        """Partial objects do not serialize correctly in python2.x -- this fixes the bugs"""
-        self.save_reduce(_genpartial, (obj.func, obj.args, obj.keywords))
-
-    if sys.version_info < (2,7):  # 2.7 supports partial pickling
-        dispatch[partial] = save_partial
-
-
--- End diff --

Remove the save_reduce() override: it is exactly the same code as in Python 2's Pickler class.

https://github.com/cloudpipe/cloudpickle/commit/2da4c243ceddebbc2febf116eb6e53035fed9b9a


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20371
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163420818
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -709,12 +702,7 @@ def save_property(self, obj):
     dispatch[property] = save_property
 
     def save_classmethod(self, obj):
-        try:
-            orig_func = obj.__func__
-        except AttributeError:  # Python 2.6
--- End diff --

Support for Python 2.6 has been removed.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20371
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86550/
Test PASSed.


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20368
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20368
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86545/
Test PASSed.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #86550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86550/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163420703
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -608,37 +620,18 @@ def save_global(self, obj, name=None, pack=struct.pack):
         The name of this method is somewhat misleading: all types get
         dispatched here.
         """
-        if obj.__module__ == "__builtin__" or obj.__module__ == "builtins":
-            if obj in _BUILTIN_TYPE_NAMES:
-                return self.save_reduce(_builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj)
-
-        if name is None:
-            name = obj.__name__
-
-        modname = getattr(obj, "__module__", None)
-        if modname is None:
-            try:
-                # whichmodule() could fail, see
-                # https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
-                modname = pickle.whichmodule(obj, name)
-            except Exception:
-                modname = '__main__'
-
-        if modname == '__main__':
-            themodule = None
-        else:
-            __import__(modname)
-            themodule = sys.modules[modname]
-            self.modules.add(themodule)
+        try:
+            return Pickler.save_global(self, obj, name=name)
+        except Exception:
+            if obj.__module__ == "__builtin__" or obj.__module__ == "builtins":
+                if obj in _BUILTIN_TYPE_NAMES:
+                    return self.save_reduce(_builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj)
--- End diff --

Some cleanups, fix memoryview support 
https://github.com/cloudpipe/cloudpickle/commit/f8187e90aed7e1b96ffaae85cdf4b37108c75d3f


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20368
  
**[Test build #86545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86545/testReport)** for PR 20368 at commit [`21e5321`](https://github.com/apache/spark/commit/21e5321d072c312e243407af08eeb9c1a796ab4d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4075/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4073 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4073/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4072/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4074 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4074/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163420474
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -522,17 +529,22 @@ def save_function_tuple(self, func):
         self.memoize(func)
 
         # save the rest of the func data needed by _fill_function
-        save(f_globals)
-        save(defaults)
-        save(dct)
-        save(func.__module__)
-        save(closure_values)
+        state = {
+            'globals': f_globals,
+            'defaults': defaults,
+            'dict': dct,
+            'module': func.__module__,
+            'closure_values': closure_values,
+        }
+        if hasattr(func, '__qualname__'):
+            state['qualname'] = func.__qualname__
+        save(state)
--- End diff --

Preserve func.__qualname__ when defined 
https://github.com/cloudpipe/cloudpickle/commit/14b38a3ab5970d96cce1492c790494932285f845


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4076 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4076/testReport)** for PR 20371 at commit [`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163420127
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -420,20 +440,18 @@ def save_dynamic_class(self, obj):
         from global modules.
         """
         clsdict = dict(obj.__dict__)  # copy dict proxy to a dict
-        if not isinstance(clsdict.get('__dict__', None), property):
-            # don't extract dict that are properties
-            clsdict.pop('__dict__', None)
-            clsdict.pop('__weakref__', None)
-
-        # hack as __new__ is stored differently in the __dict__
-        new_override = clsdict.get('__new__', None)
-        if new_override:
-            clsdict['__new__'] = obj.__new__
-
-        # namedtuple is a special case for Spark where we use the _load_namedtuple function
-        if getattr(obj, '_is_namedtuple_', False):
-            self.save_reduce(_load_namedtuple, (obj.__name__, obj._fields))
-            return
+        clsdict.pop('__weakref__', None)
+
+        # On PyPy, __doc__ is a readonly attribute, so we need to include it in
+        # the initial skeleton class.  This is safe because we know that the
+        # doc can't participate in a cycle with the original class.
+        type_kwargs = {'__doc__': clsdict.pop('__doc__', None)}
+
+        # If type overrides __dict__ as a property, include it in the type kwargs.
+        # In Python 2, we can't set this attribute after construction.
+        __dict__ = clsdict.pop('__dict__', None)
+        if isinstance(__dict__, property):
+            type_kwargs['__dict__'] = __dict__
--- End diff --

BUG: Fix bug pickling namedtuple 
https://github.com/cloudpipe/cloudpickle/commit/28070bba79cf71e5719ab8d7c1d6cbc72cd95a0c
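The skeleton-class idea can be sketched as a standalone helper. The hypothetical `clone_class` below omits cloudpickle's memoization and closure handling, but shows the same moves: copy the class dict, drop slots pickle cannot restore, and pass `__doc__` in at construction time (it is read-only after the fact on PyPy):

```python
def clone_class(cls):
    # Copy the mappingproxy into a plain dict we can mutate.
    clsdict = dict(cls.__dict__)
    clsdict.pop('__weakref__', None)
    # Keep a __dict__ override only when it is a property, as in the diff;
    # the default getset descriptor is recreated automatically by type().
    if not isinstance(clsdict.get('__dict__'), property):
        clsdict.pop('__dict__', None)
    # __doc__ goes into the creation dict rather than being assigned later.
    doc = clsdict.pop('__doc__', None)
    return type(cls.__name__, cls.__bases__, dict(clsdict, __doc__=doc))
```

With the generic rebuild fixed up this way, namedtuple subclasses no longer need the Spark-specific `_load_namedtuple` special case removed elsewhere in this PR.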


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163419942
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -318,6 +329,18 @@ def save_function(self, obj, name=None):
         Determines what kind of function obj is (e.g. lambda, defined at
         interactive prompt, etc) and handles the pickling appropriately.
         """
+        if obj in _BUILTIN_TYPE_CONSTRUCTORS:
+            # We keep a special-cased cache of built-in type constructors at
+            # global scope, because these functions are structured very
+            # differently in different python versions and implementations (for
+            # example, they're instances of types.BuiltinFunctionType in
+            # CPython, but they're ordinary types.FunctionType instances in
+            # PyPy).
+            #
+            # If the function we've received is in that cache, we just
+            # serialize it as a lookup into the cache.
+            return self.save_reduce(_BUILTIN_TYPE_CONSTRUCTORS[obj], (), obj=obj)
+
--- End diff --

BUG: Hit the builtin type cache for any function 
https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163419648
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -237,28 +262,14 @@ def dump(self, obj):
             if 'recursion' in e.args[0]:
                 msg = """Could not pickle object as excessively deep recursion required."""
                 raise pickle.PicklingError(msg)
-        except pickle.PickleError:
-            raise
-        except Exception as e:
-            emsg = _exception_message(e)
-            if "'i' format requires" in emsg:
-                msg = "Object too large to serialize: %s" % emsg
-            else:
-                msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
-            print_exec(sys.stderr)
-            raise pickle.PicklingError(msg)
-
 
     def save_memoryview(self, obj):
-        """Fallback to save_string"""
-        Pickler.save_string(self, str(obj))
+        self.save(obj.tobytes())
+    dispatch[memoryview] = save_memoryview
--- End diff --

Some cleanups, fix memoryview support 
https://github.com/cloudpipe/cloudpickle/commit/f8187e90aed7e1b96ffaae85cdf4b37108c75d3f
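The motivation for the memoryview change is easy to reproduce with the stdlib pickler (illustrative only):

```python
import pickle

mv = memoryview(b'spark')

# The stdlib pickler refuses memoryview objects outright...
try:
    pickle.dumps(mv)
    direct_ok = True
except TypeError:
    direct_ok = False

# ...so the fix serializes the underlying bytes instead of the old
# fallback of pickling str(obj), which lost the buffer contents.
restored = pickle.loads(pickle.dumps(mv.tobytes()))
```

Round-tripping via `tobytes()` gives back a `bytes` object rather than a `memoryview`, but the data survives intact.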


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163419493
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -237,28 +262,14 @@ def dump(self, obj):
             if 'recursion' in e.args[0]:
                 msg = """Could not pickle object as excessively deep recursion required."""
                 raise pickle.PicklingError(msg)
-        except pickle.PickleError:
-            raise
-        except Exception as e:
-            emsg = _exception_message(e)
-            if "'i' format requires" in emsg:
-                msg = "Object too large to serialize: %s" % emsg
-            else:
-                msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
-            print_exec(sys.stderr)
-            raise pickle.PicklingError(msg)
-
--- End diff --

This exception handling is Spark-specific; it has been moved to `CloudPickleSerializer.dumps` in serializers.py.


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20373#discussion_r163419329
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -181,6 +180,32 @@ def _builtin_type(name):
     return getattr(types, name)
 
 
+def _make__new__factory(type_):
+    def _factory():
+        return type_.__new__
+    return _factory
+
+
+# NOTE: These need to be module globals so that they're pickleable as globals.
+_get_dict_new = _make__new__factory(dict)
+_get_frozenset_new = _make__new__factory(frozenset)
+_get_list_new = _make__new__factory(list)
+_get_set_new = _make__new__factory(set)
+_get_tuple_new = _make__new__factory(tuple)
+_get_object_new = _make__new__factory(object)
+
+# Pre-defined set of builtin_function_or_method instances that can be
+# serialized.
+_BUILTIN_TYPE_CONSTRUCTORS = {
+    dict.__new__: _get_dict_new,
+    frozenset.__new__: _get_frozenset_new,
+    set.__new__: _get_set_new,
+    list.__new__: _get_list_new,
+    tuple.__new__: _get_tuple_new,
+    object.__new__: _get_object_new,
+}
+
+
--- End diff --

MAINT: Handle builtin type __new__ attrs: 
https://github.com/cloudpipe/cloudpickle/commit/f0d2011f9fc88105c174b7c861f2c2f56e870350
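The factory trick can be demonstrated standalone (names below are illustrative, mirroring the hunk above, not imports from cloudpickle):

```python
def make_new_factory(type_):
    # A module-level factory is itself picklable by reference; calling it
    # on the receiving side re-fetches the unpicklable builtin __new__.
    def factory():
        return type_.__new__
    return factory

get_dict_new = make_new_factory(dict)
get_tuple_new = make_new_factory(tuple)

# builtin_function_or_method objects are hashable, so they can key a
# lookup table mapping each __new__ slot to its picklable factory.
BUILTIN_TYPE_CONSTRUCTORS = {
    dict.__new__: get_dict_new,
    tuple.__new__: get_tuple_new,
}
```

At pickling time, hitting the table replaces the raw `__new__` object with a reference to the factory, which the unpickler calls to recover an equivalent constructor on any interpreter.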


---




[GitHub] spark issue #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20373
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20373
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/161/
Test PASSed.


---




[GitHub] spark issue #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20373
  
**[Test build #86553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86553/testReport)** for PR 20373 at commit [`c362df8`](https://github.com/apache/spark/commit/c362df87e2d5a5f55d2e4f7d48e24b2d7cfda6f7).


---




[GitHub] spark pull request #20373: [WIP][SPARK-23159][PYTHON] Update cloudpickle to ...

2018-01-23 Thread BryanCutler
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/20373

[WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2

## What changes were proposed in this pull request?

The version of cloudpickle in PySpark was close to version 0.4.0 with some additional backported fixes and some minor additions for Spark related things. With version 0.4.2 we can remove Spark related changes and make the version in Spark exactly match 0.4.2 at https://github.com/cloudpipe/cloudpickle/tree/v0.4.2

Changes by updating to 0.4.2 include:

* Fix pickling of named tuples https://github.com/cloudpipe/cloudpickle/pull/113
* Built in type constructors for PyPy compatibility [here](https://github.com/cloudpipe/cloudpickle/commit/d84980ccaafc7982a50d4e04064011f401f17d1b)
* Fix memoryview support https://github.com/cloudpipe/cloudpickle/pull/122
* Improved compatibility with other cloudpickle versions https://github.com/cloudpipe/cloudpickle/pull/128
* Several cleanups https://github.com/cloudpipe/cloudpickle/pull/121 and [here](https://github.com/cloudpipe/cloudpickle/commit/c91aaf110441991307f5097f950764079d0f9652)

## How was this patch tested?

Existing pyspark.tests using python 2.7.14 and 3.5.2

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark pyspark-update-cloudpickle-42-SPARK-23159

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20373


commit 89f13b857dba53754f6813efae2d0ca4540c48f4
Author: Bryan Cutler 
Date:   2018-01-23T23:25:29Z

updated cloudpickle to match 0.4.2

commit c362df87e2d5a5f55d2e4f7d48e24b2d7cfda6f7
Author: Bryan Cutler 
Date:   2018-01-23T23:55:25Z

removed unused import




---




[GitHub] spark pull request #20371: [SPARK-23197][DStreams] Increased timeouts to res...

2018-01-23 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20371#discussion_r163415477
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/ReceiverSuite.scala ---
@@ -105,13 +105,13 @@ class ReceiverSuite extends TestSuiteBase with 
TimeLimits with Serializable {
 assert(executor.errors.head.eq(exception))
 
 // Verify restarting actually stops and starts the receiver
-receiver.restart("restarting", null, 100)
-eventually(timeout(50 millis), interval(10 millis)) {
+receiver.restart("restarting", null, 600)
--- End diff --

yeah. lets fix one at a time.
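For context, the ScalaTest `eventually(timeout(...), interval(...))` pattern being tuned in this diff retries an assertion until it passes or the timeout elapses; a hypothetical Python equivalent of the mechanism (not Spark code) is:

```python
import time

def eventually(assertion, timeout=0.3, interval=0.01):
    # Retry until the assertion stops raising or the timeout elapses,
    # mirroring ScalaTest's eventually(timeout(...), interval(...)).
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# A condition that only holds after a couple of retries still passes,
# which is why too small a timeout makes such tests flaky.
attempts = []
def becomes_true():
    attempts.append(1)
    assert len(attempts) >= 3
eventually(becomes_true)
assert len(attempts) == 3
```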


---




[GitHub] spark pull request #20355: SPARK-23148: [SQL] Allow pathnames with special c...

2018-01-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20355#discussion_r163414704
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -68,13 +68,16 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
   }
 
   allFileBasedDataSources.foreach { format =>
-test(s"SPARK-22146 read files containing special characters using 
$format") {
-  val nameWithSpecialChars = s"sp%chars"
-  withTempDir { dir =>
-val tmpFile = s"$dir/$nameWithSpecialChars"
-spark.createDataset(Seq("a", 
"b")).write.format(format).save(tmpFile)
-val fileContent = spark.read.format(format).load(tmpFile)
-checkAnswer(fileContent, Seq(Row("a"), Row("b")))
+test(s"SPARK-22146 / SPARK-23148 read files containing special 
characters using $format") {
+  val nameWithSpecialChars = s"sp%c hars"
+  Seq(true, false).foreach { multiline =>
--- End diff --

Less duplication is fine, but this case is slightly confusing, since it 
suggests that orc and parquet support multiline, and it runs duplicated tests 
as you pointed out, if I should nitpick. I think I prefer a separate test.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread sameeragarwal
Github user sameeragarwal commented on the issue:

https://github.com/apache/spark/pull/20371
  
LGTM, pending jenkins. Thanks!


---




[GitHub] spark pull request #20371: [SPARK-23197][DStreams] Increased timeouts to res...

2018-01-23 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20371#discussion_r163412164
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/ReceiverSuite.scala ---
@@ -105,13 +105,13 @@ class ReceiverSuite extends TestSuiteBase with 
TimeLimits with Serializable {
 assert(executor.errors.head.eq(exception))
 
 // Verify restarting actually stops and starts the receiver
-receiver.restart("restarting", null, 100)
-eventually(timeout(50 millis), interval(10 millis)) {
+receiver.restart("restarting", null, 600)
--- End diff --

If these timeout bumps fix the flakiness, we should also consider enabling 
the "block generator throttling" test below (it was disabled via 
https://github.com/apache/spark/commit/b69c4f9b2e8544f1b178db2aefbcaa166f76cb7a)


---




[GitHub] spark issue #20355: SPARK-23148: [SQL] Allow pathnames with special characte...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20355
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20355: SPARK-23148: [SQL] Allow pathnames with special characte...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20355
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/160/
Test PASSed.


---




[GitHub] spark pull request #20355: SPARK-23148: [SQL] Allow pathnames with special c...

2018-01-23 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20355#discussion_r163411267
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala
 ---
@@ -172,6 +172,14 @@ class TextSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  test("SPARK-23148: test for spaces in file names") {
--- End diff --

In the end, to reduce code duplication, I made it so that orc and parquet 
run multiline as well (I tried to find a neat way to only run multiline if the 
format was csv, text or json without having a separate test case but it just 
complicated things). Let me know if you'd rather I have two separate test cases 
to avoid running the two redundant cases with orc / parquet.


---




[GitHub] spark issue #20360: [SPARK-23177][SQL][PySpark] Extract zero-parameter UDFs ...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20360
  
**[Test build #86551 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86551/testReport)**
 for PR 20360 at commit 
[`74684a7`](https://github.com/apache/spark/commit/74684a7d10009ef970d7d674d9c695b695c5da5c).


---




[GitHub] spark issue #20355: SPARK-23148: [SQL] Allow pathnames with special characte...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20355
  
**[Test build #86552 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86552/testReport)**
 for PR 20355 at commit 
[`740def4`](https://github.com/apache/spark/commit/740def4c9a96a7dba5a8f57c49042dee661608b6).


---




[GitHub] spark issue #20360: [SPARK-23177][SQL][PySpark] Extract zero-parameter UDFs ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20360
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/159/
Test PASSed.


---




[GitHub] spark issue #20360: [SPARK-23177][SQL][PySpark] Extract zero-parameter UDFs ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20360
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20360: [SPARK-23177][SQL][PySpark] Extract zero-paramete...

2018-01-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20360#discussion_r163410447
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
 ---
@@ -45,7 +45,8 @@ object ExtractPythonUDFFromAggregate extends 
Rule[LogicalPlan] {
 
   private def hasPythonUdfOverAggregate(expr: Expression, agg: Aggregate): 
Boolean = {
 expr.find {
-  e => PythonUDF.isScalarPythonUDF(e) && e.find(belongAggregate(_, 
agg)).isDefined
--- End diff --

Yes. Updated.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20372
  
Can one of the admins verify this patch?


---







[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread glentakahashi
GitHub user glentakahashi opened a pull request:

https://github.com/apache/spark/pull/20372

Improved block merging logic for partitions

## What changes were proposed in this pull request?

Change DataSourceScanExec so that, when grouping blocks together into 
partitions, it also checks the end of the sorted list of splits to fill out 
partitions more efficiently.
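The packing idea can be sketched as follows (a hypothetical simplification in Python, not the actual DataSourceScanExec code; `max_partition_bytes` stands in for Spark's computed maxSplitBytes):

```python
# Sketch of the improved partition-packing idea: after sorting splits by
# size (descending), greedily fill each partition from the front of the
# list, then top it up with the smallest splits from the back.
def pack_splits(sizes, max_partition_bytes):
    sizes = sorted(sizes, reverse=True)
    partitions = []
    lo, hi = 0, len(sizes) - 1
    while lo <= hi:
        current, current_size = [sizes[lo]], sizes[lo]
        lo += 1
        # Pull the smallest remaining splits while they still fit.
        while lo <= hi and current_size + sizes[hi] <= max_partition_bytes:
            current_size += sizes[hi]
            current.append(sizes[hi])
            hi -= 1
        partitions.append(current)
    return partitions

# Example: six splits packed into three partitions, none over the limit.
parts = pack_splits([70, 60, 50, 30, 20, 10], 100)
assert len(parts) == 3
assert all(sum(p) <= 100 for p in parts)
```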

## How was this patch tested?

Updated the old test to reflect the new logic, which causes the number of 
partitions to drop from 4 to 3


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/glentakahashi/spark 
feature/improved-block-merging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20372


commit c575977a5952bf50b605be8079c9be1e30f3bd36
Author: Glen Takahashi 
Date:   2018-01-23T23:22:34Z

Improved block merging logic for partitions




---




[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163404464
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -63,7 +63,7 @@ case class InMemoryRelation(
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
 val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
-statsOfPlanToCache: Statistics = null)
+statsOfPlanToCache: Statistics)
--- End diff --

Setting `null` by default is risky, because we might hit 
`NullPointerException`. 


---




[GitHub] spark pull request #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cac...

2018-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20365


---




[GitHub] spark issue #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20370
  
Hi, @conorbmurphy .
Could you add a test case for your contribution, too?


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20365
  
Thanks! Merged to master/2.3


---




[GitHub] spark issue #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread conorbmurphy
Github user conorbmurphy commented on the issue:

https://github.com/apache/spark/pull/20370
  
Will do!


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20365
  
Since the last change is just to change the test case name, I merge this PR.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20371
  
@sameeragarwal this PR should fix this flakiness.


---




[GitHub] spark issue #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20370
  
@conorbmurphy Could you create a JIRA and follow [the 
instruction](https://spark.apache.org/contributing.html) to make a 
contribution? 


---




[GitHub] spark pull request #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20370#discussion_r163403070
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
 ---
@@ -78,7 +78,8 @@ private[sql] object JDBCRelation extends Logging {
 // Overflow and silliness can happen if you subtract then divide.
 // Here we get a little roundoff, but that's (hopefully) OK.
 val stride: Long = upperBound / numPartitions - lowerBound / 
numPartitions
-val column = partitioning.column
+val dialect = JdbcDialects.get(jdbcOptions.url)
+val column = dialect.quoteIdentifier(partitioning.column)
--- End diff --

We also need to add a test case in `PostgresIntegrationSuite`


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #86550 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86550/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---




[GitHub] spark pull request #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20370#discussion_r163402979
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
 ---
@@ -78,7 +78,8 @@ private[sql] object JDBCRelation extends Logging {
 // Overflow and silliness can happen if you subtract then divide.
 // Here we get a little roundoff, but that's (hopefully) OK.
 val stride: Long = upperBound / numPartitions - lowerBound / 
numPartitions
-val column = partitioning.column
+val dialect = JdbcDialects.get(jdbcOptions.url)
+val column = dialect.quoteIdentifier(partitioning.column)
--- End diff --

We should do it in `class JDBCOptions`. To avoid breaking the behavior, we 
should eat the quotes if users manually specify them. 
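A minimal sketch of the idempotent quoting being suggested here (a hypothetical helper, not the actual JdbcDialects API; real dialects use different quote characters, e.g. backticks for MySQL):

```python
def quote_identifier(name, quote='"'):
    # Strip one layer of user-supplied quotes first ("eat the quotes"),
    # so already-quoted column names are not double-quoted.
    if len(name) >= 2 and name[0] == quote and name[-1] == quote:
        name = name[1:-1]
    # Re-quote, doubling any embedded quote characters.
    return quote + name.replace(quote, quote * 2) + quote

# Mixed-case names survive case-folding databases, and quoting is idempotent.
assert quote_identifier('partitionKey') == '"partitionKey"'
assert quote_identifier('"partitionKey"') == '"partitionKey"'
```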


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4075 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4075/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---




[GitHub] spark issue #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20370
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4076 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4076/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4074 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4074/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20371
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/158/
Test PASSed.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4073 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4073/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20371
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20371: [SPARK-23197][DStreams] Increased timeouts to resolve fl...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20371
  
**[Test build #4072 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4072/testReport)**
 for PR 20371 at commit 
[`2446aa0`](https://github.com/apache/spark/commit/2446aa070efe43a6ab0d8adbecca335e94896a0b).


---







[GitHub] spark pull request #20371: [SPARK-23197][DStreams] Increased timeouts to res...

2018-01-23 Thread tdas
GitHub user tdas opened a pull request:

https://github.com/apache/spark/pull/20371

[SPARK-23197][DStreams] Increased timeouts to resolve flakiness

## What changes were proposed in this pull request?

Increased timeout from 50 ms to 300 ms (50 ms was really too low). 

## How was this patch tested?
Multiple rounds of tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tdas/spark SPARK-23197

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20371.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20371


commit 2446aa070efe43a6ab0d8adbecca335e94896a0b
Author: Tathagata Das 
Date:   2018-01-23T22:49:20Z

increased timeouts




---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20365
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20365
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86544/
Test PASSed.


---




[GitHub] spark pull request #20370: Changing JDBC relation to better process quotes

2018-01-23 Thread conorbmurphy
GitHub user conorbmurphy opened a pull request:

https://github.com/apache/spark/pull/20370

Changing JDBC relation to better process quotes

## What changes were proposed in this pull request?

JDBC writes currently do not properly account for mixed-case column names; 
instead, the user has to quote each column name manually. This change avoids 
that.

## How was this patch tested?

Manual tests and working with @dougbateman and @gatorsmile 

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/conorbmurphy/spark-1 master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20370


commit d2864d06c039cb0c0b0c9d9271c9757309017e1a
Author: conorbmurphy 
Date:   2018-01-23T22:43:32Z

Changing JDBC relation to better process quotes




---




[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163402210
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -63,7 +63,7 @@ case class InMemoryRelation(
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
 val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
-statsOfPlanToCache: Statistics = null)
+statsOfPlanToCache: Statistics)
--- End diff --

leaving out the default value is fine, we do not need any default value actually


---




[GitHub] spark issue #20365: [SPARK-23192] [SQL] Keep the Hint after Using Cached Dat...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20365
  
**[Test build #86544 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86544/testReport)**
 for PR 20365 at commit 
[`1186ef5`](https://github.com/apache/spark/commit/1186ef5e38a34ff77fa62521de0da73666b0de96).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...

2018-01-23 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19864#discussion_r163401858
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -60,7 +62,8 @@ case class InMemoryRelation(
 @transient child: SparkPlan,
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
-val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator)
+val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
+statsOfPlanToCache: Statistics = null)
--- End diff --

eh...we do not have other options, it's more like a placeholder. Since 
InMemoryRelation is created by CacheManager through apply() in the companion 
object, it's no harm here IMHO


---




[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20369
  
**[Test build #86549 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86549/testReport)**
 for PR 20369 at commit 
[`d722bbf`](https://github.com/apache/spark/commit/d722bbf2f253dff0b7da0111b4e75529dc591813).


---




[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20369
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20369: [SPARK-23196] Unify continuous and microbatch V2 ...

2018-01-23 Thread jose-torres
GitHub user jose-torres opened a pull request:

https://github.com/apache/spark/pull/20369

[SPARK-23196] Unify continuous and microbatch V2 sinks

## What changes were proposed in this pull request?

Replace streaming V2 sinks with a unified StreamWriteSupport interface, 
with a shim to use it with microbatch execution.

Add a new SQL config to use for disabling V2 sinks, falling back to the V1 
sink implementation.

## How was this patch tested?

Existing tests, which in the case of Kafka (the only existing continuous V2 
sink) now use V2 for microbatch.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jose-torres/spark streaming-sink

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20369


commit 94c06a5f9a9d88810a43ac66722f58ffa45709f0
Author: Jose Torres 
Date:   2018-01-23T20:47:44Z

change sink

commit ee773f4cc7d6cfbb14b40c2e7961386ea2742612
Author: Jose Torres 
Date:   2018-01-23T21:12:20Z

add config

commit d722bbf2f253dff0b7da0111b4e75529dc591813
Author: Jose Torres 
Date:   2018-01-23T22:07:11Z

fix internal row




---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20285
  
Hi, @mengxr .
Could you resolve the JIRA, too?
- https://issues.apache.org/jira/browse/SPARK-22735

Thanks!


---




[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

2018-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20285


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86548/
Test PASSed.


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20285
  
**[Test build #86548 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)**
 for PR 20285 at commit 
[`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/20285
  
LGTM. Merged into master and branch-2.3. Thanks!


---




[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

2018-01-23 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20285#discussion_r163390328
  
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
 
 
 
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors 
for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from 
its input columns to
+produce size information and metadata for its output column. While in some 
cases this information
+can be obtained by inspecting the contents of the column, in a streaming 
dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user 
to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers 
that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` 
parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata 
for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get 
this size using the
+metadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which 
controls its
+behaviour when the vector column contains nulls or vectors of the wrong 
size. By default
+`handleInvalid` is set to "error", indicating an exception should be 
thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values 
should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should 
be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for 
ensuring that the column
+does not have invalid values, values that don't match the column's 
metadata, or dealing with those
+invalid values downstream.
+
+
+
+
+Refer to the [VectorSizeHint Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
--- End diff --

Screenshot: https://user-images.githubusercontent.com/223219/35302985-523f-0045-11e8-9a21-c4ed795b6e6a.png

I don't think so :), but I think we should leave it to be consistent with 
other examples.
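
The three `handleInvalid` modes described in the quoted doc paragraph can be sketched on plain Scala collections. This is a toy model only, not Spark's implementation: `VectorSizeHintSketch`, `Vec`, and `applyHint` are hypothetical names invented for illustration; real code would apply `org.apache.spark.ml.feature.VectorSizeHint` to a DataFrame column.

```scala
// Toy sketch of VectorSizeHint's handleInvalid semantics -- NOT Spark code.
object VectorSizeHintSketch {
  // None models a null vector cell; Some(xs) models a vector of length xs.length.
  type Vec = Option[Seq[Double]]

  // Keep, drop, or reject rows according to the three documented modes.
  def applyHint(rows: Seq[Vec], size: Int, handleInvalid: String): Seq[Vec] = {
    def valid(v: Vec): Boolean = v.exists(_.length == size)
    handleInvalid match {
      case "error" =>
        // default mode: any null or wrongly sized vector raises an exception
        require(rows.forall(valid), s"found null or wrongly sized vector (expected size $size)")
        rows
      case "skip"       => rows.filter(valid) // filter invalid rows out of the result
      case "optimistic" => rows               // keep everything; caller guarantees validity
      case other        => throw new IllegalArgumentException(s"unknown mode: $other")
    }
  }
}
```

For example, `applyHint(Seq(Some(Seq(1.0, 2.0)), None), 2, "skip")` keeps only the correctly sized row, while `"optimistic"` passes all rows through unchecked.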


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20285
  
**[Test build #86548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86548/testReport)**
 for PR 20285 at commit 
[`3055eec`](https://github.com/apache/spark/commit/3055eec72bb71e7fe7d586903fbf8ea57a70fa82).


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/157/
Test PASSed.


---




[GitHub] spark issue #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs and exa...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20285
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20285: [SPARK-22735][ML][DOC] Added VectorSizeHint docs ...

2018-01-23 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20285#discussion_r163389908
  
--- Diff: docs/ml-features.md ---
@@ -1283,6 +1283,56 @@ for more details on the API.
 
 
 
+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors 
for a column of
+`VectorType`. For example, `VectorAssembler` uses size information from 
its input columns to
+produce size information and metadata for its output column. While in some 
cases this information
+can be obtained by inspecting the contents of the column, in a streaming 
dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user 
to explicitly specify the
+vector size for a column so that `VectorAssembler`, or other transformers 
that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` 
parameters. Applying this
+transformer to a dataframe produces a new dataframe with updated metadata 
for `inputCol` specifying
+the vector size. Downstream operations on the resulting dataframe can get 
this size using the
+metadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which 
controls its
+behaviour when the vector column contains nulls or vectors of the wrong 
size. By default
+`handleInvalid` is set to "error", indicating an exception should be 
thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values 
should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should 
be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for 
ensuring that the column
--- End diff --

I've updated it, let me know if you think we can still make it more clear.


---




[GitHub] spark issue #20335: [SPARK-23088][CORE] History server not showing incomplet...

2018-01-23 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20335
  
This is a behavior change; it may be useful in your case, but we have to 
make sure it doesn't cause any regressions in other scenarios.  cc 
@gengliangwang 


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20361
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/156/
Test PASSed.


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20361
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread sameeragarwal
Github user sameeragarwal commented on the issue:

https://github.com/apache/spark/pull/20368
  
LGTM


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20361
  
**[Test build #86547 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86547/testReport)**
 for PR 20361 at commit 
[`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/20361
  
retest this please


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20361
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86542/
Test FAILed.


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20361
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20361
  
**[Test build #86542 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86542/testReport)**
 for PR 20361 at commit 
[`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---



