[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23063
  
Let me leave a cc for @holdenk, @MLnick, @jkbradley and @mengxr FYI.


---




[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
@zsxwing sure. Sorry that I rushed. Will do next time.


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
I think now would be a good time to match the behaviours.


---




[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23056
  
Merged to master.


---




[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23056
  
@BryanCutler, let me merge this. Let's do the ML one and then clean up the 
comments throughout both ML and MLlib at once.


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
Another related attempt: https://github.com/apache/spark/pull/13252


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
One earlier attempt to add some tests for reading/writing empty dataframes is here: 
https://github.com/apache/spark/pull/13253, FYI.


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
Which should be ... this https://github.com/apache/spark/pull/12855


---




[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23054
  
retest this please


---




[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23056
  
retest this please


---




[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23056
  
retest this please


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22309
  
Adding @liancheng BTW. IIRC, he took a look at this one before and 
abandoned the change (correct me if I'm remembering this wrongly).


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r234086569
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
   // each python worker gets an equal part of the allocation. the worker pool will grow to the
   // number of concurrent tasks, which is determined by the number of cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

> JVM could set the request

This is handled on the JVM side so it wouldn't break; the `worker` itself is strongly 
coupled to the JVM.

You mean the case where the client is on a Windows machine and it uses a 
Unix-based cluster, right? I think that is what the fix already does - the 
`PythonRunner`s are already created on the executor side, so the case where only 
the client is on Windows wouldn't be affected.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r234081475
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
   // each python worker gets an equal part of the allocation. the worker pool will grow to the
   // number of concurrent tasks, which is determined by the number of cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I see. I think the point of view is a bit different. What I was trying to 
do is this: we declare this configuration as not supported on Windows, meaning we 
disable it on Windows from the start, on the JVM side - because it's the JVM 
that launches the Python workers. So I was trying to leave the control to the JVM.

> It seems brittle to disable this on the JVM side and rely on it here. Can 
we also set a flag in the ImportError case and also check that here?

However, in a way, it's a bit odd to call it brittle because we're already 
relying on that.


---




[GitHub] spark issue #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py ...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23056
  
retest this please


---




[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23056#discussion_r234080468
  
--- Diff: python/pyspark/mllib/tests/test_linalg.py ---
@@ -0,0 +1,642 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+import array as pyarray
+
+from numpy import array, array_equal, zeros, arange, tile, ones, inf
+from numpy import sum as array_sum
+
+if sys.version_info[:2] <= (2, 6):
+try:
+import unittest2 as unittest
+except ImportError:
+sys.stderr.write('Please install unittest2 to test with Python 2.6 
or earlier')
+sys.exit(1)
+else:
+import unittest
+
+import pyspark.ml.linalg as newlinalg
+from pyspark.mllib.linalg import Vector, SparseVector, DenseVector, 
VectorUDT, _convert_to_vector, \
+DenseMatrix, SparseMatrix, Vectors, Matrices, MatrixUDT
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.testing.mllibutils import make_serializer, MLlibTestCase
+
+_have_scipy = False
+try:
+import scipy.sparse
+_have_scipy = True
+except:
+# No SciPy, but that's okay, we'll skip those tests
+pass
+
+
+ser = make_serializer()
+
+
+def _squared_distance(a, b):
+if isinstance(a, Vector):
+return a.squared_distance(b)
+else:
+return b.squared_distance(a)
+
+
+class VectorTests(MLlibTestCase):
+
+def _test_serialize(self, v):
+self.assertEqual(v, ser.loads(ser.dumps(v)))
+jvec = 
self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.loads(bytearray(ser.dumps(v)))
+nv = 
ser.loads(bytes(self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.dumps(jvec)))
+self.assertEqual(v, nv)
+vs = [v] * 100
+jvecs = 
self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.loads(bytearray(ser.dumps(vs)))
+nvs = 
ser.loads(bytes(self.sc._jvm.org.apache.spark.mllib.api.python.SerDe.dumps(jvecs)))
+self.assertEqual(vs, nvs)
+
+def test_serialize(self):
+self._test_serialize(DenseVector(range(10)))
+self._test_serialize(DenseVector(array([1., 2., 3., 4.])))
+self._test_serialize(DenseVector(pyarray.array('d', range(10
+self._test_serialize(SparseVector(4, {1: 1, 3: 2}))
+self._test_serialize(SparseVector(3, {}))
+self._test_serialize(DenseMatrix(2, 3, range(6)))
+sm1 = SparseMatrix(
+3, 4, [0, 2, 2, 4, 4], [1, 2, 1, 2], [1.0, 2.0, 4.0, 5.0])
+self._test_serialize(sm1)
+
+def test_dot(self):
+sv = SparseVector(4, {1: 1, 3: 2})
+dv = DenseVector(array([1., 2., 3., 4.]))
+lst = DenseVector([1, 2, 3, 4])
+mat = array([[1., 2., 3., 4.],
+ [1., 2., 3., 4.],
+ [1., 2., 3., 4.],
+ [1., 2., 3., 4.]])
+arr = pyarray.array('d', [0, 1, 2, 3])
+self.assertEqual(10.0, sv.dot(dv))
+self.assertTrue(array_equal(array([3., 6., 9., 12.]), sv.dot(mat)))
+self.assertEqual(30.0, dv.dot(dv))
+self.assertTrue(array_equal(array([10., 20., 30., 40.]), 
dv.dot(mat)))
+self.assertEqual(30.0, lst.dot(dv))
+self.assertTrue(array_equal(array([10., 20., 30., 40.]), 
lst.dot(mat)))
+self.assertEqual(7.0, sv.dot(arr))
+
+def test_squared_distance(self):
+sv = SparseVector(4, {1: 1, 3: 2})
+dv = DenseVector(array([1., 2., 3., 4.]))
+lst = DenseVector([4, 3, 2, 1])
+lst1 = [4, 3, 2, 1]
+arr = pyarray.array('d', [0, 2, 1, 3])
+narr = array([0, 2, 1, 3])
+self.assertEqual(15.0, _squared_distance(sv, dv))
+self.assertEqual(25.0, _squared_distance(sv, lst))
+self.assertEqual(20.0, _squared_distance(dv, lst))
+self.assertEqual(15.0, _

[GitHub] spark pull request #23056: [SPARK-26034][PYTHON][TESTS] Break large mllib/te...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23056#discussion_r234080249
  
--- Diff: python/pyspark/testing/mllibutils.py ---
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+
+if sys.version_info[:2] <= (2, 6):
+    try:
+        import unittest2 as unittest
+    except ImportError:
+        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
+        sys.exit(1)
+else:
+    import unittest
--- End diff --

@BryanCutler, actually I removed this because we dropped Python 2.6 support while 
we are here. I'm pretty sure we can just import unittest - roughly as sketched below.
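
For reference, a rough sketch of what this header collapses to once the 2.6 fallback is gone (illustrative only, not the final file; the test class name is made up):

```python
# With Python 2.6 support dropped, the conditional unittest2 fallback is
# unnecessary; a plain import is enough for the shared test utilities.
import unittest


class ExampleTestCase(unittest.TestCase):  # hypothetical test case
    def test_truth(self):
        self.assertTrue(True)
```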


---




[GitHub] spark pull request #23046: [SPARK-23207][SQL][FOLLOW-UP] Use `SQLConf.get.en...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23046#discussion_r234073703
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec {
   }
   // The comparator for comparing row hashcode, which should always be Integer.
   val prefixComparator = PrefixComparators.LONG
-  val canUseRadixSort = SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)
+  val canUseRadixSort = SQLConf.get.enableRadixSort
--- End diff --

Yes .. I don't mind it, but I was just thinking that we don't necessarily need to 
backport this to all the branches if there's any concern. I will leave it to you 
guys as well. 


---




[GitHub] spark issue #23055: [SPARK-26080][SQL] Disable 'spark.executor.pyspark.memor...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
cc @rdblue, @vanzin and @haydenjeune


---




[GitHub] spark pull request #23055: [SPARK-26080][SQL] Disable 'spark.executor.pyspar...

2018-11-15 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23055

[SPARK-26080][SQL] Disable 'spark.executor.pyspark.memory' always on Windows

## What changes were proposed in this pull request?

The `resource` package is a Unix-specific package. See 
https://docs.python.org/2/library/resource.html and 
https://docs.python.org/3/library/resource.html.
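
For illustration, a minimal sketch (not the actual patch) of the kind of guard the Unix-only `resource` module needs on the Python side; the helper name below is hypothetical:

```python
# Minimal sketch: 'resource' only exists on Unix-like systems, so any
# memory-limit handling has to be skipped gracefully where it is missing (e.g. Windows).
try:
    import resource
    has_resource_module = True
except ImportError:
    has_resource_module = False


def maybe_limit_worker_memory(limit_bytes):  # hypothetical helper
    # Cap the process address space only when 'resource' is usable and a limit is set.
    if has_resource_module and limit_bytes > 0:
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```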

Note that we document Windows support:

> Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). 

This should be backported into branch-2.4 to restore Windows support in 
Spark 2.4.1.

## How was this patch tested?

Manually tested by mocking the changed logic.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23055.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23055


commit 2d3315a7dab429abc4d9ef5ed7f8f5484e8421f1
Author: hyukjinkwon 
Date:   2018-11-16T01:46:31Z

Disable 'spark.executor.pyspark.memory' on Windows always




---




[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
Thank you @BryanCutler.


---




[GitHub] spark pull request #23046: [SPARK-23207][SQL][FOLLOW-UP] Use `SQLConf.get.en...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23046#discussion_r234063905
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
@@ -280,7 +280,7 @@ object ShuffleExchangeExec {
   }
   // The comparator for comparing row hashcode, which should always be Integer.
   val prefixComparator = PrefixComparators.LONG
-  val canUseRadixSort = SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)
+  val canUseRadixSort = SQLConf.get.enableRadixSort
--- End diff --

@ueshin, BTW, just for clarification: it does read the configuration but does 
not respect it when the configuration is set on the session, right? I think we 
don't need to backport this to all the other branches.
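
A small PySpark illustration of the session-scoped vs. static conf distinction (assuming the flag key is `spark.sql.sort.enableRadixSort`; this only illustrates the review point, it is not the patch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Session-scoped SQL conf: visible through the session's runtime conf ...
spark.conf.set("spark.sql.sort.enableRadixSort", "false")
print(spark.conf.get("spark.sql.sort.enableRadixSort"))  # false

# ... but the static SparkContext conf (what SparkEnv.get.conf exposes) is unchanged.
print(spark.sparkContext.getConf().get("spark.sql.sort.enableRadixSort", "not set"))  # not set
```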


---




[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23052#discussion_r234062564
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -174,13 +174,18 @@ private[csv] class CsvOutputWriter(
  context: TaskAttemptContext,
  params: CSVOptions) extends OutputWriter with Logging {
 
-  private val charset = Charset.forName(params.charset)
+  private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-
-  private val gen = new UnivocityGenerator(dataSchema, writer, params)
+  override def write(row: InternalRow): Unit = {
+    val gen = univocityGenerator.getOrElse {
--- End diff --

Also, one thing we should not forget is that CSV _could_ have headers 
even if there are no records (see the small illustration below).
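
For illustration, a hedged sketch (the path and schema below are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").getOrCreate()
schema = StructType([StructField("name", StringType()), StructField("city", StringType())])
empty_df = spark.createDataFrame([], schema)

# With header=True, a "no rows => no file" rule also drops the header line,
# which changes what a reader can recover from the output directory.
empty_df.write.mode("overwrite").option("header", True).csv("/tmp/empty_csv_example")
spark.read.option("header", True).csv("/tmp/empty_csv_example").printSchema()
```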


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
@MaxGekk, actually this is kind of an important behaviour change. It 
basically means we're unable to read the empty output back. Similar changes were 
proposed for Parquet a few years ago (by me) and reverted.

We should investigate and match the behaviours across datasources first. 
IIRC, ORC does not create files for empty partitions (if that hasn't changed since I 
last checked, long ago) but Parquet does.
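
One rough way to compare the current behaviours across sources (a hedged sketch; paths are arbitrary and the format list is just a sample):

```python
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
empty_df = spark.range(0)  # zero rows, single "id" column

# Write the same empty DataFrame with several sources and see which ones
# leave data files behind for empty partitions.
for fmt in ["parquet", "orc", "json", "csv"]:
    path = "/tmp/empty_partitions_%s" % fmt
    empty_df.write.mode("overwrite").format(fmt).save(path)
    data_files = [f for f in glob.glob(path + "/*") if not f.endswith("_SUCCESS")]
    print(fmt, len(data_files))
```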


---




[GitHub] spark issue #23047: [BACKPORT][SPARK-25883][SQL][MINOR] Override method `pre...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23047
  
Merged to branch-2.4.


---




[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
Also, @BryanCutler, I think we can talk about the locations of 
`testing/...util.py` later, once we have finished splitting the tests. Moving the utils 
afterwards would probably cause fewer conflicts, so it should be fine to discuss that 
separately if it's a worry, and change it then.


---




[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
Merged to master.


---




[GitHub] spark issue #23034: [SPARK-26035][PYTHON] Break large streaming/tests.py fil...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
@BryanCutler, should be ready to work on ML and MLlib as well.


---




[GitHub] spark pull request #23034: [SPARK-26035][PYTHON] Break large streaming/tests...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23034#discussion_r233829511
  
--- Diff: python/pyspark/testing/streamingutils.py ---
@@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import glob
+import os
+import tempfile
+import time
+import unittest
+
+from pyspark import SparkConf, SparkContext, RDD
+from pyspark.streaming import StreamingContext
+
+
+def search_kinesis_asl_assembly_jar():
+    kinesis_asl_assembly_dir = os.path.join(
+        os.environ["SPARK_HOME"], "external/kinesis-asl-assembly")
+
+    # We should ignore the following jars
+    ignored_jar_suffixes = ("javadoc.jar", "sources.jar", "test-sources.jar", "tests.jar")
+
+    # Search jar in the project dir using the jar name_prefix for both sbt build and maven
+    # build because the artifact jars are in different directories.
+    name_prefix = "spark-streaming-kinesis-asl-assembly"
+    sbt_build = glob.glob(os.path.join(
+        kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
+    maven_build = glob.glob(os.path.join(
+        kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
+    jar_paths = sbt_build + maven_build
+    jars = [jar for jar in jar_paths if not jar.endswith(ignored_jar_suffixes)]
+
+    if not jars:
+        return None
+    elif len(jars) > 1:
+        raise Exception(("Found multiple Spark Streaming Kinesis ASL assembly JARs: %s; please "
+                         "remove all but one") % (", ".join(jars)))
+    else:
+        return jars[0]
+
+
+# Must be same as the variable and condition defined in KinesisTestUtils.scala and modules.py
+kinesis_test_environ_var = "ENABLE_KINESIS_TESTS"
+should_skip_kinesis_tests = not os.environ.get(kinesis_test_environ_var) == '1'
+kinesis_asl_assembly_jar = search_kinesis_asl_assembly_jar()
+
+if should_skip_kinesis_tests:
--- End diff --

I simplified this logic and tested each if-else branch.


---




[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
Will go and merge this tomorrow if there are no outstanding issues.

cc @zsxwing and @tdas.



---




[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
Merged to master.

Thanks @felixcheung.


---




[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...

2018-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
retest this please


---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
Merged to master.


---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
I am merging this for the same reason as #23021. Let me know if there's 
any concern even after this gets merged.


---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
@BryanCutler, looks like we should add `pyspark.ml.tests` at 
https://github.com/apache/spark/blob/master/python/run-tests.py#L252-L253 so 
that we run the unittests before the doctests (because arguably the unittests take 
longer than the doctests).

I think this was missed when `ml/tests.py` was added. For instance, the latest 
run above took a few minutes longer than usual because the ML tests ran last.
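
To illustrate the scheduling idea only (a hypothetical sketch, not the actual `run-tests.py` logic, which lives around the lines linked above):

```python
# Schedule the slow unittest modules before the quicker doctest modules so the
# long-running ones start early and overlap with the rest of the run.
unittest_modules = {"pyspark.tests", "pyspark.sql.tests", "pyspark.ml.tests", "pyspark.mllib.tests"}

def priority(module_name):
    # Lower value runs first.
    return 0 if module_name in unittest_modules else 1

all_modules = ["pyspark.sql.functions", "pyspark.ml.tests", "pyspark.rdd", "pyspark.tests"]
print(sorted(all_modules, key=priority))
```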


---




[GitHub] spark pull request #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hi...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20788#discussion_r233678942
  
--- Diff: python/pyspark/sql/tests/test_dataframe.py ---
@@ -375,6 +375,19 @@ def test_generic_hints(self):
         plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan()
         self.assertEqual(1, plan.toString().count("BroadcastHashJoin"))
 
+    # add tests for SPARK-23647 (test more types for hint)
+    def test_extended_hint_types(self):
+        from pyspark.sql import DataFrame
+
+        df = self.spark.range(10e10).toDF("id")
+        such_a_nice_list = ["itworks1", "itworks2", "itworks3"]
--- End diff --

@DylanGuedes, what do we get if a `dict` is given? It doesn't look like that's tested. If 
it produces a weird result, we can disallow it here for now.


---




[GitHub] spark issue #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hint in p...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20788
  
Thanks. @DylanGuedes.


---




[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23014
  
Merged to master.


---




[GitHub] spark issue #23039: [SPARK-26066][SQL] Moving truncatedString to sql/catalys...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23039
  
@MaxGekk, I think the main purpose of this PR is rather to introduce 
`spark.sql.debug.maxToStringFields` .. let's fix the PR description and title to 
reflect that.


---




[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
Yup, will address the other comments and update the PR accordingly.


---




[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
retest this please


---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
Yup will do.


---




[GitHub] spark issue #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/tests.p...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23034
  
I haven't tested the kinesis logic yet. I will check it via Jenkins.

Line counts:

```
 751 ./test_dstream.py
  89 ./test_kinesis.py
 158 ./test_listener.py
 184 ./test_context.py
```




---




[GitHub] spark pull request #23034: [WIP][SPARK-26035][PYTHON] Break large streaming/...

2018-11-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23034

[WIP][SPARK-26035][PYTHON] Break large streaming/tests.py files into 
smaller files

## What changes were proposed in this pull request?

This PR continues breaking a single large file down into smaller files. See 
https://github.com/apache/spark/pull/23021. It aims to follow the structure of 
https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/streaming/tests.py` into 
...:

```
pyspark
├── __init__.py
...
├── streaming
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_context.py
│   │   ├── test_dstream.py
│   │   ├── test_kinesis.py
│   │   └── test_listener.py
...
├── testing
...
│   ├── streamingutils.py
...
```


## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked that they are 
actually being run.

Each test (not officially) can be ran via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to 
`OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26035

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23034.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23034


commit 6c02003b947a97944f6a48e57be5b7c98dbda896
Author: hyukjinkwon 
Date:   2018-11-14T15:15:56Z

[SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller 
files




---




[GitHub] spark issue #21914: [SPARK-24967][SQL] Avro: Use internal.Logging instead fo...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21914
  
Please ask that to the mailing list.


---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
Rough line distributions look like this:

```
 237 ./test_serializers.py
 739 ./test_rdd.py
 499 ./test_readwrite.py
  69 ./test_join.py
 161 ./test_taskcontext.py
  43 ./test_conf.py
 122 ./test_broadcast.py
  80 ./test_daemon.py
  86 ./test_util.py
 157 ./test_worker.py
 112 ./test_profiler.py
 181 ./test_shuffle.py
 258 ./test_context.py
 248 ./test_appsubmit.py
```





---




[GitHub] spark issue #23033: [SPARK-26036][PYTHON] Break large tests.py files into sm...

2018-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23033
  
cc'ing @BryanCutler and @squito.


---




[GitHub] spark pull request #23033: [SPARK-26036][PYTHON] Break large tests.py files ...

2018-11-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23033

[SPARK-26036][PYTHON] Break large tests.py files into smaller files

## What changes were proposed in this pull request?

This PR continues breaking a single large file down into smaller files. See 
https://github.com/apache/spark/pull/23021. It aims to follow the structure of 
https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/tests.py` into ...:

```
pyspark
...
├── testing
...
│   └── utils.py
├── tests
│   ├── __init__.py
│   ├── test_appsubmit.py
│   ├── test_broadcast.py
│   ├── test_conf.py
│   ├── test_context.py
│   ├── test_daemon.py
│   ├── test_join.py
│   ├── test_profiler.py
│   ├── test_rdd.py
│   ├── test_readwrite.py
│   ├── test_serializers.py
│   ├── test_shuffle.py
│   ├── test_taskcontext.py
│   ├── test_util.py
│   └── test_worker.py
...
```


## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked that they are 
actually being run.

Each test (not officially) can be ran via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to 
`OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26036

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23033.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23033


commit 08cb59eeda3fd3b1042013f72f6fc45ea1146bd1
Author: hyukjinkwon 
Date:   2018-11-14T10:16:13Z

Break large tests.py files into smaller files




---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
Merged to master.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
I am merging this in - maybe I am rushing it, but please allow me to go 
ahead since it's going to block other PySpark PRs.

In the worst case, I am willing to revert and propose this again if there's 
something obviously wrong. Please feel free to leave comments even after 
this is merged. I would appreciate it.


---




[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
Ah .. right, makes sense to me. Thanks @shaneknapp. +1


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
adding @holdenk, @ueshin and @icexelloss as well.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
adding @icexelloss as well.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
> Did you test on python3 as well?

Of course!


---




[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r233292436
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+    shouldUseArrow <- tryCatch({
+      stopifnot(length(data) > 0)
+      dataHead <- head(data, 1)
+      # Currenty Arrow optimization does not support POSIXct and raw for now.
+      # Also, it does not support explicit float type set by users. It leads to
+      # incorrect conversion. We will fall back to the path without Arrow optimization.
+      if (any(sapply(dataHead, function(x) is(x, "POSIXct")))) {
+        stop("Arrow optimization with R DataFrame does not support POSIXct type yet.")
+      }
+      if (any(sapply(dataHead, is.raw))) {
+        stop("Arrow optimization with R DataFrame does not support raw type yet.")
+      }
+      if (inherits(schema, "structType")) {
+        if (any(sapply(schema$fields(), function(x) x$dataType.toString() == "FloatType"))) {
+          stop("Arrow optimization with R DataFrame does not support FloatType type yet.")
--- End diff --

I think it's a bug because it always produces a corrupt value when I try to 
use `number` as an explicit float type.


---




[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
@shaneknapp, do you roughly know how difficult it would be (and whether you have some 
time shortly) to upgrade R from 3.1 to 3.4? I am asking because I had some 
difficulties when I tried to manually upgrade from a certain low version to 
another non-latest version.

If it's expected to take a while, let's go with the deprecation step.
If it's expected to be less difficult, let's go with declaring it unsupported.

Does this sound okay to you, @felixcheung?


---




[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r233290797
  
--- Diff: docs/index.md ---
@@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy
 locally on one machine --- all you need is to have `java` installed on your system `PATH`,
 or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}}
+Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0.
--- End diff --

At least the Python version was noted here before.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
> Could you add some descriptions to run a single test file or a single 
test case if exists?

Done!


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
Yup!


---




[GitHub] spark issue #22994: [BUILD] refactor dev/lint-python in to something readabl...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22994
  
I haven't taken a super close look but the idea itself looks okay. Is it 
urgent? If yes, yup, I don't object to going ahead right away. Otherwise, it might be 
good to leave this open for a few days for review comments .. 

Let me leave some cc's for @srowen, @felixcheung, @holdenk ..


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
I am going to push after testing and double checking. The line counts would 
look like this

```
  54 ./test_utils.py
 199 ./test_catalog.py
 503 ./test_grouped_agg_pandas_udf.py
  45 ./test_group.py
 320 ./test_session.py
 153 ./test_readwriter.py
 806 ./test_scalar_pandas_udf.py
 216 ./test_pandas_udf.py
 566 ./test_streaming.py
  55 ./test_conf.py
  16 ./__init__.py
 530 ./test_grouped_map_pandas_udf.py
 157 ./test_column.py
 654 ./test_udf.py
 262 ./test_window_pandas_udf.py
 278 ./test_functions.py
 263 ./test_context.py
 138 ./test_serde.py
 170 ./test_datasources.py
 399 ./test_arrow.py
  96 ./test_appsubmit.py
 944 ./test_types.py
 737 ./test_dataframe.py
```


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
> I'd break the pandas udf one into smaller pieces too, as you suggested. 
We should also investigate why the runtime didn't improve ...

One suspicion from my investigation is that it requires stopping and starting the 
context for each test, which is costly. I also expected it to slightly 
decrease the time, but it actually looks like it slightly increased the time (I 
guess by 2 ~ 3 mins in total? - shouldn't be a big deal). A sketch of the cheaper 
per-class reuse pattern is below.
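
For reference, a minimal sketch of that per-class reuse pattern (class and test names here are illustrative, not the actual test code):

```python
import unittest
from pyspark import SparkConf, SparkContext


class ReusedContextTests(unittest.TestCase):
    # One SparkContext per test class instead of per test method, so the
    # expensive stop/start happens once for the whole class.
    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", cls.__name__, conf=SparkConf())

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_simple_count(self):
        self.assertEqual(self.sc.parallelize(range(10)).count(), 10)
```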


---




[GitHub] spark pull request #23021: [SPARK-26032][PYTHON] Break large sql/tests.py fi...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23021#discussion_r233269827
  
--- Diff: python/pyspark/testing/sqlutils.py ---
@@ -0,0 +1,268 @@
+#
--- End diff --

Yea, similar thought. One thing, though, is that the testing utils themselves have some 
dependencies on each other (for instance, `from pyspark.tests import 
ReusedPySparkTestCase`). So I thought it's okay to keep them in a separate package.

Another benefit of doing so is that we can simply follow the example (NumPy) as 
is. As you already know, NumPy has been around for more than 15 years and should be 
mature enough. It should be okay to trust their approach.
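
As a rough illustration of the layout (assuming the shared base class lands in `pyspark.testing.sqlutils` as in this PR; the test class below is made up):

```python
# A test module imports its shared fixtures from the dedicated pyspark.testing
# package instead of from another test file, keeping cross-dependencies in one place.
from pyspark.testing.sqlutils import ReusedSQLTestCase


class ExampleSQLTests(ReusedSQLTestCase):  # hypothetical test class
    def test_range_count(self):
        self.assertEqual(self.spark.range(5).count(), 5)
```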


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
Yup, will break pandas one into smaller ones as well.


---




[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22962
  
Also please fix the test. The test doesn't really look clear; I actually 
didn't quite like the test as it's written now.


---




[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22962#discussion_r233131375
  
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Also, next time please fully describe what's going on in the PR description. 
Even I was confused about it and misread it as `__new__` actually being 
inherited.


---




[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22962#discussion_r233130494
  
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Also, we should remove `__init__` too. That's what the Python interpreter will 
implicitly insert for both.


---




[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22962#discussion_r233130221
  
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

Can we get rid of the override altogether? It's never a good idea to overwrite 
these unless it's absolutely required.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
Elapsed time looks virtually the same. All tests look to be running fine. The last 
commit should show the skipped tests properly as well. Should be ready for a look.


---




[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23004#discussion_r233012392
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
   if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
   list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
     l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+  statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+    l <= statsFor(a).upperBound.substr(0, Length(l))
+case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --

BTW, a.startswith(b) and b.startswith(a) are not the same, so why are they 
treated the same here?


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r233009506
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala ---
@@ -104,6 +106,14 @@ class UnivocityParser(
  requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
   }
 
+  private val decimalParser = if (SQLConf.get.legacyDecimalParsing) {
+    (s: String) => new BigDecimal(s.replaceAll(",", ""))
--- End diff --

@MaxGekk, looks like we should do the same thing in the schema inference path as 
well.


---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
To all: so how about we start with the fix @wangyum tried before? If we 
generally agree on the direction itself - upgrading Hive to 2.3 (or 3) - I 
would like to encourage him to continue #20659.


---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
The test failure itself doesn't look like it was caused by this change. The tests will 
fail anyway with a different error message.

If the goal is really just to check whether the tests pass or not, you should 
use the `com.github.hyukjinkwon` artifact instead of `org.spark-project.hive` in 
Spark, which I released under my personal domain to test this, as I did above.

We should fix the Hive issue we're discussing right now first, so that we 
can safely merge this PR and verify that the Hadoop 3 profile works fine.


---




[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22962#discussion_r232991503
  
--- Diff: python/pyspark/taskcontext.py ---
@@ -147,8 +147,8 @@ def __init__(self):
     @classmethod
     def _getOrCreate(cls):
         """Internal function to get or create global BarrierTaskContext."""
-        if cls._taskContext is None:
-            cls._taskContext = BarrierTaskContext()
+        if not isinstance(cls._taskContext, BarrierTaskContext):
+            cls._taskContext = object.__new__(cls)
--- End diff --

BTW, I think we should just use `BarrierTaskContext()`. Let's not make it 
complicated next time.
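
A rough, self-contained sketch of the simpler shape (stand-in classes for illustration, not the actual pyspark code):

```python
class TaskContext(object):
    _taskContext = None


class BarrierTaskContext(TaskContext):
    @classmethod
    def _getOrCreate(cls):
        # Replace the cached context unless it is already a BarrierTaskContext;
        # plain construction avoids calling object.__new__ directly.
        if not isinstance(cls._taskContext, BarrierTaskContext):
            cls._taskContext = BarrierTaskContext()
        return cls._taskContext
```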


---




[GitHub] spark issue #22962: [SPARK-25921][PySpark] Fix barrier task run without Barr...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22962
  
The main code change LGTM too in any event


---




[GitHub] spark pull request #22962: [SPARK-25921][PySpark] Fix barrier task run witho...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22962#discussion_r232990319
  
--- Diff: python/pyspark/tests.py ---
@@ -618,10 +618,13 @@ def test_barrier_with_python_worker_reuse(self):
 """
 Verify that BarrierTaskContext.barrier() with reused python worker.
 """
+self.sc._conf.set("spark.python.work.reuse", "true")
--- End diff --

Yup. Sorry for the late response.


---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
For your information, here's the line counts for each file:

```
  52 ./test_utils.py
 197 ./test_catalog.py
  43 ./test_group.py
 318 ./test_session.py
 151 ./test_readwriter.py
2180 ./test_pandas_udf.py
 564 ./test_streaming.py
  53 ./test_conf.py
 155 ./test_column.py
 652 ./test_udf.py
 276 ./test_functions.py
 261 ./test_context.py
 136 ./test_serde.py
 168 ./test_datasources.py
 397 ./test_arrow.py
  94 ./test_appsubmit.py
 942 ./test_types.py
 735 ./test_dataframe.py
```

They are rather evenly distributed, except for `./test_pandas_udf.py`. Maybe 
we should make `test_scalar_pandas_udf.py`, etc., but I would like to avoid 
that for now.
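
For reference, a rough sketch of how such per-file counts could be reproduced; 
the directory path below is an assumption, not part of this PR:

```python
# Rough sketch to reproduce the per-file line counts; the path is assumed.
import os

test_dir = "python/pyspark/sql/tests"
for name in sorted(os.listdir(test_dir)):
    if name.startswith("test_") and name.endswith(".py"):
        with open(os.path.join(test_dir, name)) as f:
            print(sum(1 for _ in f), name)
```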


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
FWIW, I at least double-checked whether any tests are missing, and whether 
they are actually being run (via coverage).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
adding @rxin (derived from mailing list)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23021
  
@BryanCutler and @squito, here is the official first attempt to break 
`pyspark/sql/tests.py` into multiple small files.

If there are no outstanding issues (for instance, if we agree on the 
high-level approach), I would like to push this through quickly, as it is very 
likely to run into merge conflicts.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23021: [SPARK-26032][PYTHON] Break large sql/tests.py fi...

2018-11-13 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23021

[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files

## What changes were proposed in this pull request?

This is the official first attempt to break up the huge single `tests.py` file - I 
did it locally a few times before and gave up for various reasons. Currently, it 
really makes the unit tests super hard to read and difficult to check. It even 
bothers me to scroll through the big file. It's one single 7000-line file!

This is not only a readability issue. Since one big test file takes most of the 
test time, the tests don't run fully in parallel, which increases the overall test time.

We could pick one example and follow it. Based on my investigation, the 
style here looks closest to NumPy's structure and is easier to follow. 
Please see https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/sql/tests.py` into ...:

```bash
pyspark
...
├── sql
...
│   │   ├── tests  # Includes all tests broken down from 'pyspark/sql/tests.py'.
│   │   │  # Each matches a module in 'pyspark/sql'. Additionally, some logical groups were added;
│   │   │  # for instance, 'test_arrow.py', 'test_datasources.py', 'test_serde.py' and 'test_submit.py'.
│   │   ├── __init__.py
│   │   ├── test_arrow.py
│   │   ├── test_catalog.py
│   │   ├── test_column.py
│   │   ├── test_conf.py
│   │   ├── test_context.py
│   │   ├── test_dataframe.py
│   │   ├── test_datasources.py
│   │   ├── test_functions.py
│   │   ├── test_group.py
│   │   ├── test_readwriter.py
│   │   ├── test_serde.py
│   │   ├── test_session.py
│   │   ├── test_streaming.py
│   │   ├── test_submit.py
│   │   ├── test_types.py
│   │   ├── test_udf.py
│   │   └── test_utils.py
...
├── testing  # Includes testing utils that can be used in unittests.
│   ├── __init__.py
│   └── sqlutils.py
...
```

## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked that they are 
actually being run.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25344

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23021.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23021


commit dee09b73757bbad73cbe274d6a9873d8cc143423
Author: hyukjinkwon 
Date:   2018-11-13T08:41:57Z

Break larga sql/tests.py files into smaller files




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23020: [MINOR][BUILD] Remove *.crc from .gitignore

2018-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23020
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232895848
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == 
"true"
+  shouldUseArrow <- FALSE
+  firstRow <- NULL
   if (is.data.frame(data)) {
-  # Convert data into a list of rows. Each row is a list.
-
-  # get the names of columns, they will be put into RDD
-  if (is.null(schema)) {
-schema <- names(data)
-  }
+# get the names of columns, they will be put into RDD
+if (is.null(schema)) {
+  schema <- names(data)
+}
 
-  # get rid of factor type
-  cleanCols <- function(x) {
-if (is.factor(x)) {
-  as.character(x)
-} else {
-  x
-}
+# get rid of factor type
+cleanCols <- function(x) {
+  if (is.factor(x)) {
+as.character(x)
+  } else {
+x
   }
+}
+data[] <- lapply(data, cleanCols)
+
+args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+if (arrowEnabled) {
+  shouldUseArrow <- tryCatch({
--- End diff --

Yup, correct. Let me address the other comments as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23006: [SPARK-26007][SQL] DataFrameReader.csv() respects to spa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23006
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23014: [MINOR][SQL] Add disable bucketedRead workaround ...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23014#discussion_r232893546
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -101,10 +101,11 @@ private void throwUnsupportedException(int 
requiredCapacity, Throwable cause) {
 String message = "Cannot reserve additional contiguous bytes in the 
vectorized reader (" +
 (requiredCapacity >= 0 ? "requested " + requiredCapacity + " 
bytes" : "integer overflow") +
 "). As a workaround, you can reduce the vectorized reader batch 
size, or disable the " +
-"vectorized reader. For parquet file format, refer to " +
+"vectorized reader, or disable " + 
SQLConf.BUCKETING_ENABLED().key() + " if you read " +
+"from bucket table. For Parquet file format, refer to " +
 SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() +
 " (default " + 
SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().defaultValueString() +
-") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; 
for orc file format, " +
+") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; 
for Orc file format, " +
--- End diff --

`Orc` is `ORC` BTW :-).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23014
  
> The reason is that each bucket file is too big

Can you elaborate, please? Is it because we don't chunk each file into 
multiple splits when we read a bucketed table?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23014: [MINOR][SQL] Add disable bucketedRead workaround ...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23014#discussion_r232885260
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
 ---
@@ -101,7 +101,8 @@ private void throwUnsupportedException(int 
requiredCapacity, Throwable cause) {
 String message = "Cannot reserve additional contiguous bytes in the 
vectorized reader (" +
 (requiredCapacity >= 0 ? "requested " + requiredCapacity + " 
bytes" : "integer overflow") +
 "). As a workaround, you can reduce the vectorized reader batch 
size, or disable the " +
-"vectorized reader. For parquet file format, refer to " +
+"vectorized reader, or disable " + 
SQLConf.BUCKETING_ENABLED().key() + " if you read " +
+"from bucket table. For parquet file format, refer to " +
--- End diff --

parquet -> Parquet


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23018: [SPARK-26023][SQL] Dumping truncated plans and generated...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23018
  
Looks fine to me. Adding @cloud-fan and @hvanhovell.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23018: [SPARK-26023][SQL] Dumping truncated plans and ge...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23018#discussion_r232883084
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala 
---
@@ -469,7 +471,21 @@ abstract class TreeNode[BaseType <: 
TreeNode[BaseType]] extends Product {
   def treeString: String = treeString(verbose = true)
 
   def treeString(verbose: Boolean, addSuffix: Boolean = false): String = {
-generateTreeString(0, Nil, new StringBuilder, verbose = verbose, 
addSuffix = addSuffix).toString
+val writer = new StringBuilderWriter()
+try {
+  treeString(writer, verbose, addSuffix, None)
+  writer.toString
+} finally {
+  writer.close()
+}
+  }
+
+  def treeString(
+  writer: Writer,
+  verbose: Boolean,
+  addSuffix: Boolean,
+  maxFields: Option[Int]): Unit = {
+generateTreeString(0, Nil, writer, verbose, "", addSuffix)
--- End diff --

If #22879 is merged first, we should add that function here; if this one is 
merged first, that PR should add the function instead.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
This way, we could postpone the R upgrade in Jenkins until after the Spark 
3.0.0 release, and could still test the deprecated R version 3.1.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
Nice, thanks! BTW Felix, are you perhaps worried that, if we happen to 
upgrade the R version in Jenkins to 3.4, we could break support for the lower, 
deprecated R versions in Spark 3.0?

If so, let me put the version check into both places, `generic.R` and 
`shell.R`. This way, both the shell and submit still show the errors, but the 
tests will pass.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
Thank you all!!


---


[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
BTW, let's test them end-to-end. For instance, 
`spark.range(1).rdd.map(lambda blabla).count()`.
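
A hedged sketch of what such an end-to-end check could look like - the 
namedtuple and the names below are only for illustration:

```python
# Sketch of an end-to-end check: the lambda and the namedtuple it captures must
# survive serialization to the Python workers. Names here are illustrative only.
from collections import namedtuple

from pyspark.sql import SparkSession

Point = namedtuple("Point", ["x", "y"])

spark = SparkSession.builder.master("local[2]").getOrCreate()
count = spark.range(1).rdd.map(lambda row: Point(row.id, row.id * 2).x).count()
print(count)  # expected: 1
spark.stop()
```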


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
If the perf diff is big, let's not make the change, but document that we can use 
`CloudPickleSerializer()` to avoid a breaking change.

If the perf diff is rather trivial, let's check whether we can keep this change. 
I will help check the perf in this case as well.
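
A minimal sketch of that opt-in, assuming the standard `SparkContext` 
constructor; this is not something the PR itself changes:

```python
# Minimal sketch: opting in to cloudpickle for data serialization, which can
# pickle namedtuples without the hijack patch discussed here.
from pyspark import SparkConf, SparkContext
from pyspark.serializers import CloudPickleSerializer

conf = SparkConf().setMaster("local[2]").setAppName("cloudpickle-opt-in")
sc = SparkContext(conf=conf, serializer=CloudPickleSerializer())
```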



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
Nope, it has to be done manually... it would be great to have, FWIW.

I am not yet sure how we're going to measure the performance. I think you 
can show the performance diff for namedtuple for now - that will, at the 
very least, give some numbers to compare.
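
As a rough illustration, a micro-benchmark sketch along those lines - the data 
shape and iteration counts are arbitrary, and it assumes pyspark's vendored 
`cloudpickle` module:

```python
# Rough micro-benchmark sketch: plain pickle vs. cloudpickle on namedtuple data.
# Sizes and iteration counts are arbitrary.
import pickle
import timeit
from collections import namedtuple

from pyspark import cloudpickle

Point = namedtuple("Point", ["x", "y"])
data = [Point(i, i * 2) for i in range(10000)]

print("pickle     :", timeit.timeit(lambda: pickle.dumps(data), number=100))
print("cloudpickle:", timeit.timeit(lambda: cloudpickle.dumps(data), number=100))
```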


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23011: [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23011
  
Merged to master.

Thanks, @srowen.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r232742917
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -283,6 +283,10 @@ sparkR.session <- function(
   enableHiveSupport = TRUE,
   ...) {
 
+  if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), 
"3.4.0") == -1) {
+warning("R prior to version 3.4 is deprecated as of Spark 3.0.")
+  }
--- End diff --

@felixcheung, I updated the condition to check major version as well.

BTW, if we add it to .First in general.R, it looks like it's not going to 
print out the warnings in the SparkR shell. When we did this for PySpark, we 
did it in a similar place 
(https://github.com/apache/spark/pull/15733/files#diff-01e513a19a2a6f31ecaabd3dd6ac12d9R192).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23012
  
Yea, I will take a look and address it. But regarding documenting it as 
unsupported: if we are going to explicitly say it's unsupported and dropped, for 
instance, we should remove the compatibility change 
(https://github.com/apache/spark/blob/master/R/pkg/src-native/string_hash_code.c)
 and I assume previous versions wouldn't work. A deprecation step might be more 
conservative and consistent with the dropping steps of the other language APIs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22429
  
Oops, I rushed reading it. Yea, it still sounds related but orthogonal. Let's 
move it to the mailing list. That should be the best place to discuss it further.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22429
  
@boy-uber, for Structured Streaming, let's handle it outside of this PR. I think 
the actual change in this PR can be small (1.). We can change this API for 
Structured Streaming later if needed, since it is just an internal private 
API. Change (2.) can be shared, I guess, even if we go for the Structured 
Streaming work.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


