[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22772
  
Minor fixes are minor .. no need to rush to get this into 2.4. Let's take a look a few more times before going ahead.


---




[GitHub] spark issue #22770: [SPARK-25771][PYSPARK]Fix improper synchronization in Py...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22770
  
Please fill `How was this patch tested?` as well in the PR description.


---





[GitHub] spark pull request #22770: [SPARK-25771][PYSPARK]Fix improper synchronizatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22770#discussion_r226515494
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -31,15 +32,15 @@ import org.apache.spark.security.SocketAuthHelper
 import org.apache.spark.util.{RedirectThread, Utils}
 
 private[spark] class PythonWorkerFactory(pythonExec: String, envVars: 
Map[String, String])
-  extends Logging {
+  extends Logging { self =>
 
   import PythonWorkerFactory._
 
   // Because forking processes from Java is expensive, we prefer to launch 
a single Python daemon,
   // pyspark/daemon.py (by default) and tell it to fork new workers for 
our tasks. This daemon
   // currently only works on UNIX-based systems now because it uses 
signals for child management,
   // so we can also fall back to launching workers, pyspark/worker.py (by 
default) directly.
-  val useDaemon = {
+  private val useDaemon = {
--- End diff --

`daemonModule` and `workerModule` too.


---




[GitHub] spark pull request #22770: [SPARK-25771][PYSPARK]Fix improper synchronizatio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22770#discussion_r226515461
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -31,15 +32,15 @@ import org.apache.spark.security.SocketAuthHelper
 import org.apache.spark.util.{RedirectThread, Utils}
 
 private[spark] class PythonWorkerFactory(pythonExec: String, envVars: 
Map[String, String])
-  extends Logging {
+  extends Logging { self =>
 
   import PythonWorkerFactory._
 
   // Because forking processes from Java is expensive, we prefer to launch 
a single Python daemon,
   // pyspark/daemon.py (by default) and tell it to fork new workers for 
our tasks. This daemon
   // currently only works on UNIX-based systems now because it uses 
signals for child management,
   // so we can also fall back to launching workers, pyspark/worker.py (by 
default) directly.
-  val useDaemon = {
+  private val useDaemon = {
--- End diff --

Why are we fixing this? It looks not directly related to the fix, and this is already a private class.


---




[GitHub] spark issue #22772: [SPARK-24499][SQL][DOC][Followup] Fix some broken links

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22772
  
I think we'd better do some proofreading before doing multiple minor followups.


---




[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22762#discussion_r226377181
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -225,6 +225,63 @@ def sql_conf(self, pairs):
 else:
 self.spark.conf.set(key, old_value)
 
+@contextmanager
+def database(self, *databases):
+"""
+A convenient context manager to test with some specific databases. 
This drops the given
+databases if exist and sets current database to "default" when it 
exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+try:
+yield
+finally:
+for db in databases:
+self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db)
+self.spark.catalog.setCurrentDatabase("default")
+
+@contextmanager
+def table(self, *tables):
+"""
+A convenient context manager to test with some specific tables. 
This drops the given tables
+if exist when it exits.
--- End diff --

wait .. not a big deal but typo: `if exist when it exits.`.


---




[GitHub] spark issue #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmana...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22762
  
Merged to master.


---




[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22762#discussion_r226267706
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -225,6 +225,55 @@ def sql_conf(self, pairs):
 else:
 self.spark.conf.set(key, old_value)
 
+@contextmanager
+def database(self, *databases):
+"""
+A convenient context manager to test with some specific databases. 
This drops the given
+databases if exist and sets current database to "default" when it 
exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+if len(databases) == 1 and isinstance(databases[0], (list, set)):
--- End diff --

like .. an assert by this condition `all(map(lambda d: isinstance(d, str), 
databases))` should be good enough.
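
For reference, a minimal sketch of what that could look like (illustrative only; the class name and assertion message are assumptions, not the merged code):

```
from contextlib import contextmanager

class SQLTestUtilsSketch(object):
    """Hypothetical test mixin; assumes a `spark` attribute holding a SparkSession."""

    @contextmanager
    def database(self, *databases):
        # Accept only plain string names (no list/set special case) and
        # assert that up front, per the suggestion above.
        assert all(map(lambda d: isinstance(d, str), databases)), \
            "database() expects database names as strings"
        try:
            yield
        finally:
            for db in databases:
                self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db)
            self.spark.catalog.setCurrentDatabase("default")
```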


---




[GitHub] spark pull request #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@cont...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22762#discussion_r226264798
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -225,6 +225,55 @@ def sql_conf(self, pairs):
 else:
 self.spark.conf.set(key, old_value)
 
+@contextmanager
+def database(self, *databases):
+"""
+A convenient context manager to test with some specific databases. 
This drops the given
+databases if exist and sets current database to "default" when it 
exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+if len(databases) == 1 and isinstance(databases[0], (list, set)):
--- End diff --

hey @ueshin, how about we allow only the `database(a, b, ...)` case for now?


---




[GitHub] spark issue #22762: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmana...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22762
  
I like it!


---




[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22171
  
Hm, actually I thought this makes sense tho.


---




[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22295
  
Looks close to ready to go.


---




[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22295#discussion_r226184011
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}):
 return Column(jc)
 
 
+@since(3.0)
+def _getActiveSession():
--- End diff --

I mean the function itself .. 


---




[GitHub] spark issue #22761: [MINOR][DOC] Spacing items in migration guide for readab...

2018-10-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22761
  
Of course!


---




[GitHub] zeppelin issue #3034: [WIP] ZEPPELIN-3552. Support Scala 2.12 of SparkInterp...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3034
  
> Spark 2.4 will officially support Scala 2.12, so it will be great if 
Zeppelin will support it together with Spark. And also, there are some libs 
that are Scala 2.12 only

The default distribution will still be with Scala 2.11 for Spark 2.4 if I am not mistaken. It is nice to support it, but Spark 2.4 with 2.11 should be supported first as a higher priority. I can work further on 2.4.0 with 2.12 support after this one is merged.


---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3206


---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3206

[WIP][ZEPPELIN-3810] Support Spark 2.4

### What is this PR for?

Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see SPARK-24418).

There are two problems for this upgrade at Zeppelin side:

1. Some private methods that were accessed by reflection, for instance `loopPostInit`, became inaccessible.

See:
 - 
https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala
 - 
https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala

To work around this, I manually ported `loopPostInit` from 2.11.8 to retain the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 2.11.12 are called inside the new `loopPostInit` by reflection.


2. Upgrading from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. Otherwise, we will hit:
```
Caused by: java.lang.NoSuchMethodError: 

jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
  at 
scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139)
```

To work around this, I upgraded jline from `2.12.1` to `2.14.3`.


### What type of PR is it?
[Improvement]

### Todos
* [ ] - Wait until Spark 2.4.0 is officially released.

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3810

### How should this be tested?

Verified manually against Spark 2.4.0 RC3

### Questions:
* Do the license files need an update? Yes
* Are there breaking changes for older versions? No
* Does this need documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3206


commit e2d224aadf0e54533837a9a89f8e9d3586aee3a9
Author: hyukjinkwon 
Date:   2018-10-17T14:41:29Z

Support Spark 2.4




---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3206




---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3206


---


[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22295#discussion_r226166020
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
 spark.stop()
 
 
+class SparkSessionTests2(unittest.TestCase):
+
+def test_active_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+try:
+activeSession = SparkSession.getActiveSession()
+df = activeSession.createDataFrame([(1, 'Alice')], ['age', 
'name'])
+self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+finally:
+spark.stop()
+
+def test_get_active_session_when_no_active_session(self):
+active = SparkSession.getActiveSession()
+self.assertEqual(active, None)
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+active = SparkSession.getActiveSession()
+self.assertEqual(active, spark)
+spark.stop()
+active = SparkSession.getActiveSession()
+self.assertEqual(active, None)
+
+def test_SparkSession(self):
+spark = SparkSession.builder \
+.master("local") \
+.config("some-config", "v2") \
+.getOrCreate()
+try:
+self.assertEqual(spark.conf.get("some-config"), "v2")
+self.assertEqual(spark.sparkContext._conf.get("some-config"), 
"v2")
+self.assertEqual(spark.version, spark.sparkContext.version)
+spark.sql("CREATE DATABASE test_db")
+spark.catalog.setCurrentDatabase("test_db")
+self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+spark.sql("CREATE TABLE table1 (name STRING, age INT) USING 
parquet")
+self.assertEqual(spark.table("table1").columns, ['name', 
'age'])
+self.assertEqual(spark.range(3).count(), 3)
+finally:
+spark.stop()
+
+def test_global_default_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+try:
+self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+finally:
+spark.stop()
+
+def test_default_and_active_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+activeSession = spark._jvm.SparkSession.getActiveSession()
+defaultSession = spark._jvm.SparkSession.getDefaultSession()
+try:
+self.assertEqual(activeSession, defaultSession)
+finally:
+spark.stop()
+
+def test_config_option_propagated_to_existing_SparkSession(self):
+session1 = SparkSession.builder \
+.master("local") \
+.config("spark-config1", "a") \
+.getOrCreate()
+self.assertEqual(session1.conf.get("spark-config1"), "a")
+session2 = SparkSession.builder \
+.config("spark-config1", "b") \
+.getOrCreate()
+try:
+self.assertEqual(session1, session2)
+self.assertEqual(session1.conf.get("spark-config1"), "b")
+finally:
+session1.stop()
+
+def test_new_session(self):
+session = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+newSession = session.newSession()
+try:
+self.assertNotEqual(session, newSession)
+finally:
+session.stop()
+newSession.stop()
+
+def test_create_new_session_if_old_session_stopped(self):
+session = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+session.stop()
+newSession = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+try:
+self.assertNotEqual(session, newSession)
+finally:
+newSession.stop()
+
+def test_active_session_with_None_and_not_None_context(self):
+from pyspark.context import SparkContext
+from pyspark.conf import SparkConf
+sc = SparkContext._active_spark_co

[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22295#discussion_r226166057
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
 spark.stop()
 
 
+class SparkSessionTests2(unittest.TestCase):
+
+def test_active_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+try:
+activeSession = SparkSession.getActiveSession()
+df = activeSession.createDataFrame([(1, 'Alice')], ['age', 
'name'])
+self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+finally:
+spark.stop()
+
+def test_get_active_session_when_no_active_session(self):
+active = SparkSession.getActiveSession()
+self.assertEqual(active, None)
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+active = SparkSession.getActiveSession()
+self.assertEqual(active, spark)
+spark.stop()
+active = SparkSession.getActiveSession()
+self.assertEqual(active, None)
+
+def test_SparkSession(self):
+spark = SparkSession.builder \
+.master("local") \
+.config("some-config", "v2") \
+.getOrCreate()
+try:
+self.assertEqual(spark.conf.get("some-config"), "v2")
+self.assertEqual(spark.sparkContext._conf.get("some-config"), 
"v2")
+self.assertEqual(spark.version, spark.sparkContext.version)
+spark.sql("CREATE DATABASE test_db")
+spark.catalog.setCurrentDatabase("test_db")
+self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+spark.sql("CREATE TABLE table1 (name STRING, age INT) USING 
parquet")
+self.assertEqual(spark.table("table1").columns, ['name', 
'age'])
+self.assertEqual(spark.range(3).count(), 3)
+finally:
+spark.stop()
+
+def test_global_default_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+try:
+self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+finally:
+spark.stop()
+
+def test_default_and_active_session(self):
+spark = SparkSession.builder \
+.master("local") \
+.getOrCreate()
+activeSession = spark._jvm.SparkSession.getActiveSession()
+defaultSession = spark._jvm.SparkSession.getDefaultSession()
+try:
+self.assertEqual(activeSession, defaultSession)
+finally:
+spark.stop()
+
+def test_config_option_propagated_to_existing_SparkSession(self):
--- End diff --

In the test name just above, let's rename `SparkSession` -> `spark_session`.


---




[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22295#discussion_r226165866
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}):
 return Column(jc)
 
 
+@since(3.0)
+def _getActiveSession():
--- End diff --

eh.. why is it in functions.py?


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21990
  
Merged to master.


---





[GitHub] spark issue #22675: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22675
  
Looks cool otherwise!


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226165068
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,90 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON and JDBC, we also 
provide some specific data source for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory, it 
can load compressed image (jpeg, png, etc.) into raw image representation via 
ImageIO in Java library.
+The loaded DataFrame has one StructType column: "image". containing image 
data stored as image schema.
--- End diff --

Shall we consistently format code terms such as `StructType` as inline code, like `` `StructType` ``?


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226164813
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,90 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
--- End diff --

Should it be `Datasource` or `Data sources`? I am saying this because there looks to be a mismatch with the menu above.


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226164867
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,90 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON and JDBC, we also 
provide some specific data source for ML.
--- End diff --

really personal preference tho .. `like` -> `such as`


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226164476
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,90 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON and JDBC, we also 
provide some specific data source for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory, it 
can load compressed image (jpeg, png, etc.) into raw image representation via 
ImageIO in Java library.
+The loaded DataFrame has one StructType column: "image". containing image 
data stored as image schema.
--- End diff --

`.` -> `,`.


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226164264
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,90 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON and JDBC, we also 
provide some specific data source for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory, it 
can load compressed image (jpeg, png, etc.) into raw image representation via 
ImageIO in Java library.
+The loaded DataFrame has one StructType column: "image". containing image 
data stored as image schema.
+The schema of the `image` column is:
+ - origin: String (represents the file path of the image)
--- End diff --

I would use SQL types consistently, for instance, StringType, IntegerType
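
For reference, a small PySpark sketch to cross-check the documented fields against the runtime schema (assumes a SparkSession `spark` and the sample images under `data/mllib/images`; the printed field list reflects the Spark 2.4-era image schema and is illustrative):

```
df = spark.read.format("image").load("data/mllib/images/origin")
df.printSchema()
# root
#  |-- image: struct (nullable = true)
#  |    |-- origin: string (nullable = true)
#  |    |-- height: integer (nullable = true)
#  |    |-- width: integer (nullable = true)
#  |    |-- nChannels: integer (nullable = true)
#  |    |-- mode: integer (nullable = true)
#  |    |-- data: binary (nullable = true)
```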


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226163959
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,49 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON, JDBC, we also 
provide some specific data source for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory.
+The loaded DataFrame has one StructType column: "image". containing image 
data stored as image schema.
--- End diff --

Where is each field described?


---




[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22675#discussion_r226163638
  
--- Diff: docs/ml-datasource.md ---
@@ -0,0 +1,49 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data source in ML to load data.
+Beside some general data sources like Parquet, CSV, JSON, JDBC, we also 
provide some specific data source for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory.
+The loaded DataFrame has one StructType column: "image". containing image 
data stored as image schema.
+
+
+

+[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+implements a Spark SQL data source API for loading image data as a 
DataFrame.
+
+{% highlight scala %}
+scala> spark.read.format("image").load("data/mllib/images/origin")
res1: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
+{% endhighlight %}
+
+
+

+[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
+implements Spark SQL data source API for loading image data as DataFrame.
+
+{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").load("data/mllib/images/origin");
+{% endhighlight %}
+
+
+
--- End diff --

Shall we add an example for R as well then? It wouldn't be too difficult to add the equivalent examples. Also, I don't think we would add the equivalent examples in different languages on different pages.


---




[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r226149299
  
--- Diff: sql/core/src/test/resources/test-data/cars-crlf.csv ---
@@ -0,0 +1,7 @@
+
+year,make,model,comment,blank
+"2012","Tesla","S","No comment",
+
+1997,Ford,E350,"Go get one now they are going fast",
+2015,Chevy,Volt
+
--- End diff --

just for clarification, this file has newline variant combinations, right?
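
As a sanity check, a small PySpark sketch of that scenario (hypothetical file path; assumes a SparkSession `spark` and the line-separator auto-detection this PR enables):

```
# Write a tiny CSV whose records mix \n and \r\n line endings, then read
# it back; with auto-detection both variants should parse as row breaks.
with open("/tmp/cars-mixed-newlines.csv", "wb") as f:
    f.write(b"year,make\r\n2012,Tesla\n1997,Ford\r\n")

df = spark.read.csv("/tmp/cars-mixed-newlines.csv", header=True)
df.show()  # expect two rows: (2012, Tesla) and (1997, Ford)
```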


---




[GitHub] spark pull request #22761: [MINOR][DOC] Spacing items in migration guide for...

2018-10-17 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22761

[MINOR][DOC] Spacing items in migration guide for readability and 
consistency

## What changes were proposed in this pull request?

Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already have spaces between them in the migration guide. This PR suggests formatting them consistently for readability.

Before:

![screen shot 2018-10-18 at 9 49 12 
am](https://user-images.githubusercontent.com/6477701/47126706-4c43da80-d2bc-11e8-966f-ad804ab23174.png)

After:

![screen shot 2018-10-18 at 9 53 55 
am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png)

## How was this patch tested?

Manually tested:



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-migration-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22761.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22761


commit e1d6a0a9985d0351e4dcc2f868141a079694432d
Author: hyukjinkwon 
Date:   2018-10-18T01:56:16Z

Format migration guide for readability




---




[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r226145444
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1978,6 +1978,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. 
In version 2.3 and earlier, empty strings are equal to `null` values and do not 
reflect to any characters in saved CSV files. For example, the row of `"a", 
null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as 
`a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to 
empty (not quoted) string.  
   - Since Spark 2.4, The LOAD DATA command supports wildcard `?` and `*`, 
which match any one character, and zero or more characters, respectively. 
Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH 
'/tmp/part-?'`. Special Characters like `space` also now work in paths. 
Example: `LOAD DATA INPATH '/tmp/folder name/'`.
   - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated 
as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as 
`SELECT 1 FROM range(10) WHERE true`  and returns 10 rows. This violates SQL 
standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without 
GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) 
HAVING true` will return only one row. To restore the previous behavior, set 
`spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`.
+  - Since Spark 2.4, use of the method `def udf(f: AnyRef, dataType: 
DataType): UserDefinedFunction` should not expect any automatic null handling 
of the input parameters, thus a null input of a Scala primitive type will be 
converted to the type's corresponding default value in the UDF. All other UDF 
declaration and registration methods remain the same behavior as before.
--- End diff --

Is this targeted to be backported to 2.4?


---




[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r226145051
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -393,4 +393,30 @@ class UDFSuite extends QueryTest with SharedSQLContext 
{
   checkAnswer(df, Seq(Row("12"), Row("24"), Row("3null"), Row(null)))
 }
   }
+
+  test("SPARK-25044 Verify null input handling for primitive types - with 
udf()") {
+val udf1 = udf({(x: Long, y: Any) => x * 2 + (if (y == null) 1 else 
0)})
--- End diff --

nit `udf((x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0))`


---




[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r226145150
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -393,4 +393,30 @@ class UDFSuite extends QueryTest with SharedSQLContext 
{
   checkAnswer(df, Seq(Row("12"), Row("24"), Row("3null"), Row(null)))
 }
   }
+
+  test("SPARK-25044 Verify null input handling for primitive types - with 
udf()") {
+val udf1 = udf({(x: Long, y: Any) => x * 2 + (if (y == null) 1 else 
0)})
+val df = spark.range(0, 3).toDF("a")
+  .withColumn("b", udf1($"a", lit(null)))
+  .withColumn("c", udf1(lit(null), $"a"))
+
+checkAnswer(
+  df,
+  Seq(
+Row(0, 1, null),
+Row(1, 3, null),
+Row(2, 5, null)))
+  }
+
+  test("SPARK-25044 Verify null input handling for primitive types - with 
udf.register") {
+withTable("t") {
+  sql("create table t(a varchar(10), b int, c varchar(10)) using 
parquet")
--- End diff --

nit: upper case for keywords


---




[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r226144670
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1951,7 +1951,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.
   
   
-Users can use explict cast
+Users can use explicit cast
--- End diff --

There's the same `explict` misspelling right above, btw.


---




[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3206


---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3206




---


[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21990
  
retest this please


---




[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
oops. I haven't. Will check that too while I am here.

BTW, my understanding is that we need this one as well since Spark can still be compiled against Scala 2.11.x. Am I on the right track?


---


[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
This is a WIP. We should wait for Spark 2.4.0.

cc @zjffdu and @felixcheung


---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-17 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/zeppelin/pull/3206




---


[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20503
  
Looks good otherwise.


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20503#discussion_r225846867
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1581,7 +1581,7 @@ def __repr__(self):
 return "Row(%s)" % ", ".join("%s=%r" % (k, v)
  for k, v in zip(self.__fields__, 
tuple(self)))
 else:
-return "<Row(%s)>" % ", ".join(self)
+return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)
--- End diff --

`"%s" % (fields) for fields in self` -> `"%s" % field for field in self`


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20503#discussion_r225846669
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1581,7 +1581,7 @@ def __repr__(self):
 return "Row(%s)" % ", ".join("%s=%r" % (k, v)
  for k, v in zip(self.__fields__, 
tuple(self)))
 else:
-return "<Row(%s)>" % ", ".join(self)
+return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)
--- End diff --

nit fields => field


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20503#discussion_r225846584
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1581,7 +1581,7 @@ def __repr__(self):
 return "Row(%s)" % ", ".join("%s=%r" % (k, v)
  for k, v in zip(self.__fields__, 
tuple(self)))
 else:
-return "<Row(%s)>" % ", ".join(self)
+return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)
--- End diff --

nit `"%s" % (fields) for fields in self` -> `"%s" % fields for fields in 
self`


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20503#discussion_r225846381
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -234,6 +234,10 @@ def test_empty_row(self):
 row = Row()
 self.assertEqual(len(row), 0)
 
+def test_row_without_column_name(self):
+row = Row("Alice", 11)
+self.assertEqual(row.__repr__(), "<Row(Alice, 11)>")
--- End diff --

I would test non-ASCII characters as well.
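
For instance, a hypothetical test along those lines (name and assertion are assumptions; under Python 3 this exercises `__repr__` with non-ASCII values):

```
# -*- coding: utf-8 -*-
from pyspark.sql import Row

def test_row_without_column_name_non_ascii():
    # repr() should neither raise on non-ASCII values nor drop them.
    row = Row(u"아이리스", 11)
    assert u"아이리스" in repr(row)
```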


---




[GitHub] spark pull request #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviou...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20503#discussion_r225844655
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -234,6 +234,10 @@ def test_empty_row(self):
 row = Row()
 self.assertEqual(len(row), 0)
 
+def test_row_without_column_name(self):
+row = Row("Alice", 11)
--- End diff --

Can we add a doctest for this usage (Row as an object, not as a namedtuple class), and documentation in `Row` at `types.py`?
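
A sketch of what such a doctest could look like in `types.py` (hypothetical; the shown output assumes the patched `__repr__`):

```
def _row_as_object_example():
    """
    Row can also hold positional values without field names:

    >>> from pyspark.sql import Row
    >>> Row("Alice", 11)
    <Row(Alice, 11)>
    """
```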


---




[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22646#discussion_r225820193
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext {
 })
 }
 }
-def createConverter(cls: Class[_], dataType: DataType): Any => Any = 
dataType match {
-  case struct: StructType => createStructConverter(cls, 
struct.map(_.dataType))
-  case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+def createConverter(t: Type, dataType: DataType): Any => Any = (t, 
dataType) match {
+  case (cls: Class[_], struct: StructType) =>
--- End diff --

Hm, how about we fix them together while we are here?

I also checked another difference, beans without a getter and/or setter, but I think this is something we should fix in 3.0.


---




[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22646#discussion_r225819586
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext {
 })
 }
 }
-def createConverter(cls: Class[_], dataType: DataType): Any => Any = 
dataType match {
-  case struct: StructType => createStructConverter(cls, 
struct.map(_.dataType))
-  case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+def createConverter(t: Type, dataType: DataType): Any => Any = (t, 
dataType) match {
--- End diff --

Yea .. was just thinking of moving this func there .. it looks ugly that this file is getting long.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225819012
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in 
their SparkConf.
+   * This class will be applied to the extensions passed into this 
function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: 
SparkSessionExtensions) {
--- End diff --

Actually either way looks okay.


---




[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-10-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r225818067
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -1136,4 +1121,27 @@ object SparkSession extends Logging {
   SparkSession.clearDefaultSession()
 }
   }
+
+  /**
+   * Initialize extensions if the user has defined a configurator class in 
their SparkConf.
+   * This class will be applied to the extensions passed into this 
function.
+   */
+  private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: 
SparkSessionExtensions) {
--- End diff --

Eh .. I think it's okay to have a function that returns the updated extensions.


---




[GitHub] spark issue #22694: [SQL][CATALYST][MINOR] update some error comments

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22694
  
Merged to master and branch-2.4.


---




[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22503
  
@justinuang, okay. Mind rebasing this please?


---




[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22295
  
@huaxingao, thanks for addressing comments. Would you mind rebasing it and 
resolving the conflicts?


---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
@rxin and @gatorsmile, WDYT?

I already had to argue about Hadoop 3 support here and there (for instance 
see [SPARK-18112|https://issues.apache.org/jira/browse/SPARK-18112] and 
[SPARK-18673|https://issues.apache.org/jira/browse/SPARK-18673]), and explain 
what's going on.

Ideally, it looks like we should go ahead with option 2 (https://github.com/apache/spark/pull/21588#issuecomment-429272279) if I am not mistaken. If there are more concerns we should address before going ahead, I am definitely willing to help investigate as well.


---




[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Woah .. let me resolve the conflicts tonight.


---




[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
Thanks all!


---




[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
Merged to master.


---




[GitHub] incubator-livy issue #123: [MINOR] There is some typo in PULL_REQUEST_TEMPLA...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/incubator-livy/pull/123
  
+1


---


[GitHub] spark issue #22730: [SPARK-16775][CORE] Remove deprecated accumulator v1 API...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22730
  
yup, PySpark uses accumulator V2 per 
https://github.com/apache/spark/commit/90d5754212425d55f992c939a2bc7d9ac6ef92b8


---




[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22597
  
Merged to master and branch-2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22694: [SQL][CATALYST][MINOR] update some error comments

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22694
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22597
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22728: [SPARK-25736][SQL][TEST] add tests to verify the behavio...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22728
  
Merged to master and branch-2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22597
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22636: [SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pu...

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22636
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r225393373
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -39,29 +40,29 @@ import org.apache.spark.sql.types.DataType
  * @param nullable  True if the UDF can return null value.
  * @param udfDeterministic  True if the UDF is deterministic. 
Deterministic UDF returns same result
  *  each time it is invoked with a particular 
input.
- * @param nullableTypes which of the inputTypes are nullable (i.e. not 
primitive)
  */
 case class ScalaUDF(
 function: AnyRef,
 dataType: DataType,
 children: Seq[Expression],
+handleNullForInputs: Seq[Boolean],
--- End diff --

Adding `handleNullForInputs` doesn't look like it reduces confusion much to me. 
Since this PR targets only refactoring, mainly to reduce confusion and ease of 
use, this concern should be addressed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22732: [SPARK-25044][FOLLOW-UP] Change ScalaUDF construc...

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22732#discussion_r225393220
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -39,29 +40,29 @@ import org.apache.spark.sql.types.DataType
  * @param nullable  True if the UDF can return null value.
  * @param udfDeterministic  True if the UDF is deterministic. 
Deterministic UDF returns same result
  *  each time it is invoked with a particular 
input.
- * @param nullableTypes which of the inputTypes are nullable (i.e. not 
primitive)
  */
 case class ScalaUDF(
 function: AnyRef,
 dataType: DataType,
 children: Seq[Expression],
+handleNullForInputs: Seq[Boolean],
--- End diff --

Maybe I missed something but:

1. Why don't we just merge `handleNullForInputs` and `inputTypes`? (A 
hypothetical sketch follows below.)
2. Why is `handleNullForInputs` required whereas `inputTypes` defaults to `Nil`?
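
For illustration only, one hypothetical shape for 1. (the names here are made 
up, not from the PR):

```scala
import org.apache.spark.sql.types.AbstractDataType

// Hypothetical: one Seq carrying the expected type together with whether a
// null check is needed, instead of two parallel Seqs that must stay in sync.
case class InputSpec(dataType: Option[AbstractDataType], nullSafe: Boolean)

// e.g. inputTypes = Nil with handleNullForInputs = Seq(true, false) becomes:
val specs = Seq(InputSpec(None, nullSafe = true), InputSpec(None, nullSafe = false))
```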


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
ping @wangyum, if you're willing to make progress on this, please provide some 
input here and/or in the JIRA.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r225195159
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.csv._
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Converts a CSV input string to a [[StructType]] with the specified 
schema.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with 
the given `csvStr` and `schema`.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE');
+   {"a":1, "b":0.8}
+  > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
+   {"time":2015-08-26 00:00:00.0}
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class CsvToStructs(
+schema: StructType,
+options: Map[String, String],
+child: Expression,
+timeZoneId: Option[String] = None)
+  extends UnaryExpression
+with TimeZoneAwareExpression
+with CodegenFallback
+with ExpectsInputTypes
+with NullIntolerant {
+
+  override def nullable: Boolean = child.nullable
+
+  // The CSV input data might be missing certain fields. We force the 
nullability
+  // of the user-provided schema to avoid data corruptions.
+  val nullableSchema: StructType = schema.asNullable
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression, schema: Expression, options: Map[String, 
String]) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = options,
+  child = child,
+  timeZoneId = None)
+
+  def this(child: Expression, schema: Expression) = this(child, schema, 
Map.empty[String, String])
+
+  def this(child: Expression, schema: Expression, options: Expression) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = (rows: Iterator[InternalRow]) => {
+if (rows.hasNext) {
+  rows.next()
--- End diff --

of course!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/incubator-livy/pull/121#discussion_r225194172
  
--- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java ---
@@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception {
 
   @Test
   public void testSparkSQLJob() throws Exception {
-runTest(true, new TestFunction() {
+runTest(true, false, new TestFunction() {
--- End diff --

This also particularly happens when the session that created the Hive client is 
restarted with Hive support enabled multiple times(?). It can be reproduced 
locally as well. I am still digging into the root cause, but so far the scope is 
limited to this test only.


---


[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
Sorry, I missed that comment. I replied there.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r225190659
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -777,7 +777,6 @@ case class SchemaOfJson(
 }
 
 object JsonExprUtils {
-
   def evalSchemaExpr(exp: Expression): DataType = exp match {
--- End diff --

I was following @MaxGekk's opinion 
(https://github.com/apache/spark/pull/22379/files/93d094f45b02afc0ab2f0650bbde1513823471a2#r224846183).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r225190224
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.csv._
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Converts a CSV input string to a [[StructType]] with the specified 
schema.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with 
the given `csvStr` and `schema`.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE');
+   {"a":1, "b":0.8}
+  > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
+   {"time":2015-08-26 00:00:00.0}
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class CsvToStructs(
+schema: StructType,
+options: Map[String, String],
+child: Expression,
+timeZoneId: Option[String] = None)
+  extends UnaryExpression
+with TimeZoneAwareExpression
+with CodegenFallback
+with ExpectsInputTypes
+with NullIntolerant {
+
+  override def nullable: Boolean = child.nullable
+
+  // The CSV input data might be missing certain fields. We force the 
nullability
+  // of the user-provided schema to avoid data corruptions.
+  val nullableSchema: StructType = schema.asNullable
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression, schema: Expression, options: Map[String, 
String]) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = options,
+  child = child,
+  timeZoneId = None)
+
+  def this(child: Expression, schema: Expression) = this(child, schema, 
Map.empty[String, String])
+
+  def this(child: Expression, schema: Expression, options: Expression) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = (rows: Iterator[InternalRow]) => {
+if (rows.hasNext) {
+  rows.next()
--- End diff --

It looks like there can't be more than one row in this CSV code path specifically.

See below:

```scala
val rawParser = new UnivocityParser(actualSchema, actualSchema, 
parsedOptions)
new FailureSafeParser[String](
  input => Seq(rawParser.parse(input)),
  mode,
  nullableSchema,
  parsedOptions.columnNameOfCorruptRecord,
  parsedOptions.multiLine)
```

Univocity parser:

```scala
  def parse(input: String): InternalRow = 
convert(tokenizer.parseLine(input))
```

and in the `FailureSafeParser`

```scala
class FailureSafeParser[IN](
rawParser: IN => Seq[InternalRow],
mode: ParseMode,
schema: StructType,
columnNameOfCorruptRecord: String,
isMultiLine: Boolean) {
```

```scala
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      if (skipParsing) {
        Iterator.single(InternalRow.empty)
      } else {
        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
      }
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw new SparkException("Malformed records are detected in record parsing. " +
            s"Parse Mode: ${FailFastMode.name}.", e)
      }
    }
  }
```

So `parse` yields at most one row per input record here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r225186492
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.csv._
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Converts a CSV input string to a [[StructType]] with the specified 
schema.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(csvStr, schema[, options]) - Returns a struct value with 
the given `csvStr` and `schema`.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('1, 0.8', 'a INT, b DOUBLE');
+   {"a":1, "b":0.8}
+  > SELECT _FUNC_('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))
+   {"time":2015-08-26 00:00:00.0}
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class CsvToStructs(
+schema: StructType,
+options: Map[String, String],
+child: Expression,
+timeZoneId: Option[String] = None)
+  extends UnaryExpression
+with TimeZoneAwareExpression
+with CodegenFallback
+with ExpectsInputTypes
+with NullIntolerant {
+
+  override def nullable: Boolean = child.nullable
+
+  // The CSV input data might be missing certain fields. We force the 
nullability
+  // of the user-provided schema to avoid data corruptions.
+  val nullableSchema: StructType = schema.asNullable
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression, schema: Expression, options: Map[String, 
String]) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = options,
+  child = child,
+  timeZoneId = None)
+
+  def this(child: Expression, schema: Expression) = this(child, schema, 
Map.empty[String, String])
+
+  def this(child: Expression, schema: Expression, options: Expression) =
+this(
+  schema = ExprUtils.evalSchemaExpr(schema),
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = (rows: Iterator[InternalRow]) => {
+if (rows.hasNext) {
+  rows.next()
--- End diff --

Oops, I missed it. Let me check.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/incubator-livy/pull/121
  
I will update after 2.4.0 is officially released.


---


[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/incubator-livy/pull/121
  
cc @vanzin, @mgaido91 (you're already here tho :-) ) and @jerryshao.


---


[GitHub] incubator-livy pull request #121: [WIP][LIVY-518][BUILD] Support Spark 2.4

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/incubator-livy/pull/121#discussion_r225185417
  
--- Diff: rsc/src/test/java/org/apache/livy/rsc/TestSparkClient.java ---
@@ -271,7 +275,7 @@ public void call(LivyClient client) throws Exception {
 
   @Test
   public void testSparkSQLJob() throws Exception {
-runTest(true, new TestFunction() {
+runTest(true, false, new TestFunction() {
--- End diff --

I am running Travis CI against RC3 
(https://travis-ci.org/HyukjinKwon/incubator-livy/builds/441687251)


---


[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
While addressing the review comments, I also reviewed the change at the same 
time. It looks pretty good to go.

For https://github.com/apache/spark/pull/22379#issuecomment-429738031, I guess 
we can add the string variant later on the Scala side ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
I wouldn't either, but there's no way to use `schema_of_csv` otherwise ..




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22379
  
Like .. 

```
val csvData = "a,1,2"
from_csv(column, schema = schema_of_csv(lit(csvData)))
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r225030219
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3854,6 +3854,38 @@ object functions {
   @scala.annotation.varargs
   def map_concat(cols: Column*): Column = withExpr { 
MapConcat(cols.map(_.expr)) }
 
+  /**
+   * Parses a column containing a CSV string into a `StructType` with the 
specified schema.
+   * Returns `null`, in the case of an unparseable string.
+   *
+   * @param e a string column containing CSV data.
+   * @param schema the schema to use when parsing the CSV string
+   * @param options options to control how the CSV is parsed. accepts the 
same options and the
+   *CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def from_csv(e: Column, schema: StructType, options: Map[String, 
String]): Column = withExpr {
+CsvToStructs(schema, options, e.expr)
+  }
+
+  /**
+   * (Java-specific) Parses a column containing a CSV string into a 
`StructType`
+   * with the specified schema. Returns `null`, in the case of an 
unparseable string.
+   *
+   * @param e a string column containing CSV data.
+   * @param schema the schema to use when parsing the CSV string
+   * @param options options to control how the CSV is parsed. accepts the 
same options and the
+   *CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def from_csv(e: Column, schema: String, options: java.util.Map[String, 
String]): Column = {
--- End diff --

Let me address this one tonight.
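
For reference, a usage sketch of the Scala overload quoted above (this assumes a 
`SparkSession` named `spark`, and just the signatures shown in this diff):

```scala
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types._
import spark.implicits._

// Parse a CSV string column with an explicit schema and no extra options.
val schema = new StructType().add("a", IntegerType).add("b", DoubleType)
val df = Seq("1,0.8").toDF("value")
df.select(from_csv($"value", schema, Map.empty[String, String])).show()
```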


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/incubator-livy/pull/121
  
2.4 is not officially released yet. I am manually verifying this locally.


---


[GitHub] spark pull request #22697: [SPARK-25700][SQL][BRANCH-2.4] Partially revert a...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/22697


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22597#discussion_r225025622
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25579 ORC PPD should support column names with dot") {
+import testImplicits._
+
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+  withTempDir { dir =>
+val path = new File(dir, "orc").getCanonicalPath
+Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
+val df = spark.read.orc(path).where("`col.dot.1` = 1 and 
`col.dot.2` = 2")
+checkAnswer(stripSparkFilter(df), Row(1, 2))
--- End diff --

Yup.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22597#discussion_r225024900
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25579 ORC PPD should support column names with dot") {
+import testImplicits._
+
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+  withTempDir { dir =>
+val path = new File(dir, "orc").getCanonicalPath
+Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
--- End diff --

How about explicitly repartitioning to create separate output files?
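
For example, something like this (a sketch only; the partition count is 
arbitrary, and `path` is the one from the quoted test):

```scala
// Force two output files so the pushed-down filter can actually prune one.
Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2")
  .repartition(2)
  .write.orc(path)
```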


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22646#discussion_r225021394
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext {
 })
 }
 }
-def createConverter(cls: Class[_], dataType: DataType): Any => Any = 
dataType match {
-  case struct: StructType => createStructConverter(cls, 
struct.map(_.dataType))
-  case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+def createConverter(t: Type, dataType: DataType): Any => Any = (t, 
dataType) match {
+  case (cls: Class[_], struct: StructType) =>
--- End diff --

wait .. can we reuse `JavaTypeInference.serializerFor` and make a projection, 
rather than reimplementing the whole logic here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22646#discussion_r225021732
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext {
 })
 }
 }
-def createConverter(cls: Class[_], dataType: DataType): Any => Any = 
dataType match {
-  case struct: StructType => createStructConverter(cls, 
struct.map(_.dataType))
-  case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+def createConverter(t: Type, dataType: DataType): Any => Any = (t, 
dataType) match {
+  case (cls: Class[_], struct: StructType) =>
--- End diff --


https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L131-L132

We should drop the support for properties that have only a getter or only a 
setter. Adding @cloud-fan here as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22646#discussion_r225016843
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext {
 })
 }
 }
-def createConverter(cls: Class[_], dataType: DataType): Any => Any = 
dataType match {
-  case struct: StructType => createStructConverter(cls, 
struct.map(_.dataType))
-  case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+def createConverter(t: Type, dataType: DataType): Any => Any = (t, 
dataType) match {
--- End diff --

BTW, how about we put this method in `CatalystTypeConverters`? It looks like a 
Catalyst converter for beans. A few Java types like `java.lang.Iterable`, 
`java.math.BigDecimal` and `java.math.BigInteger` are already handled there.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22527: [SPARK-17952][SQL] Nested Java beans support in createDa...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22527
  
Looks good to me too!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22527: [SPARK-17952][SQL] Nested Java beans support in createDa...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22527
  
Hey @ueshin, sorry I was late. I'm a bit busy these couple of weeks, so please 
don't be blocked by me; feel free to ignore me. Thank you for asking me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22597#discussion_r225014099
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25579 ORC PPD should support column names with dot") {
+import testImplicits._
--- End diff --

Okay, one end-to-end test should be enough.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22379#discussion_r224985980
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3854,6 +3854,38 @@ object functions {
   @scala.annotation.varargs
   def map_concat(cols: Column*): Column = withExpr { 
MapConcat(cols.map(_.expr)) }
 
+  /**
+   * Parses a column containing a CSV string into a `StructType` with the 
specified schema.
+   * Returns `null`, in the case of an unparseable string.
+   *
+   * @param e a string column containing CSV data.
+   * @param schema the schema to use when parsing the CSV string
+   * @param options options to control how the CSV is parsed. accepts the 
same options and the
+   *CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def from_csv(e: Column, schema: StructType, options: Map[String, 
String]): Column = withExpr {
+CsvToStructs(schema, options, e.expr)
+  }
+
+  /**
+   * (Java-specific) Parses a column containing a CSV string into a 
`StructType`
+   * with the specified schema. Returns `null`, in the case of an 
unparseable string.
+   *
+   * @param e a string column containing CSV data.
+   * @param schema the schema to use when parsing the CSV string
+   * @param options options to control how the CSV is parsed. accepts the 
same options and the
+   *CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def from_csv(e: Column, schema: String, options: java.util.Map[String, 
String]): Column = {
--- End diff --

Eh, I wasn't following either. Is the problem related to the function's 
parameters? We can just define `characterOrColumn` and use it. We can do 
something like:

```
if (is.character(schema)) {
  schema <- lit(schema)
} else if (!is(schema, "Column")) {
  stop("schema should be a character or a Column")
}
```

like you did on the Python side.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22219: [SPARK-25224][SQL] Improvement of Spark SQL ThriftServer...

2018-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22219
  
Nope, not yet. It needs some more review iterations.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


