svn commit: r30126 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_18_00_02-c3eaee7-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Thu Oct 18 07:17:18 2018
New Revision: 30126

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_18_00_02-c3eaee7 docs


[This commit notification would consist of 1478 parts, 
which exceeds the limit of 50, so it was shortened to this summary.]




spark git commit: [SPARK-24601][FOLLOWUP] Update Jackson to 2.9.6 in Kinesis

2018-10-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master c3eaee776 -> 734c6af0d


[SPARK-24601][FOLLOWUP] Update Jackson to 2.9.6 in Kinesis

## What changes were proposed in this pull request?

Also update the Kinesis SDK's Jackson dependency to match Spark's version.

## How was this patch tested?

Existing tests, including the Kinesis ones, which should be triggered by this change.
This was uncovered, I believe, in
https://github.com/apache/spark/pull/22729#issuecomment-430666080

Closes #22757 from srowen/SPARK-24601.2.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/734c6af0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/734c6af0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/734c6af0

Branch: refs/heads/master
Commit: 734c6af0dde82310d7ca1c586935274fe661d040
Parents: c3eaee7
Author: Sean Owen 
Authored: Thu Oct 18 07:00:00 2018 -0500
Committer: Sean Owen 
Committed: Thu Oct 18 07:00:00 2018 -0500

--
 external/kinesis-asl/pom.xml | 7 +++
 1 file changed, 7 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/734c6af0/external/kinesis-asl/pom.xml
--
diff --git a/external/kinesis-asl/pom.xml b/external/kinesis-asl/pom.xml
index 032aca9..0aef253 100644
--- a/external/kinesis-asl/pom.xml
+++ b/external/kinesis-asl/pom.xml
@@ -69,6 +69,13 @@
   ${aws.kinesis.producer.version}
   test
 
+
+
+  com.fasterxml.jackson.dataformat
+  jackson-dataformat-cbor
+  ${fasterxml.jackson.version}
+
 
   org.mockito
   mockito-core





svn commit: r30132 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_18_08_02-734c6af-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Thu Oct 18 15:16:52 2018
New Revision: 30132

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_18_08_02-734c6af docs


[This commit notification would consist of 1478 parts, 
which exceeds the limit of 50, so it was shortened to this summary.]




spark git commit: [SPARK-25760][SQL] Set AddJarCommand return empty

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/master 734c6af0d -> 1117fc35f


[SPARK-25760][SQL] Set AddJarCommand return empty

## What changes were proposed in this pull request?

Currently `AddJarCommand` returns `0`, which can confuse users about what the value means.
This PR changes it to return an empty result.

```sql
spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
0
spark-sql>
```

## How was this patch tested?

manual tests
```sql
spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
spark-sql>
```
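
For context, a minimal PySpark sketch of the new behaviour (the jar path and session setup are illustrative assumptions, not part of this patch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# After this change, ADD JAR returns an empty result set
# instead of a single row containing 0.
result = spark.sql("ADD JAR /tmp/TestUDTF.jar")  # hypothetical jar path
assert result.collect() == []
```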

Closes #22747 from wangyum/AddJarCommand.

Authored-by: Yuming Wang 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1117fc35
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1117fc35
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1117fc35

Branch: refs/heads/master
Commit: 1117fc35ff11ecc2873b4ec095ad243e8dcb5675
Parents: 734c6af
Author: Yuming Wang 
Authored: Thu Oct 18 09:19:42 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 09:19:42 2018 -0700

--
 .../scala/org/apache/spark/sql/execution/command/resources.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1117fc35/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
index 2e859cf..8fee02a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala
@@ -38,7 +38,7 @@ case class AddJarCommand(path: String) extends 
RunnableCommand {
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
 sparkSession.sessionState.resourceLoader.addJar(path)
-Seq(Row(0))
+Seq.empty[Row]
   }
 }
 





spark git commit: [SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up after each test.

2018-10-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1117fc35f -> e80f18dbd


[SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up after each test.

## What changes were proposed in this pull request?

Currently, the tests in PySpark's `SQLTests` are not cleaned up properly.
This introduces and uses more `contextmanager` helpers so that each test can clean up
its context reliably.
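
As an illustration of the pattern (a sketch, not the exact helpers added to `SQLTestUtils` in this patch; the `spark` session and table name are assumptions):

```python
from contextlib import contextmanager

@contextmanager
def table(spark, *tables):
    """Drop the given tables when the block exits, even if the test body fails."""
    try:
        yield
    finally:
        for t in tables:
            spark.sql("DROP TABLE IF EXISTS %s" % t)

# Hypothetical usage inside a test:
# with table(spark, "pyspark_bucket"):
#     spark.range(10).write.saveAsTable("pyspark_bucket")
```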

## How was this patch tested?

Modified tests.

Closes #22762 from ueshin/issues/SPARK-25763/cleanup_sqltests.

Authored-by: Takuya UESHIN 
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e80f18db
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e80f18db
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e80f18db

Branch: refs/heads/master
Commit: e80f18dbd8bc4c2aca9ba6dd487b50e95c55d2e6
Parents: 1117fc3
Author: Takuya UESHIN 
Authored: Fri Oct 19 00:31:01 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Oct 19 00:31:01 2018 +0800

--
 python/pyspark/sql/tests.py | 556 ++-
 1 file changed, 318 insertions(+), 238 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e80f18db/python/pyspark/sql/tests.py
--
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 8065d82..82dc5a6 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -225,6 +225,63 @@ class SQLTestUtils(object):
 else:
 self.spark.conf.set(key, old_value)
 
+@contextmanager
+def database(self, *databases):
+"""
+A convenient context manager to test with some specific databases. 
This drops the given
+databases if exist and sets current database to "default" when it 
exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+try:
+yield
+finally:
+for db in databases:
+self.spark.sql("DROP DATABASE IF EXISTS %s CASCADE" % db)
+self.spark.catalog.setCurrentDatabase("default")
+
+@contextmanager
+def table(self, *tables):
+"""
+A convenient context manager to test with some specific tables. This 
drops the given tables
+if exist when it exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+try:
+yield
+finally:
+for t in tables:
+self.spark.sql("DROP TABLE IF EXISTS %s" % t)
+
+@contextmanager
+def tempView(self, *views):
+"""
+A convenient context manager to test with some specific views. This 
drops the given views
+if exist when it exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+try:
+yield
+finally:
+for v in views:
+self.spark.catalog.dropTempView(v)
+
+@contextmanager
+def function(self, *functions):
+"""
+A convenient context manager to test with some specific functions. 
This drops the given
+functions if exist when it exits.
+"""
+assert hasattr(self, "spark"), "it should have 'spark' attribute, 
having a spark session."
+
+try:
+yield
+finally:
+for f in functions:
+self.spark.sql("DROP FUNCTION IF EXISTS %s" % f)
+
 
 class ReusedSQLTestCase(ReusedPySparkTestCase, SQLTestUtils):
 @classmethod
@@ -332,6 +389,7 @@ class SQLTests(ReusedSQLTestCase):
 @classmethod
 def setUpClass(cls):
 ReusedSQLTestCase.setUpClass()
+cls.spark.catalog._reset()
 cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
 os.unlink(cls.tempdir.name)
 cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
@@ -347,12 +405,6 @@ class SQLTests(ReusedSQLTestCase):
 sqlContext2 = SQLContext(self.sc)
 self.assertTrue(sqlContext1.sparkSession is sqlContext2.sparkSession)
 
-def tearDown(self):
-super(SQLTests, self).tearDown()
-
-# tear down test_bucketed_write state
-self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
-
 def test_row_should_be_read_only(self):
 row = Row(a=1, b=2)
 self.assertEqual(1, row.a)
@@ -473,11 +525,12 @@ class SQLTests(ReusedSQLTestCase):
 self.assertEqual(row[0], 4)
 
 def test_udf2(self):
-self.spark.catalog.registerFunction("strlen", lambda string: 
len(string), IntegerType())
-self.spark.createDataFrame(self.sc.parallelize([Row(a="test")]))\
-.createOrReplaceTempVie

spark git commit: [SPARK-25682][K8S] Package example jars in same target for dev and distro images.

2018-10-18 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master e80f18dbd -> 15524c41b


[SPARK-25682][K8S] Package example jars in same target for dev and distro 
images.

This way the image generated from both environments has the same layout,
with just a difference in contents that should not affect functionality.

Also added some minor error checking to the image script.

Closes #22681 from vanzin/SPARK-25682.

Authored-by: Marcelo Vanzin 
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/15524c41
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/15524c41
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/15524c41

Branch: refs/heads/master
Commit: 15524c41b27697c478981cb2df7d5d7df02f3ba4
Parents: e80f18d
Author: Marcelo Vanzin 
Authored: Thu Oct 18 10:21:37 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu Oct 18 10:21:37 2018 -0700

--
 bin/docker-image-tool.sh| 16 
 .../docker/src/main/dockerfiles/spark/Dockerfile|  5 -
 2 files changed, 20 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/15524c41/bin/docker-image-tool.sh
--
diff --git a/bin/docker-image-tool.sh b/bin/docker-image-tool.sh
index 228494d..f17791a 100755
--- a/bin/docker-image-tool.sh
+++ b/bin/docker-image-tool.sh
@@ -47,6 +47,11 @@ function build {
 
   if [ ! -f "$SPARK_HOME/RELEASE" ]; then
 # Set image build arguments accordingly if this is a source repo and not a 
distribution archive.
+#
+# Note that this will copy all of the example jars directory into the 
image, and that will
+# contain a lot of duplicated jars with the main Spark directory. In a 
proper distribution,
+# the examples directory is cleaned up before generating the distribution 
tarball, so this
+# issue does not occur.
 IMG_PATH=resource-managers/kubernetes/docker/src/main/dockerfiles
 BUILD_ARGS=(
   ${BUILD_PARAMS}
@@ -55,6 +60,8 @@ function build {
   --build-arg
   spark_jars=assembly/target/scala-$SPARK_SCALA_VERSION/jars
   --build-arg
+  example_jars=examples/target/scala-$SPARK_SCALA_VERSION/jars
+  --build-arg
   k8s_tests=resource-managers/kubernetes/integration-tests/tests
 )
   else
@@ -78,14 +85,23 @@ function build {
   docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
 -t $(image_ref spark) \
 -f "$BASEDOCKERFILE" .
+  if [[ $? != 0 ]]; then
+error "Failed to build Spark docker image."
+  fi
 
   docker build $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" \
 -t $(image_ref spark-py) \
 -f "$PYDOCKERFILE" .
+  if [[ $? != 0 ]]; then
+error "Failed to build PySpark docker image."
+  fi
 
   docker build $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" \
 -t $(image_ref spark-r) \
 -f "$RDOCKERFILE" .
+  if [[ $? != 0 ]]; then
+error "Failed to build SparkR docker image."
+  fi
 }
 
 function push {

http://git-wip-us.apache.org/repos/asf/spark/blob/15524c41/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
--
diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
index 4bada0d..5f469c3 100644
--- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
+++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@@ -18,6 +18,7 @@
 FROM openjdk:8-alpine
 
 ARG spark_jars=jars
+ARG example_jars=examples/jars
 ARG img_path=kubernetes/dockerfiles
 ARG k8s_tests=kubernetes/tests
 
@@ -32,6 +33,7 @@ RUN set -ex && \
 apk upgrade --no-cache && \
 apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs && \
 mkdir -p /opt/spark && \
+mkdir -p /opt/spark/examples && \
 mkdir -p /opt/spark/work-dir && \
 touch /opt/spark/RELEASE && \
 rm /bin/sh && \
@@ -43,7 +45,8 @@ COPY ${spark_jars} /opt/spark/jars
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY ${img_path}/spark/entrypoint.sh /opt/
-COPY examples /opt/spark/examples
+COPY ${example_jars} /opt/spark/examples/jars
+COPY examples/src /opt/spark/examples/src
 COPY ${k8s_tests} /opt/spark/tests
 COPY data /opt/spark/data
 





spark git commit: [SPARK-25758][ML] Deprecate computeCost on BisectingKMeans

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/master 15524c41b -> c2962546d


[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans

## What changes were proposed in this pull request?

The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in
favor of adopting `ClusteringEvaluator` to evaluate the clustering.
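
For reference, a short PySpark sketch of the suggested replacement, assuming an active `spark` session and a toy dataset; note that `ClusteringEvaluator` reports the silhouette score, not the sum of squared distances returned by `computeCost`:

```python
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

# Hypothetical toy dataset with a "features" vector column.
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

model = BisectingKMeans(k=2, seed=1).fit(df)
predictions = model.transform(df)

# Evaluate the clustering with ClusteringEvaluator instead of computeCost.
silhouette = ClusteringEvaluator().evaluate(predictions)
print(silhouette)
```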

## How was this patch tested?

NA

Closes #22756 from mgaido91/SPARK-25758.

Authored-by: Marco Gaido 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c2962546
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c2962546
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c2962546

Branch: refs/heads/master
Commit: c2962546d9a5900a5628a31b83d2c4b22c3a7936
Parents: 15524c4
Author: Marco Gaido 
Authored: Thu Oct 18 10:32:25 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 10:32:25 2018 -0700

--
 .../scala/org/apache/spark/ml/clustering/BisectingKMeans.scala | 5 +
 python/pyspark/ml/clustering.py| 6 ++
 2 files changed, 11 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c2962546/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index 5cb16cc..2243d99 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -125,8 +125,13 @@ class BisectingKMeansModel private[ml] (
   /**
* Computes the sum of squared distances between the input points and their 
corresponding cluster
* centers.
+   *
+   * @deprecated This method is deprecated and will be removed in 3.0.0. Use 
ClusteringEvaluator
+   * instead. You can also get the cost on the training dataset in 
the summary.
*/
   @Since("2.0.0")
+  @deprecated("This method is deprecated and will be removed in 3.0.0. Use 
ClusteringEvaluator " +
+"instead. You can also get the cost on the training dataset in the 
summary.", "2.4.0")
   def computeCost(dataset: Dataset[_]): Double = {
 SchemaUtils.validateVectorCompatibleColumn(dataset.schema, getFeaturesCol)
 val data = DatasetUtils.columnToOldVector(dataset, getFeaturesCol)

http://git-wip-us.apache.org/repos/asf/spark/blob/c2962546/python/pyspark/ml/clustering.py
--
diff --git a/python/pyspark/ml/clustering.py b/python/pyspark/ml/clustering.py
index 5ef4e76..11eb124 100644
--- a/python/pyspark/ml/clustering.py
+++ b/python/pyspark/ml/clustering.py
@@ -540,7 +540,13 @@ class BisectingKMeansModel(JavaModel, JavaMLWritable, 
JavaMLReadable):
 """
 Computes the sum of squared distances between the input points
 and their corresponding cluster centers.
+
+..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator instead.
+   You can also get the cost on the training dataset in the summary.
 """
+warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator "
+  "instead. You can also get the cost on the training 
dataset in the summary.",
+  DeprecationWarning)
 return self._call_java("computeCost", dataset)
 
 @property





spark git commit: [SPARK-25758][ML] Deprecate computeCost on BisectingKMeans

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 ac9a6f08a -> 71a6a9ce8


[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans

## What changes were proposed in this pull request?

The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in
favor of adopting `ClusteringEvaluator` to evaluate the clustering.

## How was this patch tested?

NA

Closes #22756 from mgaido91/SPARK-25758.

Authored-by: Marco Gaido 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit c2962546d9a5900a5628a31b83d2c4b22c3a7936)
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71a6a9ce
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71a6a9ce
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71a6a9ce

Branch: refs/heads/branch-2.4
Commit: 71a6a9ce8558913bc410918c14b6799be9baaeb3
Parents: ac9a6f0
Author: Marco Gaido 
Authored: Thu Oct 18 10:32:25 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 10:32:37 2018 -0700

--
 .../scala/org/apache/spark/ml/clustering/BisectingKMeans.scala | 5 +
 python/pyspark/ml/clustering.py| 6 ++
 2 files changed, 11 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/71a6a9ce/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala 
b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
index 5cb16cc..2243d99 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
@@ -125,8 +125,13 @@ class BisectingKMeansModel private[ml] (
   /**
* Computes the sum of squared distances between the input points and their 
corresponding cluster
* centers.
+   *
+   * @deprecated This method is deprecated and will be removed in 3.0.0. Use 
ClusteringEvaluator
+   * instead. You can also get the cost on the training dataset in 
the summary.
*/
   @Since("2.0.0")
+  @deprecated("This method is deprecated and will be removed in 3.0.0. Use 
ClusteringEvaluator " +
+"instead. You can also get the cost on the training dataset in the 
summary.", "2.4.0")
   def computeCost(dataset: Dataset[_]): Double = {
 SchemaUtils.validateVectorCompatibleColumn(dataset.schema, getFeaturesCol)
 val data = DatasetUtils.columnToOldVector(dataset, getFeaturesCol)

http://git-wip-us.apache.org/repos/asf/spark/blob/71a6a9ce/python/pyspark/ml/clustering.py
--
diff --git a/python/pyspark/ml/clustering.py b/python/pyspark/ml/clustering.py
index 5ef4e76..11eb124 100644
--- a/python/pyspark/ml/clustering.py
+++ b/python/pyspark/ml/clustering.py
@@ -540,7 +540,13 @@ class BisectingKMeansModel(JavaModel, JavaMLWritable, 
JavaMLReadable):
 """
 Computes the sum of squared distances between the input points
 and their corresponding cluster centers.
+
+..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator instead.
+   You can also get the cost on the training dataset in the summary.
 """
+warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use 
ClusteringEvaluator "
+  "instead. You can also get the cost on the training 
dataset in the summary.",
+  DeprecationWarning)
 return self._call_java("computeCost", dataset)
 
 @property





[1/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master c2962546d -> 987f38658


http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/sql-pyspark-pandas-with-arrow.md
--
diff --git a/docs/sql-pyspark-pandas-with-arrow.md 
b/docs/sql-pyspark-pandas-with-arrow.md
new file mode 100644
index 000..e8e9f55
--- /dev/null
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -0,0 +1,166 @@
+---
+layout: global
+title: PySpark Usage Guide for Pandas with Apache Arrow
+displayTitle: PySpark Usage Guide for Pandas with Apache Arrow
+---
+
+* Table of contents
+{:toc}
+
+## Apache Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to 
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial to 
Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require some 
minor
+changes to configuration or code to take full advantage and ensure 
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight any 
differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an extra 
dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you must 
ensure that PyArrow
+is installed and available on all cluster nodes. The current supported version 
is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to a 
Pandas DataFrame
+using the call `toPandas()` and when creating a Spark DataFrame from a Pandas 
DataFrame with
+`createDataFrame(pandas_df)`. To use Arrow when executing these calls, users 
need to first set
+the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is 
disabled by default.
+
+In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' 
could fallback automatically
+to non-Arrow optimization implementation if an error occurs before the actual 
computation within Spark.
+This can be controlled by 'spark.sql.execution.arrow.fallback.enabled'.
+
+
+
+{% include_example dataframe_with_arrow python/sql/arrow.py %}
+
+
+
+Using the above optimizations with Arrow will produce the same results as when 
Arrow is not
+enabled. Note that even with Arrow, `toPandas()` results in the collection of 
all records in the
+DataFrame to the driver program and should be done on a small subset of the 
data. Not all Spark
+data types are currently supported and an error can be raised if a column has 
an unsupported type,
+see [Supported SQL Types](#supported-sql-types). If an error occurs during 
`createDataFrame()`,
+Spark will fall back to create the DataFrame without Arrow.
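
As a quick illustration of the conversion described above (a sketch assuming an active `spark` session and PyArrow installed; it mirrors the `dataframe_with_arrow` example referenced above):

```python
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers (disabled by default).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Optionally allow a silent fallback to the non-Arrow path on errors.
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

pdf = pd.DataFrame(np.random.rand(100, 3))   # a Pandas DataFrame
sdf = spark.createDataFrame(pdf)             # Pandas -> Spark, using Arrow
result_pdf = sdf.select("*").toPandas()      # Spark -> Pandas, using Arrow
```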
+
+## Pandas UDFs (a.k.a. Vectorized UDFs)
+
+Pandas UDFs are user defined functions that are executed by Spark using Arrow 
to transfer data and
+Pandas to work with the data. A Pandas UDF is defined using the keyword 
`pandas_udf` as a decorator
+or to wrap the function, no additional configuration is required. Currently, 
there are two types of
+Pandas UDF: Scalar and Grouped Map.
+
+### Scalar
+
+Scalar Pandas UDFs are used for vectorizing scalar operations. They can be 
used with functions such
+as `select` and `withColumn`. The Python function should take `pandas.Series` 
as inputs and return
+a `pandas.Series` of the same length. Internally, Spark will execute a Pandas 
UDF by splitting
+columns into batches and calling the function for each batch as a subset of 
the data, then
+concatenating the results together.
+
+The following example shows how to create a scalar Pandas UDF that computes 
the product of 2 columns.
+
+
+
+{% include_example scalar_pandas_udf python/sql/arrow.py %}
+
+
+
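
A condensed sketch of such a scalar Pandas UDF (the column name and values are illustrative; it assumes an active `spark` session):

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# Declare a scalar Pandas UDF; "double" is the return type.
@pandas_udf("double")
def multiply(a, b):
    # a and b arrive as pandas.Series batches; return a Series of the same length.
    return a * b

df = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
df.select(multiply(col("x"), col("x")).alias("x_squared")).show()
```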
+### Grouped Map
+Grouped map Pandas UDFs are used with `groupBy().apply()` which implements the 
"split-apply-combine" pattern.
+Split-apply-combine consists of three steps:
+* Split the data into groups by using `DataFrame.groupBy`.
+* Apply a function on each group. The input and output of the function are 
both `pandas.DataFrame`. The
+  input data contains all the rows and columns for each group.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output 
`DataFrame`.
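
A small sketch of these two pieces together, modelled on the usual subtract-mean example (group ids and values are illustrative; it assumes an active `spark` session):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# The string "id long, v double" is the output schema; GROUPED_MAP marks the UDF type.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows and columns of one group.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```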
+
+The column labels of the returned `pandas.DataFrame` must either match the 
field names in the
+defined output schema if specified as strings, or match the field data types 
by position if not
+strings, e.g. integer indices. See 
[pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/gene

[2/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index fb03ed2..42b00c9 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -4,11 +4,6 @@ displayTitle: Spark SQL, DataFrames and Datasets Guide
 title: Spark SQL and DataFrames
 ---
 
-* This will become a table of contents (this text will be scraped).
-{:toc}
-
-# Overview
-
 Spark SQL is a Spark module for structured data processing. Unlike the basic 
Spark RDD API, the interfaces provided
 by Spark SQL provide Spark with more information about the structure of both 
the data and the computation being performed. Internally,
 Spark SQL uses this extra information to perform extra optimizations. There 
are several ways to
@@ -24,17 +19,17 @@ the `spark-shell`, `pyspark` shell, or `sparkR` shell.
 
 One use of Spark SQL is to execute SQL queries.
 Spark SQL can also be used to read data from an existing Hive installation. 
For more on how to
-configure this feature, please refer to the [Hive Tables](#hive-tables) 
section. When running
+configure this feature, please refer to the [Hive 
Tables](sql-data-sources-hive-tables.html) section. When running
 SQL from within another programming language the results will be returned as a 
[Dataset/DataFrame](#datasets-and-dataframes).
-You can also interact with the SQL interface using the 
[command-line](#running-the-spark-sql-cli)
-or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
+You can also interact with the SQL interface using the 
[command-line](sql-distributed-sql-engine.html#running-the-spark-sql-cli)
+or over 
[JDBC/ODBC](#sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server).
 
 ## Datasets and DataFrames
 
 A Dataset is a distributed collection of data.
 Dataset is a new interface added in Spark 1.6 that provides the benefits of 
RDDs (strong
 typing, ability to use powerful lambda functions) with the benefits of Spark 
SQL's optimized
-execution engine. A Dataset can be [constructed](#creating-datasets) from JVM 
objects and then
+execution engine. A Dataset can be 
[constructed](sql-getting-started.html#creating-datasets) from JVM objects and 
then
 manipulated using functional transformations (`map`, `flatMap`, `filter`, 
etc.).
 The Dataset API is available in [Scala][scala-datasets] and
 [Java][java-datasets]. Python does not have the support for the Dataset API. 
But due to Python's dynamic nature,
@@ -43,7 +38,7 @@ many of the benefits of the Dataset API are already available 
(i.e. you can acce
 
 A DataFrame is a *Dataset* organized into named columns. It is conceptually
 equivalent to a table in a relational database or a data frame in R/Python, 
but with richer
-optimizations under the hood. DataFrames can be constructed from a wide array 
of [sources](#data-sources) such
+optimizations under the hood. DataFrames can be constructed from a wide array 
of [sources](sql-data-sources.html) such
 as: structured data files, tables in Hive, external databases, or existing 
RDDs.
 The DataFrame API is available in Scala,
 Java, [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and 
[R](api/R/index.html).
@@ -55,3115 +50,3 @@ While, in [Java API][java-datasets], users need to use 
`Dataset` to represe
 [java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
 
 Throughout this document, we will often refer to Scala/Java Datasets of `Row`s 
as DataFrames.
-
-# Getting Started
-
-## Starting Point: SparkSession
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
-
-{% include_example init_session 
scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
-
-{% include_example init_session 
java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder`:
-
-{% include_example init_session python/sql/basic.py %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic 
`SparkSession`, just call `sparkR.session()`:
-
-{% include_example init_session r/RSparkSQLExample.R %}
-
-Note that when invoked for the first time, `sparkR.session()` initializes a 
global `SparkSession` singleton instance, and always returns a reference to 
this instance for successive invocations. In this way, users only n

[3/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/sql-data-sources-troubleshooting.md
--
diff --git a/docs/sql-data-sources-troubleshooting.md 
b/docs/sql-data-sources-troubleshooting.md
new file mode 100644
index 000..5775eb8
--- /dev/null
+++ b/docs/sql-data-sources-troubleshooting.md
@@ -0,0 +1,9 @@
+---
+layout: global
+title: Troubleshooting
+displayTitle: Troubleshooting
+---
+
+ * The JDBC driver class must be visible to the primordial class loader on the 
client session and on all executors. This is because Java's DriverManager class 
does a security check that results in it ignoring all drivers not visible to 
the primordial class loader when one goes to open a connection. One convenient 
way to do this is to modify compute_classpath.sh on all worker nodes to include 
your driver JARs.
+ * Some databases, such as H2, convert all names to upper case. You'll need to 
use upper case to refer to those names in Spark SQL.
+ * Users can specify vendor-specific JDBC connection properties in the data 
source options to do special treatment. For example, 
`spark.read.format("jdbc").option("url", 
oracleJdbcUrl).option("oracle.jdbc.mapDateToTimestamp", "false")`. 
`oracle.jdbc.mapDateToTimestamp` defaults to true, users often need to disable 
this flag to avoid Oracle date being resolved as timestamp.
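
Expanding the last point into a runnable sketch (the Oracle URL, table name, and credentials below are placeholders):

```python
# Vendor-specific JDBC options are passed through as data source options.
oracle_jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/service"  # placeholder URL

df = (spark.read.format("jdbc")
      .option("url", oracle_jdbc_url)
      .option("dbtable", "SCHEMA.SOME_TABLE")               # placeholder table
      .option("user", "scott").option("password", "tiger")  # placeholder credentials
      .option("oracle.jdbc.mapDateToTimestamp", "false")
      .load())
```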

http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/sql-data-sources.md
--
diff --git a/docs/sql-data-sources.md b/docs/sql-data-sources.md
new file mode 100644
index 000..aa607ec
--- /dev/null
+++ b/docs/sql-data-sources.md
@@ -0,0 +1,42 @@
+---
+layout: global
+title: Data Sources
+displayTitle: Data Sources
+---
+
+
+Spark SQL supports operating on a variety of data sources through the 
DataFrame interface.
+A DataFrame can be operated on using relational transformations and can also 
be used to create a temporary view.
+Registering a DataFrame as a temporary view allows you to run SQL queries over 
its data. This section
+describes the general methods for loading and saving data using the Spark Data 
Sources and then
+goes into specific options that are available for the built-in data sources.
+
+
+* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
+  * [Manually Specifying 
Options](sql-data-sources-load-save-functions.html#manually-specifying-options)
+  * [Run SQL on files 
directly](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+  * [Save Modes](sql-data-sources-load-save-functions.html#save-modes)
+  * [Saving to Persistent 
Tables](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+  * [Bucketing, Sorting and 
Partitioning](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+* [Parquet Files](sql-data-sources-parquet.html)
+  * [Loading Data 
Programmatically](sql-data-sources-parquet.html#loading-data-programmatically)
+  * [Partition Discovery](sql-data-sources-parquet.html#partition-discovery)
+  * [Schema Merging](sql-data-sources-parquet.html#schema-merging)
+  * [Hive metastore Parquet table 
conversion](sql-data-sources-parquet.html#hive-metastore-parquet-table-conversion)
+  * [Configuration](sql-data-sources-parquet.html#configuration)
+* [ORC Files](sql-data-sources-orc.html)
+* [JSON Files](sql-data-sources-json.html)
+* [Hive Tables](sql-data-sources-hive-tables.html)
+  * [Specifying storage format for Hive 
tables](sql-data-sources-hive-tables.html#specifying-storage-format-for-hive-tables)
+  * [Interacting with Different Versions of Hive 
Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)
+* [JDBC To Other Databases](sql-data-sources-jdbc.html)
+* [Avro Files](sql-data-sources-avro.html)
+  * [Deploying](sql-data-sources-avro.html#deploying)
+  * [Load and Save 
Functions](sql-data-sources-avro.html#load-and-save-functions)
+  * [to_avro() and 
from_avro()](sql-data-sources-avro.html#to_avro-and-from_avro)
+  * [Data Source Option](sql-data-sources-avro.html#data-source-option)
+  * [Configuration](sql-data-sources-avro.html#configuration)
+  * [Compatibility with Databricks 
spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
+  * [Supported types for Avro -> Spark SQL 
conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
+  * [Supported types for Spark SQL -> Avro 
conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
+* [Troubleshooting](sql-data-sources-troubleshooting.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/sql-distributed-sql-engine.md
--
diff --git a/docs/sql-distributed-sql-engine.md 
b/docs/sql-distributed-sql-engine.md
new file mode 100644
index 000..66d6fda
--- /dev/null
+++ b/d

[4/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to 
multiple separate pages

## What changes were proposed in this pull request?

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Turing
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add a left menu for sql-programming-guide, keeping a first-level index for each
part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

## How was this patch tested?

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li 
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/987f3865
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/987f3865
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/987f3865

Branch: refs/heads/master
Commit: 987f386588de7311b066cf0f62f0eed64d4aa7d7
Parents: c296254
Author: Yuanjian Li 
Authored: Thu Oct 18 11:59:06 2018 -0700
Committer: gatorsmile 
Committed: Thu Oct 18 11:59:06 2018 -0700

--
 docs/_data/menu-sql.yaml   |   81 +
 docs/_includes/nav-left-wrapper-sql.html   |6 +
 docs/_includes/nav-left.html   |3 +-
 docs/_layouts/global.html  |8 +-
 docs/avro-data-source-guide.md |  380 ---
 docs/ml-pipeline.md|2 +-
 docs/sparkr.md |6 +-
 docs/sql-data-sources-avro.md  |  380 +++
 docs/sql-data-sources-hive-tables.md   |  166 +
 docs/sql-data-sources-jdbc.md  |  223 ++
 docs/sql-data-sources-json.md  |   81 +
 docs/sql-data-sources-load-save-functions.md   |  283 ++
 docs/sql-data-sources-orc.md   |   26 +
 docs/sql-data-sources-parquet.md   |  321 ++
 docs/sql-data-sources-troubleshooting.md   |9 +
 docs/sql-data-sources.md   |   42 +
 docs/sql-distributed-sql-engine.md |   84 +
 docs/sql-getting-started.md|  369 +++
 docs/sql-migration-guide-hive-compatibility.md |  137 +
 docs/sql-migration-guide-upgrade.md|  520 +++
 docs/sql-migration-guide.md|   23 +
 docs/sql-performance-turing.md |  151 +
 docs/sql-programming-guide.md  | 3127 +--
 docs/sql-pyspark-pandas-with-arrow.md  |  166 +
 docs/sql-reference.md  |  641 
 docs/structured-streaming-programming-guide.md |2 +-
 26 files changed, 3727 insertions(+), 3510 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/987f3865/docs/_data/menu-sql.yaml
--
diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
new file mode 100644
index 000..6718763
--- /dev/null
+++ b/docs/_data/menu-sql.yaml
@@ -0,0 +1,81 @@
+- text: Getting Started
+  url: sql-getting-started.html
+  subitems:
+- text: "Starting Point: SparkSession"
+  url: sql-getting-started.html#starting-point-sparksession
+- text: Creating DataFrames
+  url: sql-getting-started.html#creating-dataframes
+- text: Untyped Dataset Operations (DataFrame operations)
+  url: 
sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
+- text: Running SQL Queries Programmatically
+  url: sql-getting-started.html#running-sql-queries-programmatically
+- text: Global Temporary View
+  url: sql-getting-started.html#global-temporary-view
+- text: Creating Datasets
+  url: sql-getting-started.html#creating-datasets
+- text: Interoperating with RDDs
+  url: sql-getting-started.html#interoperating-with-rdds
+- text: Aggregations
+  url: sql-getting-started.html#aggregations
+- text: Data Sources
+  url: sql-data-sources.html
+  subitems:
+- text: "Generic Load/Save Functions"
+  url: sql-data-sources-load-save-functions.html
+- text: Parquet Files
+  url: sql-data-sources-parquet.html
+- text: ORC Files
+  url: sql-data-sources-orc.html
+- text: JSON Files
+  url: sql-data-sources-json.html
+- text: Hive Tables
+  url: sql-data-sources-hive-tables.html
+- text: JDBC To Other Databases
+  url: sql-data-sources-jdbc.html
+- text: Avro Files
+  url: sql-data-sources-avro.html
+- text: Troubleshooting
+  url: sql-data-sources-troubleshooting.html
+- text: Performance Turing
+  url: sql-performance-turing.html
+  subitems:
+- text: Caching Data In Memory
+  url: sql-performance-turing.html#caching-data-in-memory
+

[3/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-data-sources-troubleshooting.md
--
diff --git a/docs/sql-data-sources-troubleshooting.md 
b/docs/sql-data-sources-troubleshooting.md
new file mode 100644
index 000..5775eb8
--- /dev/null
+++ b/docs/sql-data-sources-troubleshooting.md
@@ -0,0 +1,9 @@
+---
+layout: global
+title: Troubleshooting
+displayTitle: Troubleshooting
+---
+
+ * The JDBC driver class must be visible to the primordial class loader on the 
client session and on all executors. This is because Java's DriverManager class 
does a security check that results in it ignoring all drivers not visible to 
the primordial class loader when one goes to open a connection. One convenient 
way to do this is to modify compute_classpath.sh on all worker nodes to include 
your driver JARs.
+ * Some databases, such as H2, convert all names to upper case. You'll need to 
use upper case to refer to those names in Spark SQL.
+ * Users can specify vendor-specific JDBC connection properties in the data 
source options to do special treatment. For example, 
`spark.read.format("jdbc").option("url", 
oracleJdbcUrl).option("oracle.jdbc.mapDateToTimestamp", "false")`. 
`oracle.jdbc.mapDateToTimestamp` defaults to true, users often need to disable 
this flag to avoid Oracle date being resolved as timestamp.

http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-data-sources.md
--
diff --git a/docs/sql-data-sources.md b/docs/sql-data-sources.md
new file mode 100644
index 000..aa607ec
--- /dev/null
+++ b/docs/sql-data-sources.md
@@ -0,0 +1,42 @@
+---
+layout: global
+title: Data Sources
+displayTitle: Data Sources
+---
+
+
+Spark SQL supports operating on a variety of data sources through the 
DataFrame interface.
+A DataFrame can be operated on using relational transformations and can also 
be used to create a temporary view.
+Registering a DataFrame as a temporary view allows you to run SQL queries over 
its data. This section
+describes the general methods for loading and saving data using the Spark Data 
Sources and then
+goes into specific options that are available for the built-in data sources.
+
+
+* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
+  * [Manually Specifying 
Options](sql-data-sources-load-save-functions.html#manually-specifying-options)
+  * [Run SQL on files 
directly](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+  * [Save Modes](sql-data-sources-load-save-functions.html#save-modes)
+  * [Saving to Persistent 
Tables](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+  * [Bucketing, Sorting and 
Partitioning](sql-data-sources-load-save-functions.html#run-sql-on-files-directly)
+* [Parquet Files](sql-data-sources-parquet.html)
+  * [Loading Data 
Programmatically](sql-data-sources-parquet.html#loading-data-programmatically)
+  * [Partition Discovery](sql-data-sources-parquet.html#partition-discovery)
+  * [Schema Merging](sql-data-sources-parquet.html#schema-merging)
+  * [Hive metastore Parquet table 
conversion](sql-data-sources-parquet.html#hive-metastore-parquet-table-conversion)
+  * [Configuration](sql-data-sources-parquet.html#configuration)
+* [ORC Files](sql-data-sources-orc.html)
+* [JSON Files](sql-data-sources-json.html)
+* [Hive Tables](sql-data-sources-hive-tables.html)
+  * [Specifying storage format for Hive 
tables](sql-data-sources-hive-tables.html#specifying-storage-format-for-hive-tables)
+  * [Interacting with Different Versions of Hive 
Metastore](sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)
+* [JDBC To Other Databases](sql-data-sources-jdbc.html)
+* [Avro Files](sql-data-sources-avro.html)
+  * [Deploying](sql-data-sources-avro.html#deploying)
+  * [Load and Save 
Functions](sql-data-sources-avro.html#load-and-save-functions)
+  * [to_avro() and 
from_avro()](sql-data-sources-avro.html#to_avro-and-from_avro)
+  * [Data Source Option](sql-data-sources-avro.html#data-source-option)
+  * [Configuration](sql-data-sources-avro.html#configuration)
+  * [Compatibility with Databricks 
spark-avro](sql-data-sources-avro.html#compatibility-with-databricks-spark-avro)
+  * [Supported types for Avro -> Spark SQL 
conversion](sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion)
+  * [Supported types for Spark SQL -> Avro 
conversion](sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
+* [Troubleshooting](sql-data-sources-troubleshooting.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-distributed-sql-engine.md
--
diff --git a/docs/sql-distributed-sql-engine.md 
b/docs/sql-distributed-sql-engine.md
new file mode 100644
index 000..66d6fda
--- /dev/null
+++ b/d

[4/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to 
multiple separate pages

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Turing
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add a left menu for sql-programming-guide, keeping a first-level index for each
part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li 
Signed-off-by: gatorsmile 
(cherry picked from commit 987f386588de7311b066cf0f62f0eed64d4aa7d7)
Signed-off-by: gatorsmile 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/71535516
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/71535516
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/71535516

Branch: refs/heads/branch-2.4
Commit: 71535516419831242fa7fc9177e8f5fdd3c6112b
Parents: 71a6a9c
Author: Yuanjian Li 
Authored: Thu Oct 18 11:59:06 2018 -0700
Committer: gatorsmile 
Committed: Thu Oct 18 12:12:05 2018 -0700

--
 docs/_data/menu-sql.yaml   |   81 +
 docs/_includes/nav-left-wrapper-sql.html   |6 +
 docs/_includes/nav-left.html   |3 +-
 docs/_layouts/global.html  |8 +-
 docs/avro-data-source-guide.md |  380 ---
 docs/ml-pipeline.md|2 +-
 docs/sparkr.md |6 +-
 docs/sql-data-sources-avro.md  |  380 +++
 docs/sql-data-sources-hive-tables.md   |  166 +
 docs/sql-data-sources-jdbc.md  |  223 ++
 docs/sql-data-sources-json.md  |   81 +
 docs/sql-data-sources-load-save-functions.md   |  283 ++
 docs/sql-data-sources-orc.md   |   26 +
 docs/sql-data-sources-parquet.md   |  321 ++
 docs/sql-data-sources-troubleshooting.md   |9 +
 docs/sql-data-sources.md   |   42 +
 docs/sql-distributed-sql-engine.md |   84 +
 docs/sql-getting-started.md|  369 +++
 docs/sql-migration-guide-hive-compatibility.md |  137 +
 docs/sql-migration-guide-upgrade.md|  516 +++
 docs/sql-migration-guide.md|   23 +
 docs/sql-performance-turing.md |  151 +
 docs/sql-programming-guide.md  | 3119 +--
 docs/sql-pyspark-pandas-with-arrow.md  |  166 +
 docs/sql-reference.md  |  641 
 docs/structured-streaming-programming-guide.md |2 +-
 26 files changed, 3723 insertions(+), 3502 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/_data/menu-sql.yaml
--
diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
new file mode 100644
index 000..6718763
--- /dev/null
+++ b/docs/_data/menu-sql.yaml
@@ -0,0 +1,81 @@
+- text: Getting Started
+  url: sql-getting-started.html
+  subitems:
+- text: "Starting Point: SparkSession"
+  url: sql-getting-started.html#starting-point-sparksession
+- text: Creating DataFrames
+  url: sql-getting-started.html#creating-dataframes
+- text: Untyped Dataset Operations (DataFrame operations)
+  url: 
sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
+- text: Running SQL Queries Programmatically
+  url: sql-getting-started.html#running-sql-queries-programmatically
+- text: Global Temporary View
+  url: sql-getting-started.html#global-temporary-view
+- text: Creating Datasets
+  url: sql-getting-started.html#creating-datasets
+- text: Interoperating with RDDs
+  url: sql-getting-started.html#interoperating-with-rdds
+- text: Aggregations
+  url: sql-getting-started.html#aggregations
+- text: Data Sources
+  url: sql-data-sources.html
+  subitems:
+- text: "Generic Load/Save Functions"
+  url: sql-data-sources-load-save-functions.html
+- text: Parquet Files
+  url: sql-data-sources-parquet.html
+- text: ORC Files
+  url: sql-data-sources-orc.html
+- text: JSON Files
+  url: sql-data-sources-json.html
+- text: Hive Tables
+  url: sql-data-sources-hive-tables.html
+- text: JDBC To Other Databases
+  url: sql-data-sources-jdbc.html
+- text: Avro Files
+  url: sql-data-sources-avro.html
+- text: Troubleshooting
+  url: sql-data-sources-troubleshooting.html
+- text: Performance Turing
+  url: sql-performance-turing.html
+  subitems:
+- text: Caching Data In Memory
+  url: sql-performance-turing.html#caching-

[2/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e45e50d..42b00c9 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -4,11 +4,6 @@ displayTitle: Spark SQL, DataFrames and Datasets Guide
 title: Spark SQL and DataFrames
 ---
 
-* This will become a table of contents (this text will be scraped).
-{:toc}
-
-# Overview
-
 Spark SQL is a Spark module for structured data processing. Unlike the basic 
Spark RDD API, the interfaces provided
 by Spark SQL provide Spark with more information about the structure of both 
the data and the computation being performed. Internally,
 Spark SQL uses this extra information to perform extra optimizations. There 
are several ways to
@@ -24,17 +19,17 @@ the `spark-shell`, `pyspark` shell, or `sparkR` shell.
 
 One use of Spark SQL is to execute SQL queries.
 Spark SQL can also be used to read data from an existing Hive installation. 
For more on how to
-configure this feature, please refer to the [Hive Tables](#hive-tables) 
section. When running
+configure this feature, please refer to the [Hive 
Tables](sql-data-sources-hive-tables.html) section. When running
 SQL from within another programming language the results will be returned as a 
[Dataset/DataFrame](#datasets-and-dataframes).
-You can also interact with the SQL interface using the 
[command-line](#running-the-spark-sql-cli)
-or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
+You can also interact with the SQL interface using the 
[command-line](sql-distributed-sql-engine.html#running-the-spark-sql-cli)
+or over 
[JDBC/ODBC](#sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server).
 
 ## Datasets and DataFrames
 
 A Dataset is a distributed collection of data.
 Dataset is a new interface added in Spark 1.6 that provides the benefits of 
RDDs (strong
 typing, ability to use powerful lambda functions) with the benefits of Spark 
SQL's optimized
-execution engine. A Dataset can be [constructed](#creating-datasets) from JVM 
objects and then
+execution engine. A Dataset can be 
[constructed](sql-getting-started.html#creating-datasets) from JVM objects and 
then
 manipulated using functional transformations (`map`, `flatMap`, `filter`, 
etc.).
 The Dataset API is available in [Scala][scala-datasets] and
 [Java][java-datasets]. Python does not have the support for the Dataset API. 
But due to Python's dynamic nature,
@@ -43,7 +38,7 @@ many of the benefits of the Dataset API are already available 
(i.e. you can acce
 
 A DataFrame is a *Dataset* organized into named columns. It is conceptually
 equivalent to a table in a relational database or a data frame in R/Python, 
but with richer
-optimizations under the hood. DataFrames can be constructed from a wide array 
of [sources](#data-sources) such
+optimizations under the hood. DataFrames can be constructed from a wide array 
of [sources](sql-data-sources.html) such
 as: structured data files, tables in Hive, external databases, or existing 
RDDs.
 The DataFrame API is available in Scala,
 Java, [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and 
[R](api/R/index.html).
@@ -55,3107 +50,3 @@ While, in [Java API][java-datasets], users need to use 
`Dataset` to represe
 [java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
 
 Throughout this document, we will often refer to Scala/Java Datasets of `Row`s 
as DataFrames.
-
-# Getting Started
-
-## Starting Point: SparkSession
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
-
-{% include_example init_session 
scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
-
-{% include_example init_session 
java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder`:
-
-{% include_example init_session python/sql/basic.py %}
-
-
-
-
-The entry point into all functionality in Spark is the 
[`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic 
`SparkSession`, just call `sparkR.session()`:
-
-{% include_example init_session r/RSparkSQLExample.R %}
-
-Note that when invoked for the first time, `sparkR.session()` initializes a 
global `SparkSession` singleton instance, and always returns a reference to 
this instance for successive invocations. In this way, users only n

[1/4] spark git commit: [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 71a6a9ce8 -> 715355164


http://git-wip-us.apache.org/repos/asf/spark/blob/71535516/docs/sql-pyspark-pandas-with-arrow.md
--
diff --git a/docs/sql-pyspark-pandas-with-arrow.md 
b/docs/sql-pyspark-pandas-with-arrow.md
new file mode 100644
index 000..e8e9f55
--- /dev/null
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -0,0 +1,166 @@
+---
+layout: global
+title: PySpark Usage Guide for Pandas with Apache Arrow
+displayTitle: PySpark Usage Guide for Pandas with Apache Arrow
+---
+
+* Table of contents
+{:toc}
+
+## Apache Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to 
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial to 
Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require some 
minor
+changes to configuration or code to take full advantage and ensure 
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight any 
differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an extra 
dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you must 
ensure that PyArrow
+is installed and available on all cluster nodes. The current supported version 
is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to a 
Pandas DataFrame
+using the call `toPandas()` and when creating a Spark DataFrame from a Pandas 
DataFrame with
+`createDataFrame(pandas_df)`. To use Arrow when executing these calls, users 
need to first set
+the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is 
disabled by default.
+
+In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' can automatically fall back
+to a non-Arrow implementation if an error occurs before the actual computation within Spark.
+This can be controlled by 'spark.sql.execution.arrow.fallback.enabled'.
+
+
+
+{% include_example dataframe_with_arrow python/sql/arrow.py %}
+
+
+
+Using the above optimizations with Arrow will produce the same results as when 
Arrow is not
+enabled. Note that even with Arrow, `toPandas()` results in the collection of 
all records in the
+DataFrame to the driver program and should be done on a small subset of the 
data. Not all Spark
+data types are currently supported and an error can be raised if a column has 
an unsupported type,
+see [Supported SQL Types](#supported-sql-types). If an error occurs during 
`createDataFrame()`,
+Spark will fall back to create the DataFrame without Arrow.
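As a rough, self-contained sketch of the conversion path described above (the sample data is made up; the config key is the one named in this guide):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Enable Arrow-based columnar data transfers (disabled by default).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Create a Spark DataFrame from a Pandas DataFrame using Arrow.
    pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    df = spark.createDataFrame(pdf)

    # Convert back to Pandas using Arrow; this still collects all rows to the driver.
    result_pdf = df.select("*").toPandas()
    print(result_pdf)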
+
+## Pandas UDFs (a.k.a. Vectorized UDFs)
+
+Pandas UDFs are user defined functions that are executed by Spark using Arrow 
to transfer data and
+Pandas to work with the data. A Pandas UDF is defined using the keyword 
`pandas_udf` as a decorator
+or to wrap the function; no additional configuration is required. Currently, 
there are two types of
+Pandas UDF: Scalar and Grouped Map.
+
+### Scalar
+
+Scalar Pandas UDFs are used for vectorizing scalar operations. They can be 
used with functions such
+as `select` and `withColumn`. The Python function should take `pandas.Series` 
as inputs and return
+a `pandas.Series` of the same length. Internally, Spark will execute a Pandas 
UDF by splitting
+columns into batches and calling the function for each batch as a subset of 
the data, then
+concatenating the results together.
+
+The following example shows how to create a scalar Pandas UDF that computes 
the product of 2 columns.
+
+
+
+{% include_example scalar_pandas_udf python/sql/arrow.py %}
+
+
+
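A minimal sketch of such a scalar Pandas UDF, assuming an active SparkSession (the column and function names are illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    # The function operates on pandas.Series batches and returns a Series of the same length.
    def multiply_func(a, b):
        return a * b

    multiply = pandas_udf(multiply_func, returnType=LongType())

    df = spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3]}))
    df.select(multiply(col("x"), col("x"))).show()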
+### Grouped Map
+Grouped map Pandas UDFs are used with `groupBy().apply()` which implements the 
"split-apply-combine" pattern.
+Split-apply-combine consists of three steps:
+* Split the data into groups by using `DataFrame.groupBy`.
+* Apply a function on each group. The input and output of the function are 
both `pandas.DataFrame`. The
+  input data contains all the rows and columns for each group.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output 
`DataFrame`.
+
+The column labels of the returned `pandas.DataFrame` must either match the 
field names in the
+defined output schema if specified as strings, or match the field data types 
by position if not
+strings, e.g. integer indices. See 
[pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/
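A minimal sketch of the grouped map pattern described above, assuming an active SparkSession (the output schema string, grouping key, and function body are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

    # The decorated function receives a pandas.DataFrame with all rows of one group
    # and must return a pandas.DataFrame matching the declared output schema.
    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def subtract_mean(pdf):
        return pdf.assign(v=pdf.v - pdf.v.mean())

    df.groupby("id").apply(subtract_mean).show()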

svn commit: r30136 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_18_12_02-987f386-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Thu Oct 18 19:17:29 2018
New Revision: 30136

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_18_12_02-987f386 docs


[This commit notification would consist of 1484 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21402][SQL][BACKPORT-2.2] Fix java array of structs deserialization

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 773c8236c -> 2e3b923e0


[SPARK-21402][SQL][BACKPORT-2.2] Fix java array of structs deserialization

This PR is to backport #22708 to branch 2.2.

## What changes were proposed in this pull request?

The MapObjects expression is used to map array elements to Java beans. The struct type of the elements is inferred from the Java bean structure and ends up with a mixed-up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows the element type for MapObjects to be provided during analysis based on the resolved input data rather than on the Java bean.

## How was this patch tested?

Added a test case.
Built the complete project on Travis.

dongjoon-hyun cloud-fan

Closes #22768 from vofque/SPARK-21402-2.2.

Lead-authored-by: Vladimir Kuriatkov 
Co-authored-by: Vladimir Kuriatkov 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2e3b923e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2e3b923e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2e3b923e

Branch: refs/heads/branch-2.2
Commit: 2e3b923e0095d52607670905fd18c11e231b458f
Parents: 773c823
Author: Vladimir Kuriatkov 
Authored: Thu Oct 18 13:39:50 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 13:39:50 2018 -0700

--
 .../spark/sql/catalyst/JavaTypeInference.scala  |   6 +-
 .../spark/sql/JavaBeanWithArraySuite.java   | 168 +++
 .../resources/test-data/with-array-fields.json  |   3 +
 3 files changed, 174 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2e3b923e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
index 2698fae..afbf9ce 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
@@ -267,12 +267,12 @@ object JavaTypeInference {
 
   case c if listType.isAssignableFrom(typeToken) =>
 val et = elementType(typeToken)
+
 val array =
   Invoke(
-MapObjects(
+UnresolvedMapObjects(
   p => deserializerFor(et, Some(p)),
-  getPath,
-  inferDataType(et)._1),
+  getPath),
 "array",
 ObjectType(classOf[Array[Any]]))
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2e3b923e/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
new file mode 100644
index 000..1cb8507
--- /dev/null
+++ 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.org.apache.spark.sql;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.test.TestSparkSession;
+import org.apache.spark.sql.types.*;
+
+public class JavaBeanWithArraySuite {
+
+  private static final List<Record> RECORDS = new ArrayList<>();
+
+  static {
+    RECORDS.add(new Record(1, Arrays.asList(new Interval(111, 211), new Interval(121, 221))));
+    RECORDS.add(new Record(2, Arrays.asList(new Interval(112, 212), new Interval(122, 222))));
+    RECORDS.add(new Record(3, Arrays.asList(new Interval(113, 213), new Interval(123, 223))));
+ 

spark git commit: [SPARK-24499][DOC][FOLLOW-UP] Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 715355164 -> fd5b24726


[SPARK-24499][DOC][FOLLOW-UP] Split the page of sql-programming-guide.html to 
multiple separate pages

## What changes were proposed in this pull request?
Forgot to remove the link for `Upgrading From Spark SQL 2.4 to 3.0` when merging to 2.4.

## How was this patch tested?
N/A

Closes #22769 from gatorsmile/test2.4.

Authored-by: gatorsmile 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd5b2472
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd5b2472
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd5b2472

Branch: refs/heads/branch-2.4
Commit: fd5b247262761271ac36d67fe66f7814acc664a9
Parents: 7153551
Author: gatorsmile 
Authored: Thu Oct 18 13:51:13 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 13:51:13 2018 -0700

--
 docs/sql-migration-guide.md | 1 -
 1 file changed, 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fd5b2472/docs/sql-migration-guide.md
--
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 71d83e8..a3fc52c 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -5,7 +5,6 @@ displayTitle: Migration Guide
 ---
 
 * [Spark SQL Upgrading Guide](sql-migration-guide-upgrade.html)
-  * [Upgrading From Spark SQL 2.4 to 
3.0](sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-30)
   * [Upgrading From Spark SQL 2.3 to 
2.4](sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24)
   * [Upgrading From Spark SQL 2.3.0 to 2.3.1 and 
above](sql-migration-guide-upgrade.html#upgrading-from-spark-sql-230-to-231-and-above)
   * [Upgrading From Spark SQL 2.2 to 
2.3](sql-migration-guide-upgrade.html#upgrading-from-spark-sql-22-to-23)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r30144 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_18_14_02-fd5b247-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Thu Oct 18 21:17:11 2018
New Revision: 30144

Log:
Apache Spark 2.4.1-SNAPSHOT-2018_10_18_14_02-fd5b247 docs


[This commit notification would consist of 1478 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-21402][SQL][BACKPORT-2.3] Fix java array of structs deserialization

2018-10-18 Thread dongjoon
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 0726bc56f -> 61b301cc7


[SPARK-21402][SQL][BACKPORT-2.3] Fix java array of structs deserialization

This PR is to backport #22708 to branch 2.3.

## What changes were proposed in this pull request?

The MapObjects expression is used to map array elements to Java beans. The struct type of the elements is inferred from the Java bean structure and ends up with a mixed-up field order.
I used UnresolvedMapObjects instead of MapObjects, which allows the element type for MapObjects to be provided during analysis based on the resolved input data rather than on the Java bean.

## How was this patch tested?

Added a test case.
Built the complete project on Travis.

dongjoon-hyun cloud-fan

Closes #22767 from vofque/SPARK-21402-2.3.

Authored-by: Vladimir Kuriatkov 
Signed-off-by: Dongjoon Hyun 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/61b301cc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/61b301cc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/61b301cc

Branch: refs/heads/branch-2.3
Commit: 61b301cc7bf3fce4c034be3171291d5212c386e1
Parents: 0726bc5
Author: Vladimir Kuriatkov 
Authored: Thu Oct 18 14:46:03 2018 -0700
Committer: Dongjoon Hyun 
Committed: Thu Oct 18 14:46:03 2018 -0700

--
 .../spark/sql/catalyst/JavaTypeInference.scala  |   3 +-
 .../spark/sql/JavaBeanWithArraySuite.java   | 154 +++
 .../resources/test-data/with-array-fields.json  |   3 +
 3 files changed, 158 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/61b301cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
index 3ecc137..7a226d7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
@@ -271,10 +271,9 @@ object JavaTypeInference {
 
   case c if listType.isAssignableFrom(typeToken) =>
 val et = elementType(typeToken)
-MapObjects(
+UnresolvedMapObjects(
   p => deserializerFor(et, Some(p)),
   getPath,
-  inferDataType(et)._1,
   customCollectionCls = Some(c))
 
   case _ if mapType.isAssignableFrom(typeToken) =>

http://git-wip-us.apache.org/repos/asf/spark/blob/61b301cc/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
new file mode 100644
index 000..70dd110
--- /dev/null
+++ 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanWithArraySuite.java
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.org.apache.spark.sql;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.test.TestSparkSession;
+
+public class JavaBeanWithArraySuite {
+
+  private static final List<Record> RECORDS = new ArrayList<>();
+
+  static {
+    RECORDS.add(new Record(1, Arrays.asList(new Interval(111, 211), new Interval(121, 221))));
+    RECORDS.add(new Record(2, Arrays.asList(new Interval(112, 212), new Interval(122, 222))));
+    RECORDS.add(new Record(3, Arrays.asList(new Interval(113, 213), new Interval(123, 223))));
+  }
+
+  private TestSparkSession spark;
+
+  @Before
+  public void setUp() {
+spark = new TestSparkSession();
+  }
+
+  @After
+  pub

spark git commit: [SPARK-25683][CORE] Updated the log for the firstTime event Drop occurs

2018-10-18 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 987f38658 -> f704ebe90


[SPARK-25683][CORE] Updated the log for the firstTime event Drop occurs

## What changes were proposed in this pull request?

When the first event drop occurred, lastReportTimestamp was printed in the log as
Wed Dec 31 16:00:00 PST 1969:
(Dropped 1 events from eventLog since Wed Dec 31 16:00:00 PST 1969.)

The reason is that lastReportTimestamp is initialized to 0.

The log is now updated to print "... since the application started" when
'lastReportTimestamp' == 0, which happens on the first event drop.

## How was this patch tested?
Manually verified.

Closes #22677 from shivusondur/AsyncEvent1.

Authored-by: shivusondur 
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f704ebe9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f704ebe9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f704ebe9

Branch: refs/heads/master
Commit: f704ebe9026ac065416d50adcb9c807c7c7a4102
Parents: 987f386
Author: shivusondur 
Authored: Thu Oct 18 15:05:56 2018 -0700
Committer: Marcelo Vanzin 
Committed: Thu Oct 18 15:05:56 2018 -0700

--
 .../main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f704ebe9/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala 
b/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala
index e2b6df4..7cd2b86 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala
@@ -169,7 +169,8 @@ private class AsyncEventQueue(
   val prevLastReportTimestamp = lastReportTimestamp
   lastReportTimestamp = System.currentTimeMillis()
   val previous = new java.util.Date(prevLastReportTimestamp)
-  logWarning(s"Dropped $droppedCount events from $name since 
$previous.")
+  logWarning(s"Dropped $droppedCount events from $name since " +
+s"${if (prevLastReportTimestamp == 0) "the application started" 
else s"$previous"}.")
 }
   }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r30145 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_18_16_03-f704ebe-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Thu Oct 18 23:17:41 2018
New Revision: 30145

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_18_16_03-f704ebe docs


[This commit notification would consist of 1484 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r30151 - in /dev/spark/2.3.3-SNAPSHOT-2018_10_18_18_02-61b301c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Fri Oct 19 01:16:14 2018
New Revision: 30151

Log:
Apache Spark 2.3.3-SNAPSHOT-2018_10_18_18_02-61b301c docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator

2018-10-18 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master f704ebe90 -> d0ecff285


[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use 
ClusteringEvaluator

## What changes were proposed in this pull request?

The PR updates the examples for `BisectingKMeans` so that they don't use the 
deprecated method `computeCost` (see SPARK-25758).
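For reference, the updated pattern in PySpark looks roughly like the following (the dataset path mirrors the standard Spark examples data and may need adjusting):

    from pyspark.ml.clustering import BisectingKMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load the libsvm-formatted example clustering data shipped with Spark.
    dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    bkm = BisectingKMeans().setK(2).setSeed(1)
    model = bkm.fit(dataset)

    # Evaluate clustering with the Silhouette score instead of the deprecated computeCost.
    predictions = model.transform(dataset)
    evaluator = ClusteringEvaluator()
    silhouette = evaluator.evaluate(predictions)
    print("Silhouette with squared euclidean distance = " + str(silhouette))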

## How was this patch tested?

running examples

Closes #22763 from mgaido91/SPARK-25764.

Authored-by: Marco Gaido 
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d0ecff28
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d0ecff28
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d0ecff28

Branch: refs/heads/master
Commit: d0ecff28545ac81f5ba7ac06957ced65b6e3ebcd
Parents: f704ebe
Author: Marco Gaido 
Authored: Fri Oct 19 09:33:46 2018 +0800
Committer: Wenchen Fan 
Committed: Fri Oct 19 09:33:46 2018 +0800

--
 .../spark/examples/ml/JavaBisectingKMeansExample.java   | 12 +---
 .../src/main/python/ml/bisecting_k_means_example.py | 12 +---
 .../spark/examples/ml/BisectingKMeansExample.scala  | 12 +---
 3 files changed, 27 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d0ecff28/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
index 8c82aaa..f517dc3 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
@@ -20,6 +20,7 @@ package org.apache.spark.examples.ml;
 // $example on$
 import org.apache.spark.ml.clustering.BisectingKMeans;
 import org.apache.spark.ml.clustering.BisectingKMeansModel;
+import org.apache.spark.ml.evaluation.ClusteringEvaluator;
 import org.apache.spark.ml.linalg.Vector;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -50,9 +51,14 @@ public class JavaBisectingKMeansExample {
 BisectingKMeans bkm = new BisectingKMeans().setK(2).setSeed(1);
 BisectingKMeansModel model = bkm.fit(dataset);
 
-// Evaluate clustering.
-double cost = model.computeCost(dataset);
-System.out.println("Within Set Sum of Squared Errors = " + cost);
+// Make predictions
+Dataset<Row> predictions = model.transform(dataset);
+
+// Evaluate clustering by computing Silhouette score
+ClusteringEvaluator evaluator = new ClusteringEvaluator();
+
+double silhouette = evaluator.evaluate(predictions);
+System.out.println("Silhouette with squared euclidean distance = " + 
silhouette);
 
 // Shows the result.
 System.out.println("Cluster Centers: ");

http://git-wip-us.apache.org/repos/asf/spark/blob/d0ecff28/examples/src/main/python/ml/bisecting_k_means_example.py
--
diff --git a/examples/src/main/python/ml/bisecting_k_means_example.py 
b/examples/src/main/python/ml/bisecting_k_means_example.py
index 7842d20..82adb33 100644
--- a/examples/src/main/python/ml/bisecting_k_means_example.py
+++ b/examples/src/main/python/ml/bisecting_k_means_example.py
@@ -24,6 +24,7 @@ from __future__ import print_function
 
 # $example on$
 from pyspark.ml.clustering import BisectingKMeans
+from pyspark.ml.evaluation import ClusteringEvaluator
 # $example off$
 from pyspark.sql import SparkSession
 
@@ -41,9 +42,14 @@ if __name__ == "__main__":
 bkm = BisectingKMeans().setK(2).setSeed(1)
 model = bkm.fit(dataset)
 
-# Evaluate clustering.
-cost = model.computeCost(dataset)
-print("Within Set Sum of Squared Errors = " + str(cost))
+# Make predictions
+predictions = model.transform(dataset)
+
+# Evaluate clustering by computing Silhouette score
+evaluator = ClusteringEvaluator()
+
+silhouette = evaluator.evaluate(predictions)
+print("Silhouette with squared euclidean distance = " + str(silhouette))
 
 # Shows the result.
 print("Cluster Centers: ")

http://git-wip-us.apache.org/repos/asf/spark/blob/d0ecff28/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
--
diff --git 
a/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
 
b/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
index 5f8f2c9..14e13df 100644
--- 
a/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
+++ 
b/examples/src/main/scala/or

spark git commit: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator

2018-10-18 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 fd5b24726 -> 36307b1e4


[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use 
ClusteringEvaluator

## What changes were proposed in this pull request?

The PR updates the examples for `BisectingKMeans` so that they don't use the 
deprecated method `computeCost` (see SPARK-25758).

## How was this patch tested?

running examples

Closes #22763 from mgaido91/SPARK-25764.

Authored-by: Marco Gaido 
Signed-off-by: Wenchen Fan 
(cherry picked from commit d0ecff28545ac81f5ba7ac06957ced65b6e3ebcd)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/36307b1e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/36307b1e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/36307b1e

Branch: refs/heads/branch-2.4
Commit: 36307b1e4b42ce22b07e7a3fc2679c4b5e7c34c8
Parents: fd5b247
Author: Marco Gaido 
Authored: Fri Oct 19 09:33:46 2018 +0800
Committer: Wenchen Fan 
Committed: Fri Oct 19 09:34:25 2018 +0800

--
 .../spark/examples/ml/JavaBisectingKMeansExample.java   | 12 +---
 .../src/main/python/ml/bisecting_k_means_example.py | 12 +---
 .../spark/examples/ml/BisectingKMeansExample.scala  | 12 +---
 3 files changed, 27 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/36307b1e/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
index 8c82aaa..f517dc3 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java
@@ -20,6 +20,7 @@ package org.apache.spark.examples.ml;
 // $example on$
 import org.apache.spark.ml.clustering.BisectingKMeans;
 import org.apache.spark.ml.clustering.BisectingKMeansModel;
+import org.apache.spark.ml.evaluation.ClusteringEvaluator;
 import org.apache.spark.ml.linalg.Vector;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -50,9 +51,14 @@ public class JavaBisectingKMeansExample {
 BisectingKMeans bkm = new BisectingKMeans().setK(2).setSeed(1);
 BisectingKMeansModel model = bkm.fit(dataset);
 
-// Evaluate clustering.
-double cost = model.computeCost(dataset);
-System.out.println("Within Set Sum of Squared Errors = " + cost);
+// Make predictions
+Dataset<Row> predictions = model.transform(dataset);
+
+// Evaluate clustering by computing Silhouette score
+ClusteringEvaluator evaluator = new ClusteringEvaluator();
+
+double silhouette = evaluator.evaluate(predictions);
+System.out.println("Silhouette with squared euclidean distance = " + 
silhouette);
 
 // Shows the result.
 System.out.println("Cluster Centers: ");

http://git-wip-us.apache.org/repos/asf/spark/blob/36307b1e/examples/src/main/python/ml/bisecting_k_means_example.py
--
diff --git a/examples/src/main/python/ml/bisecting_k_means_example.py 
b/examples/src/main/python/ml/bisecting_k_means_example.py
index 7842d20..82adb33 100644
--- a/examples/src/main/python/ml/bisecting_k_means_example.py
+++ b/examples/src/main/python/ml/bisecting_k_means_example.py
@@ -24,6 +24,7 @@ from __future__ import print_function
 
 # $example on$
 from pyspark.ml.clustering import BisectingKMeans
+from pyspark.ml.evaluation import ClusteringEvaluator
 # $example off$
 from pyspark.sql import SparkSession
 
@@ -41,9 +42,14 @@ if __name__ == "__main__":
 bkm = BisectingKMeans().setK(2).setSeed(1)
 model = bkm.fit(dataset)
 
-# Evaluate clustering.
-cost = model.computeCost(dataset)
-print("Within Set Sum of Squared Errors = " + str(cost))
+# Make predictions
+predictions = model.transform(dataset)
+
+# Evaluate clustering by computing Silhouette score
+evaluator = ClusteringEvaluator()
+
+silhouette = evaluator.evaluate(predictions)
+print("Silhouette with squared euclidean distance = " + str(silhouette))
 
 # Shows the result.
 print("Cluster Centers: ")

http://git-wip-us.apache.org/repos/asf/spark/blob/36307b1e/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
--
diff --git 
a/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
 
b/examples/src/main/scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala
index 5f8f2c9..14e13df 100644
--- 
a/examples/s

spark git commit: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

2018-10-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d0ecff285 -> 1e6c1d8bf


[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They work 
fine in single-line mode because the line separation is done by Hadoop, which 
can handle all the different types of line separators. This PR fixes the issue by 
enabling Univocity's line separator detection in multiline mode, which detects 
'\r\n', '\r', or '\n' automatically, as Hadoop does in single-line mode.
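A minimal sketch of reading such a file after this change (the path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With multiLine enabled, the line separator (\r\n, \r, or \n) is auto-detected.
    cars = (spark.read
        .format("csv")
        .option("multiLine", "true")
        .option("header", "true")
        .load("/path/to/cars-crlf.csv"))

    cars.show()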

## How was this patch tested?

Unit test with a file with crlf line endings.

Closes #22503 from justinuang/fix-clrf-multiline.

Authored-by: Justin Uang 
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e6c1d8b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e6c1d8b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e6c1d8b

Branch: refs/heads/master
Commit: 1e6c1d8bfb7841596452e25b870823b9a4b267f4
Parents: d0ecff2
Author: Justin Uang 
Authored: Fri Oct 19 11:13:02 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Oct 19 11:13:02 2018 +0800

--
 .../org/apache/spark/sql/catalyst/csv/CSVOptions.scala  |  2 ++
 sql/core/src/test/resources/test-data/cars-crlf.csv |  7 +++
 .../spark/sql/execution/datasources/csv/CSVSuite.scala  | 12 
 3 files changed, 21 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
index 3e25d82..cdaaa17 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -212,6 +212,8 @@ class CSVOptions(
 settings.setEmptyValue(emptyValueInRead)
 settings.setMaxCharsPerColumn(maxCharsPerColumn)
 
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+settings.setLineSeparatorDetectionEnabled(multiLine == true)
+
 settings
   }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/resources/test-data/cars-crlf.csv
--
diff --git a/sql/core/src/test/resources/test-data/cars-crlf.csv 
b/sql/core/src/test/resources/test-data/cars-crlf.csv
new file mode 100644
index 000..d018d08
--- /dev/null
+++ b/sql/core/src/test/resources/test-data/cars-crlf.csv
@@ -0,0 +1,7 @@
+
+year,make,model,comment,blank
+"2012","Tesla","S","No comment",
+
+1997,Ford,E350,"Go get one now they are going fast",
+2015,Chevy,Volt
+

http://git-wip-us.apache.org/repos/asf/spark/blob/1e6c1d8b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index d59035b..d43efc8 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -52,6 +52,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with 
SQLTestUtils with Te
   private val carsNullFile = "test-data/cars-null.csv"
   private val carsEmptyValueFile = "test-data/cars-empty-value.csv"
   private val carsBlankColName = "test-data/cars-blank-column-name.csv"
+  private val carsCrlf = "test-data/cars-crlf.csv"
   private val emptyFile = "test-data/empty.csv"
   private val commentsFile = "test-data/comments.csv"
   private val disableCommentsFile = "test-data/disable_comments.csv"
@@ -220,6 +221,17 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils with Te
 }
   }
 
+  test("crlf line separators in multiline mode") {
+val cars = spark
+  .read
+  .format("csv")
+  .option("multiLine", "true")
+  .option("header", "true")
+  .load(testFile(carsCrlf))
+
+verifyCars(cars, withHeader = true)
+  }
+
   test("test aliases sep and encoding for delimiter and charset") {
 // scalastyle:off
 val cars = spark


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r30152 - in /dev/spark/3.0.0-SNAPSHOT-2018_10_18_20_02-d0ecff2-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Fri Oct 19 03:17:03 2018
New Revision: 30152

Log:
Apache Spark 3.0.0-SNAPSHOT-2018_10_18_20_02-d0ecff2 docs


[This commit notification would consist of 1484 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r30153 - in /dev/spark/2.4.1-SNAPSHOT-2018_10_18_22_02-36307b1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-10-18 Thread pwendell
Author: pwendell
Date: Fri Oct 19 05:16:45 2018
New Revision: 30153

Log:
Apache Spark 2.4.1-SNAPSHOT-2018_10_18_22_02-36307b1 docs


[This commit notification would consist of 1478 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency

2018-10-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master 1e6c1d8bf -> c8f7691c6


[MINOR][DOC] Spacing items in migration guide for readability and consistency

## What changes were proposed in this pull request?

Currently, the migration guide has no space between items, which looks too 
compact and is hard to read. Some items already had spaces between them in 
the migration guide. This PR formats them consistently for readability.

Before:

![screen shot 2018-10-18 at 10 00 04 
am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png)

After:

![screen shot 2018-10-18 at 9 53 55 
am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png)

## How was this patch tested?

Manually tested:

Closes #22761 from HyukjinKwon/minor-migration-doc.

Authored-by: hyukjinkwon 
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c8f7691c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c8f7691c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c8f7691c

Branch: refs/heads/master
Commit: c8f7691c64a28174a54e8faa159b50a3836a7225
Parents: 1e6c1d8
Author: hyukjinkwon 
Authored: Fri Oct 19 13:55:27 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Oct 19 13:55:27 2018 +0800

--
 docs/sql-migration-guide-upgrade.md | 54 
 1 file changed, 54 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c8f7691c/docs/sql-migration-guide-upgrade.md
--
diff --git a/docs/sql-migration-guide-upgrade.md 
b/docs/sql-migration-guide-upgrade.md
index 7faf8bd..7871a49 100644
--- a/docs/sql-migration-guide-upgrade.md
+++ b/docs/sql-migration-guide-upgrade.md
@@ -74,26 +74,47 @@ displayTitle: Spark SQL Upgrading Guide
   
 
   - Since Spark 2.4, when there is a struct field in front of the IN operator 
before a subquery, the inner query must contain a struct field as well. In 
previous versions, instead, the fields of the struct were compared to the 
output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 
2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a 
in (select 1, 'a' from range(1))` is not. In previous version it was the 
opposite.
+
   - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, 
then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became 
case-sensitive and would resolve to columns (unless typed in lower case). In 
Spark 2.4 this has been fixed and the functions are no longer case-sensitive.
+
   - Since Spark 2.4, Spark will evaluate the set operations referenced in a 
query by following a precedence rule as per the SQL standard. If the order is 
not specified by parentheses, set operations are performed from left to right 
with the exception that all INTERSECT operations are performed before any 
UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence 
to all the set operations are preserved under a newly added configuration 
`spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from 
left to right as they appear in the query given no explicit ordering is 
enforced by usage of parenthesis.
+
   - Since Spark 2.4, Spark will display table description column Last Access 
value as UNKNOWN when the value was Jan 01 1970.
+
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for 
ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
+
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` just 
failed when Arrow optimization is unable to be used whereas `createDataFrame` 
from Pandas DataFrame allowed the fallback to non-optimization. Now, both 
`toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by 
default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
+
   - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+
   - Since Spark 2.4, expression IDs in UDF arguments do not appear in column 
names. For example, 

spark git commit: [MINOR][DOC] Spacing items in migration guide for readability and consistency

2018-10-18 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 36307b1e4 -> 9ed2e4204


[MINOR][DOC] Spacing items in migration guide for readability and consistency

## What changes were proposed in this pull request?

Currently, the migration guide has no space between items, which looks too 
compact and is hard to read. Some items already had spaces between them in 
the migration guide. This PR formats them consistently for readability.

Before:

![screen shot 2018-10-18 at 10 00 04 
am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png)

After:

![screen shot 2018-10-18 at 9 53 55 
am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png)

## How was this patch tested?

Manually tested:

Closes #22761 from HyukjinKwon/minor-migration-doc.

Authored-by: hyukjinkwon 
Signed-off-by: hyukjinkwon 
(cherry picked from commit c8f7691c64a28174a54e8faa159b50a3836a7225)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9ed2e420
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9ed2e420
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9ed2e420

Branch: refs/heads/branch-2.4
Commit: 9ed2e42044a1105a1c8b5868dbb320b1b477bcf0
Parents: 36307b1
Author: hyukjinkwon 
Authored: Fri Oct 19 13:55:27 2018 +0800
Committer: hyukjinkwon 
Committed: Fri Oct 19 13:55:43 2018 +0800

--
 docs/sql-migration-guide-upgrade.md | 54 
 1 file changed, 54 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9ed2e420/docs/sql-migration-guide-upgrade.md
--
diff --git a/docs/sql-migration-guide-upgrade.md 
b/docs/sql-migration-guide-upgrade.md
index 3476aa8..062e07b 100644
--- a/docs/sql-migration-guide-upgrade.md
+++ b/docs/sql-migration-guide-upgrade.md
@@ -70,26 +70,47 @@ displayTitle: Spark SQL Upgrading Guide
   
 
   - Since Spark 2.4, when there is a struct field in front of the IN operator 
before a subquery, the inner query must contain a struct field as well. In 
previous versions, instead, the fields of the struct were compared to the 
output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 
2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a 
in (select 1, 'a' from range(1))` is not. In previous version it was the 
opposite.
+
   - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, 
then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became 
case-sensitive and would resolve to columns (unless typed in lower case). In 
Spark 2.4 this has been fixed and the functions are no longer case-sensitive.
+
   - Since Spark 2.4, Spark will evaluate the set operations referenced in a 
query by following a precedence rule as per the SQL standard. If the order is 
not specified by parentheses, set operations are performed from left to right 
with the exception that all INTERSECT operations are performed before any 
UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence 
to all the set operations are preserved under a newly added configuration 
`spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from 
left to right as they appear in the query given no explicit ordering is 
enforced by usage of parenthesis.
+
   - Since Spark 2.4, Spark will display table description column Last Access 
value as UNKNOWN when the value was Jan 01 1970.
+
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for 
ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
+
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` just 
failed when Arrow optimization is unable to be used whereas `createDataFrame` 
from Pandas DataFrame allowed the fallback to non-optimization. Now, both 
`toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by 
default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
+
   - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty datafra