[GitHub] spark pull request #22319: [SPARK-25044][SQL][followup] add back UserDefined...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22319#discussion_r215142933
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -129,3 +138,17 @@ case class UserDefinedFunction protected[sql] (
 }
   }
 }
+
+// We have to use a name different from `UserDefinedFunction` here, to avoid breaking the
+// binary compatibility of the auto-generated UserDefinedFunction object.
+private[sql] object SparkUserDefinedFunction {
--- End diff --

Oh, so we're working around it with a new object here. SGTM. I assumed there was mutable 
state and copying involved, but that's either trivial or not an issue.
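
A minimal sketch of the workaround being discussed, with simplified, illustrative signatures (the real `UserDefinedFunction` takes a `DataType` and input schemas; nothing below is Spark's actual API):

```scala
package org.example.sql

// Simplified stand-in for the case class whose auto-generated companion object
// must stay binary-compatible across releases.
case class UserDefinedFunction(f: AnyRef, dataType: String, inputTypes: Option[Seq[String]])

// Adding new factory methods to the auto-generated `UserDefinedFunction` companion would
// change the bytecode it exposes, so the extra factory lives under a different name.
private[sql] object SparkUserDefinedFunction {
  def create(f: AnyRef, dataType: String): UserDefinedFunction =
    UserDefinedFunction(f, dataType, inputTypes = None)
}
```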


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22319
  
**[Test build #95696 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95696/testReport)**
 for PR 22319 at commit 
[`9e060a4`](https://github.com/apache/spark/commit/9e060a4cc9360a0ebf59db05c0e1466b8e66b157).


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2857/
Test PASSed.


---




[GitHub] spark issue #21756: [SPARK-24764] [CORE] Add ServiceLoader implementation fo...

2018-09-04 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21756
  
I think the use case here is quite specific, and I'm not sure it is a good 
idea to make `SparkHadoopUtil` ServiceLoader-able to support your requirement. 
Typically I don't think users have such a requirement to build their own 
`SparkHadoopUtil`.

I'm wondering whether we have other solutions or workarounds to support your use 
case?
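
For context, a minimal sketch of the `ServiceLoader` mechanism being discussed, using a hypothetical trait (not Spark's actual API); a provider class would be listed in `META-INF/services/<fully.qualified.TraitName>` and discovered at runtime:

```scala
import java.util.ServiceLoader

import scala.collection.JavaConverters._

// Hypothetical extension point: SparkHadoopUtil is not actually pluggable this way today.
trait HadoopUtilProvider {
  def name: String
}

object HadoopUtilLoader {
  // Returns the first provider found on the classpath, if any.
  def load(): Option[HadoopUtilProvider] =
    ServiceLoader.load(classOf[HadoopUtilProvider]).asScala.headOption
}
```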


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22319
  
retest this please


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95691/
Test FAILed.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22319
  
**[Test build #95691 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95691/testReport)**
 for PR 22319 at commit 
[`9e060a4`](https://github.com/apache/spark/commit/9e060a4cc9360a0ebf59db05c0e1466b8e66b157).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-247...

2018-09-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22334


---




[GitHub] spark issue #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22334
  
thanks, merging to master!


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
That doesn't work for Java, if I remember the issue correctly.

On Tue, Sep 4, 2018, 10:31 PM Wenchen Fan  wrote:

> *@cloud-fan* commented on this pull request.
> --
>
> In
> 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
> :
>
> > + *
> + *   // Java
> + *   Dataset df = spark.read().format("image")
> + * .option("dropImageFailures", "true")
> + * .load("data/mllib/images/imagesWithPartitions");
> + * }}}
> + *
> + * IMAGE data source supports the following options:
> + *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
> + *
> + * @note This IMAGE data source does not support "write".
> + *
> + * @note This class is public for documentation purpose. Please don't 
use this class directly.
> + * Rather, use the data source API as illustrated above.
> + */
> +class ImageDataSource private() {}
>
> Is this a convention? AFAIK in the scala world we usually put document in
> package object.
>



---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22319
  
I've fixed all the compatibility issues. Is there something else we want to 
let users know?


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215140040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

Is this a convention? AFAIK in the Scala world we usually put the documentation in a 
package object.
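
For reference, a rough sketch of that convention with an illustrative package name (not the actual Spark source):

```scala
package org.example.ml.source

// Scaladoc attached to the package object documents the whole package, instead of a
// dummy public class that exists only to carry documentation.
/**
 * The `image` package implements the Spark SQL data source API for loading image
 * data as a `DataFrame`.
 */
package object image
```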


---




[GitHub] spark issue #17174: [SPARK-19145][SQL] Timestamp to String casting is slowin...

2018-09-04 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/17174
  
@tanejagagan Are you still working on this?


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22336
  
LGTM


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22336
  
I'm surprised this benchmark is written as a test suite.

I'm ok with this PR, but we should refactor this benchmark to use a `main` 
method, like `HashBenchmark`.
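
A rough sketch of that `main`-method style (the timing code and names are illustrative, not the actual `HashBenchmark` implementation):

```scala
// A benchmark written as a standalone object runs only when invoked explicitly
// (e.g. from the sbt shell), instead of on every test run.
object FilterPushdownBenchmarkRunner {

  private def runWorkload(): Unit = {
    // ... the benchmark body that previously lived inside a test(...) block ...
  }

  def main(args: Array[String]): Unit = {
    val start = System.nanoTime()
    runWorkload()
    println(s"Elapsed: ${(System.nanoTime() - start) / 1e6} ms")
  }
}
```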


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95695 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95695/testReport)**
 for PR 22328 at commit 
[`4d52754`](https://github.com/apache/spark/commit/4d527548bb7db3f2f3c1ddda206e552d223ca27b).


---




[GitHub] spark issue #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22334
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95689/
Test PASSed.


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22334
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2856/
Test PASSed.


---




[GitHub] spark issue #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22334
  
**[Test build #95689 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95689/testReport)**
 for PR 22334 at commit 
[`3d59df1`](https://github.com/apache/spark/commit/3d59df12fa407e998bbca9f83dd374e552849da6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215139263
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -567,6 +567,7 @@ object DataSource extends Logging {
 val parquet = classOf[ParquetFileFormat].getCanonicalName
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+val image = "org.apache.spark.ml.source.image.ImageFileFormat"
--- End diff --

We did it for libsvm for historical reasons. Since image is a new source, I 
don't think we have compatibility issues.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215139063
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
+  }
+
+  override def shortName(): String = "image"
+
+  override protected def buildReader(
+  sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
--- End diff --

Sample pushdown should be supported by data source v2 in the next release; we can 
migrate the image source to data source v2 at that time.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

For documentation, similar to `LibSVMDataSource`.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138931
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

why do we need this class?


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138889
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -567,6 +567,7 @@ object DataSource extends Logging {
 val parquet = classOf[ParquetFileFormat].getCanonicalName
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+val image = "org.apache.spark.ml.source.image.ImageFileFormat"
--- End diff --

similar to `libsvm`.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138862
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
+val images = spark.read.format("image").option("dropImageFailures", 
"true")
+  .load(imagePath + "/cls=multichannel/").collect()
+
+val firstBytes20Map = images.map { rrow =>
+  val row = rrow.getAs[Row]("image")
+  val filename = Paths.get(getOrigin(row)).getFileName().toString()
+  val mode = getMode(row)
+  val bytes20 = getData(row).slice(0, 20).toList
+  filename -> Tuple2(mode, bytes20)
--- End diff --

Yeah, `(mode, bytes20)` alone doesn't work: `filename -> (mode, bytes20)` is 
compiled as `filename.->(mode, bytes20)`, so `->` receives two arguments and a 
compile error occurs.
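
A small illustration of the parsing issue described above (values are made up):

```scala
val filename = "grayscale.jpg"
val mode = 0
val bytes20 = List[Byte](1, 2, 3)

// `filename -> (mode, bytes20)` is parsed as `filename.->(mode, bytes20)`, i.e. `->` is
// called with two arguments; as noted above, that fails to compile in this build.
// Building the pair explicitly avoids the problem:
val entry = filename -> Tuple2(mode, bytes20)
// An extra pair of parentheses works as well:
val entry2 = filename -> ((mode, bytes20))
```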


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138728
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
--- End diff --

ditto.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138711
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
--- End diff --

The `option` API requires `(k: String, v: String)` parameters.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138305
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -29,7 +29,7 @@ package org.apache.spark.ml.source.image
  *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
  *
  * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
- * optionally specify options, for example:
+ * optionally specify the datasource options, for example:
--- End diff --

s/datasource/data source


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138635
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+private[image] class ImageOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
--- End diff --

Why is `false` a String and not a boolean?


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138476
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -45,6 +45,8 @@ package org.apache.spark.ml.source.image
  * IMAGE data source supports the following options:
  *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
  *
+ * @note This IMAGE data source does not support "write".
--- End diff --

s/"write"/saving images to a file(s)/ ?


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95694 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95694/testReport)**
 for PR 22328 at commit 
[`bd6178c`](https://github.com/apache/spark/commit/bd6178c0bff7c5ffade1dce61894191f63a50976).


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22336
  
**[Test build #95693 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95693/testReport)**
 for PR 22336 at commit 
[`69f207f`](https://github.com/apache/spark/commit/69f207f8a4531435c4a8df790780557968a33bb1).


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2855/
Test PASSed.


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22336
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2854/
Test PASSed.


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22336
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138174
  
--- Diff: data/mllib/images/images/license.txt ---
@@ -0,0 +1,13 @@
+The images in the folder "kittens" are under the creative commons CC0 
license, or no rights reserved:
--- End diff --

No worries, there are only a very few images. And keeping the old test cases unchanged 
will help this PR get merged ASAP.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22319
  
Shall we update the migration guide about the compatibility?


---




[GitHub] spark issue #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` i...

2018-09-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22336
  
Sorry, @cloud-fan. I forgot to turn off the benchmark test in the previous 
PR. We need to disable it like the other micro-benchmark tests.
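
A minimal sketch of the change (using a plain ScalaTest 3.0-style suite for illustration; the real suite extends Spark's own test base classes):

```scala
import org.scalatest.FunSuite

class FilterPushdownBenchmarkLikeSuite extends FunSuite {
  // Before: test("Pushdown benchmark with many filters") { ... }
  // After: the same body registered with `ignore`, so Jenkins reports it as IGNORED
  // instead of spending minutes running it.
  ignore("Pushdown benchmark with many filters") {
    // ... long-running benchmark body ...
  }
}
```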


---




[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22234
  
This is rather a corner case (see the cases elaborated in the JIRA 
[SPARK-17916](https://issues.apache.org/jira/browse/SPARK-17916)), and there's 
ambiguity about whether to treat this as a bug or a proper behaviour change; however, I don't 
object if it is considered worth mentioning.

cc @MaxGekk for a followup


---




[GitHub] spark pull request #22336: [SPARK-25306][SQL][FOLLOWUP] Change `test` to `ig...

2018-09-04 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/22336

[SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` in 
FilterPushdownBenchmark

## What changes were proposed in this pull request?

This is a follow-up of #22313 and aim to ignore the micro benchmark test 
which takes over 2 minutes in Jenkins.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4939/consoleFull

## How was this patch tested?

The test case should be ignored in Jenkins.
```
[info] FilterPushdownBenchmark:
...
[info] - Pushdown benchmark with many filters !!! IGNORED !!!
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-25306-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22336.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22336


commit 69f207f8a4531435c4a8df790780557968a33bb1
Author: Dongjoon Hyun 
Date:   2018-09-05T05:01:18Z

[SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` in 
FilterPushdownBenchmark




---




[GitHub] spark issue #22335: [SPARK-25091][SQL] reduce the storage memory in Executor...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22335
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
@mhamilton723 I thought about that option too. Loading general binary files 
is a useful feature but I don't feel it is necessary to pull it into the 
current scope. No matter whether the image data source has its own 
implementation or builds on top of the binary data source, I expect users to use

~~~scala
spark.read.format("image").load("...")
~~~

to read images instead of something like:

~~~scala
spark.read.format("binary").load("...").withColumn("image", 
decode($"binary"))
~~~

So we can definitely add a binary file data source later and swap the 
implementation without changing the public interface. But we don't need to 
block this PR from getting into 2.4, which will be cut soon.

Sounds good?


---




[GitHub] spark issue #22335: [SPARK-25091][SQL] reduce the storage memory in Executor...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22335
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22335: [SPARK-25091][SQL] reduce the storage memory in Executor...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22335
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...

2018-09-04 Thread cfangplus
GitHub user cfangplus opened a pull request:

https://github.com/apache/spark/pull/22335

[SPARK-25091][SQL] reduce the storage memory in Executor Tab when …

…unpersist rdd
 @zsxwing 
@vanzin 
@attilapiros 
## What changes were proposed in this pull request?
This is a UI issue. When we unpersist an RDD or run UNCACHE TABLE, the 
storage memory size in the Executors tab is not reduced. So I modified the code as 
follows:
1. BlockManager sends the block status to BlockManagerMaster after it removes the block.
2. AppStatusListener receives the block-removed status and updates the LiveRDD and 
LiveExecutor info, so the storage memory size and RDD blocks in the Executors tab 
change when we unpersist an RDD.

## How was this patch tested?
Cache a table and then unpersist it. Watch the web UI for both the Storage and 
Executors tabs.

![storage-tab-before](https://user-images.githubusercontent.com/42287875/45072023-b721c380-b10b-11e8-9e03-1b87e899ec6f.png)

![executor-tab-before](https://user-images.githubusercontent.com/42287875/45072024-b7ba5a00-b10b-11e8-83da-b959f498ea10.png)

![storage-tab-after](https://user-images.githubusercontent.com/42287875/45072025-b7ba5a00-b10b-11e8-80ee-0b62969c0d17.png)

![executor-tab-after](https://user-images.githubusercontent.com/42287875/45072027-b852f080-b10b-11e8-9b0f-3a6ad551492b.png)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cfangplus/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22335.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22335


commit 5955fa6c6daa3d345046fc1e1fdc6e7eb73cd5e2
Author: cfangplus 
Date:   2018-09-04T13:14:12Z

[SPARK-25091][WEB UI] reduce the storage memory in Executor Tab when 
unpersist rdd




---




[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-09-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22234
  
Have we documented the behavior changes in the migration guide? If not, can 
we do it?


---




[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215135665
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
--- End diff --

The latter "options" refers to "datasource options", which is the widely used term, 
so I prefer to change it to "optionally specify the datasource options".


---




[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22138
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22138
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95688/
Test PASSed.


---




[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22138
  
**[Test build #95688 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95688/testReport)**
 for PR 22138 at commit 
[`9685cc5`](https://github.com/apache/spark/commit/9685cc59255458c41d293da4117954cc653fab28).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22313
  
Also, thank you for the review, @xuanyuanking, @kiszk, @viirya, @HyukjinKwon.


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22320
  
**[Test build #95692 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95692/testReport)**
 for PR 22320 at commit 
[`4590c98`](https://github.com/apache/spark/commit/4590c9837026e820d7d91300a7ab3f87a668755c).


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22320
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2853/
Test PASSed.


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22320
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

2018-09-04 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22320#discussion_r215128076
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -56,7 +56,7 @@ case class InsertIntoHadoopFsRelationCommand(
 mode: SaveMode,
 catalogTable: Option[CatalogTable],
 fileIndex: Option[FileIndex],
-outputColumns: Seq[Attribute])
+outputColumnNames: Seq[String])
   extends DataWritingCommand {
   import 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName
 
--- End diff --

Oh, then we can use this method instead.
```
def checkColumnNameDuplication(
    columnNames: Seq[String], colType: String, caseSensitiveAnalysis: Boolean): Unit
```
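
A hedged usage sketch, assuming the method quoted above is `SchemaUtils.checkColumnNameDuplication` from Spark's own codebase (it is internal, not a public API; the column names and error-message fragment below are illustrative):

```scala
import org.apache.spark.sql.util.SchemaUtils

val outputColumnNames = Seq("id", "value", "ID")
// Throws an AnalysisException when duplicates exist under the given case sensitivity.
SchemaUtils.checkColumnNameDuplication(
  outputColumnNames, "in the output columns", caseSensitiveAnalysis = false)
```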



---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22298
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95686/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22298
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22298
  
**[Test build #95686 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95686/testReport)**
 for PR 22298 at commit 
[`7dc26ce`](https://github.com/apache/spark/commit/7dc26ce7a06715a90950e57ee5519979fef4fc27).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22299: [SPARK-24748][SS][FOLLOWUP] Switch custom metrics to Uns...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22299
  
Let's close this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22296: [SPARK-24748][SS][FOLLOWUP] Switch custom metrics...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/22296


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22332: [SPARK-25333][SQL] Ability add new columns in Dataset in...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22332
  
Can't we simply `select` after the column is added? I wouldn't add this 
either - it can look confusing, to be honest, IMO.
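
For illustration, a minimal sketch of the suggested alternative; the column names and the `df` DataFrame are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Add the column with withColumn, then use select to control where it
// appears in the output, instead of adding a positional withColumn variant.
val result = df
  .withColumn("c", col("a") + col("b"))
  .select("a", "c", "b")
```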


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22329
  
cc @gatorsmile and @BryanCutler 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it'...

2018-09-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22333#discussion_r215124159
  
--- Diff: build/mvn ---
@@ -91,15 +92,23 @@ install_mvn() {
 
 # Install zinc under the build/ folder
 install_zinc() {
-  local zinc_path="zinc-0.3.15/bin/zinc"
-  [ ! -f "${_DIR}/${zinc_path}" ] && ZINC_INSTALL_FLAG=1
-  local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}
+  ZINC_VERSION=0.3.15
+  ZINC_BIN="$(command -v zinc)"
+  if [ "$ZINC_BIN" ]; then
+local ZINC_DETECTED_VERSION="$(zinc -version | head -n1 | awk '{print 
$5}')"
+  fi
 
-  install_app \
-"${TYPESAFE_MIRROR}/zinc/0.3.15" \
-"zinc-0.3.15.tgz" \
-"${zinc_path}"
-  ZINC_BIN="${_DIR}/${zinc_path}"
+  if [ $(version $ZINC_DETECTED_VERSION) -lt $(version $ZINC_VERSION) ]; 
then
--- End diff --

Thank you for the review, @srowen. Yes, we use the same logic for Maven.
If no zinc was detected, it is evaluated as `0 -lt 03015`, and 
eventually `false`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22227#discussion_r215124064
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2546,15 +2546,39 @@ object functions {
   def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
 
   /**
-   * Splits str around pattern (pattern is a regular expression).
+   * Splits str around matches of the given regex.
*
-   * @note Pattern is a string representation of the regular expression.
+   * @param str a string expression to split
+   * @param regex a string representing a regular expression. The regex 
string should be
+   *  a Java regular expression.
*
* @group string_funcs
* @since 1.5.0
*/
-  def split(str: Column, pattern: String): Column = withExpr {
-StringSplit(str.expr, lit(pattern).expr)
+  def split(str: Column, regex: String): Column = withExpr {
--- End diff --

Yea, I don't think we should change the name in this case, since either name 
makes sense in a way.
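
For context, a rough sketch of the API under discussion; the three-argument overload is an assumption based on the PR title, not the final signature, and `df` is a hypothetical DataFrame with a string column `value`:

```scala
import org.apache.spark.sql.functions.{col, split}

// Existing two-argument form: split `value` around matches of a Java regex.
val allParts = df.select(split(col("value"), ","))

// Hypothetical three-argument form proposed by this PR: at most `limit`
// elements, with the remainder kept in the last element.
val firstTwo = df.select(split(col("value"), ",", 2))
```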


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22227#discussion_r215123978
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java 
---
@@ -394,12 +394,14 @@ public void substringSQL() {
 
   @Test
   public void split() {
-
assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), -1),
-  new UTF8String[]{fromString("ab"), fromString("def"), 
fromString("ghi")}));
-
assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
-  new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
-
assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
-  new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
+UTF8String[] negativeAndZeroLimitCase =
+new UTF8String[]{fromString("ab"), fromString("def"), 
fromString("ghi"), fromString("")};
+
assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), 0),
+negativeAndZeroLimitCase));
+
assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), -1),
--- End diff --

Let's fix the indentation to show less diff.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22234
  
From my understanding, yea. The problem here sounds like ambiguity in 
empty strings, since they can be interpreted both as empty strings and as `null`. 
To me, this is actually rather a bug, since we can't distinguish them and we 
don't respect the empty value. If empty strings are written, they should be read 
back as empty strings.

This PR proposes the ability to explicitly set the empty value, to work around 
the behaviour change.
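
A minimal sketch of how the proposed option might be used; the `emptyValue` option name and its read/write semantics are taken from the PR under review, not a released API, and the path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", ""), ("b", "x")).toDF("k", "v")

// Hypothetical: control the marker written out for empty strings...
df.write.option("emptyValue", "\"\"").csv("/tmp/empty-value-demo")

// ...and tell the reader to map that marker back to "" instead of null.
val roundTripped = spark.read
  .option("emptyValue", "")
  .csv("/tmp/empty-value-demo")
```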




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...

2018-09-04 Thread Dooyoung-Hwang
Github user Dooyoung-Hwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22219#discussion_r215122865
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3237,6 +3238,28 @@ class Dataset[T] private[sql](
 files.toSet.toArray
   }
 
+  /**
+   * Returns the tuple of the row count and an SeqView that contains all 
rows in this Dataset.
+   *
+   * The SeqView will consume as much memory as the total size of 
serialized results which can be
+   * limited with the config 'spark.driver.maxResultSize'. Rows are 
deserialized when iterating rows
+   * with iterator of returned SeqView. Whether to collect all 
deserialized rows or to iterate them
+   * incrementally can be decided with considering total rows count and 
driver memory.
+   */
+  private[sql] def collectCountAndSeqView(): (Long, SeqView[T, Array[T]]) =
+withAction("collectCountAndSeqView", queryExecution) { plan =>
+  // This projection writes output to a `InternalRow`, which means 
applying this projection is
+  // not thread-safe. Here we create the projection inside this method 
to make `Dataset`
+  // thread-safe.
+  val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
+  val (totalRowCount, internalRowsView) = plan.executeCollectSeqView()
+  (totalRowCount, internalRowsView.map { row =>
+// The row returned by SafeProjection is `SpecificInternalRow`, 
which ignore the data type
+// parameter of its `get` method, so it's safe to use null here.
+objProj(row).get(0, null).asInstanceOf[T]
+  }.asInstanceOf[SeqView[T, Array[T]]])
+}
--- End diff --

Ok, so you don't prefer adding a function to Dataset. If withAction & 
deserializer are changed to private[sql], this implementation can be moved out. 
But is this function useful for other SQL servers that want to reduce the memory 
usage of query execution? I don't think moving it out looks good, because the 
Projection would then be created outside of Dataset.
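
For illustration only, a sketch of how a caller such as the Thrift server might consume the proposed pair; `ds`, `sendBatch`, and the 10,000-row threshold are hypothetical:

```scala
import org.apache.spark.sql.Row

// Placeholder for pushing a batch of rows to the client (hypothetical).
def sendBatch(rows: Seq[Row]): Unit = rows.foreach(println)

// `ds` is assumed to be an existing Dataset[Row].
val (rowCount, rowsView) = ds.collectCountAndSeqView()

if (rowCount <= 10000L) {
  // Small result: deserialize everything at once.
  sendBatch(rowsView.force.toSeq)
} else {
  // Large result: deserialize lazily in fixed-size batches to cap driver memory.
  rowsView.iterator.grouped(1000).foreach(sendBatch)
}
```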


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22306: [SPARK-25300][CORE]Unified the configuration para...

2018-09-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22306


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22306
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive...

2018-09-04 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22183


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22313
  
Thank you, @cloud-fan . Sure. I'll update them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22329
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22329
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95690/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22329
  
**[Test build #95690 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95690/testReport)**
 for PR 22329 at commit 
[`2ad350c`](https://github.com/apache/spark/commit/2ad350c79bd2004282a43d6d189a828cad54cc60).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95681/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22313


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95681 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95681/testReport)**
 for PR 22192 at commit 
[`5a2852f`](https://github.com/apache/spark/commit/5a2852fb587278ab6a81c94de4226306a2e0da70).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22313
  
@dongjoon-hyun please also update the title of the JIRA ticket, thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22313
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22319
  
**[Test build #95691 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95691/testReport)**
 for PR 22319 at commit 
[`9e060a4`](https://github.com/apache/spark/commit/9e060a4cc9360a0ebf59db05c0e1466b8e66b157).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2852/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95685/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95685 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95685/testReport)**
 for PR 22313 at commit 
[`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-09-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
BTW I think this is probably SPIP-worthy. At the very least we should write 
a design doc on this, similar to the other docs for dsv2 sub-components. We 
should really think about whether it'd be possible to unify the three modes 
(batch, microbatch streaming, CP).



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mhamilton723
Github user mhamilton723 commented on the issue:

https://github.com/apache/spark/pull/22328
  
@WeichenXu123, awesome work! I have not had a chance to go through this in 
depth, but I did this in the originating project, [MMLSpark](www.aka.ms/spark), 
a while back and have been meaning to ship it up to Spark Core:

https://github.com/Azure/mmlspark/tree/master/src/io/image

We built ours on the same functionality for Binary Files and it might be 
wise to do the same here if you have not already. The Binary File API is a bit 
more general as it supports reading many different kinds of files. 

https://github.com/Azure/mmlspark/tree/master/src/io/binary
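
As a rough usage sketch, assuming the data source proposed here registers under the `image` format name and exposes the nested `image` struct described in SPARK-22666; the `dropInvalid` option, the path, and the column names are assumptions:

```scala
// Hypothetical usage; `spark` is an existing SparkSession.
val images = spark.read
  .format("image")
  .option("dropInvalid", "true")   // assumption: skip files that fail to decode
  .load("/path/to/images")

images.select("image.origin", "image.width", "image.height").show()
```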


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22329
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2851/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22329
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22329
  
**[Test build #95690 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95690/testReport)**
 for PR 22329 at commit 
[`2ad350c`](https://github.com/apache/spark/commit/2ad350c79bd2004282a43d6d189a828cad54cc60).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...

2018-09-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22306
  
cc @gatorsmile @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...

2018-09-04 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22219#discussion_r215115685
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3237,6 +3238,28 @@ class Dataset[T] private[sql](
 files.toSet.toArray
   }
 
+  /**
+   * Returns the tuple of the row count and an SeqView that contains all 
rows in this Dataset.
+   *
+   * The SeqView will consume as much memory as the total size of 
serialized results which can be
+   * limited with the config 'spark.driver.maxResultSize'. Rows are 
deserialized when iterating rows
+   * with iterator of returned SeqView. Whether to collect all 
deserialized rows or to iterate them
+   * incrementally can be decided with considering total rows count and 
driver memory.
+   */
+  private[sql] def collectCountAndSeqView(): (Long, SeqView[T, Array[T]]) =
+withAction("collectCountAndSeqView", queryExecution) { plan =>
+  // This projection writes output to a `InternalRow`, which means 
applying this projection is
+  // not thread-safe. Here we create the projection inside this method 
to make `Dataset`
+  // thread-safe.
+  val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
+  val (totalRowCount, internalRowsView) = plan.executeCollectSeqView()
+  (totalRowCount, internalRowsView.map { row =>
+// The row returned by SafeProjection is `SpecificInternalRow`, 
which ignore the data type
+// parameter of its `get` method, so it's safe to use null here.
+objProj(row).get(0, null).asInstanceOf[T]
+  }.asInstanceOf[SeqView[T, Array[T]]])
+}
--- End diff --

How about changing `private` to `private[sql]`, then implementing this based on 
the deserializer?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it'...

2018-09-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22333#discussion_r215115678
  
--- Diff: build/mvn ---
@@ -91,15 +92,23 @@ install_mvn() {
 
 # Install zinc under the build/ folder
 install_zinc() {
-  local zinc_path="zinc-0.3.15/bin/zinc"
-  [ ! -f "${_DIR}/${zinc_path}" ] && ZINC_INSTALL_FLAG=1
-  local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}
+  ZINC_VERSION=0.3.15
+  ZINC_BIN="$(command -v zinc)"
+  if [ "$ZINC_BIN" ]; then
+local ZINC_DETECTED_VERSION="$(zinc -version | head -n1 | awk '{print 
$5}')"
+  fi
 
-  install_app \
-"${TYPESAFE_MIRROR}/zinc/0.3.15" \
-"zinc-0.3.15.tgz" \
-"${zinc_path}"
-  ZINC_BIN="${_DIR}/${zinc_path}"
+  if [ $(version $ZINC_DETECTED_VERSION) -lt $(version $ZINC_VERSION) ]; 
then
--- End diff --

I assume this evaluates to false if no zinc was detected; that seems to be 
how the maven logic above works. OK. I also would like to use my local zinc if 
available.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22334: [SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22334
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


