[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22328


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215322021
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represents the origin of the image.
+ *If loaded from files, then it is the file path)
+ *  - height: Int (height of the image)
+ *  - width: Int (width of the image)
+ *  - nChannels: Int (number of the image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the data source options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", true)
+ * .load("data/mllib/images/partitioned")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", true)
+ * .load("data/mllib/images/partitioned");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
--- End diff --

How about changing `dropImageFailures` to `dropInvalid`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215322673
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+private[image] class ImageOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
--- End diff --

Should add ScalaDoc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215320762
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
--- End diff --

"IMAGE" doesn't need to be all uppercase. Just say "loading images".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215321353
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represents the origin of the image.
+ *If loaded from files, then it is the file path)
+ *  - height: Int (height of the image)
+ *  - width: Int (width of the image)
+ *  - nChannels: Int (number of the image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
--- End diff --

ditto on "IMAGE"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215320923
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represents the origin of the image.
+ *If loaded from files, then it is the file path)
--- End diff --

does it always load from files?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215323149
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
+val images = spark.read.format("image").option("dropImageFailures", 
"true")
+  .load(imagePath + "/cls=multichannel/").collect()
+
+val firstBytes20Map = images.map { rrow =>
+  val row = rrow.getAs[Row]("image")
+  val filename = Paths.get(getOrigin(row)).getFileName().toString()
+  val mode = getMode(row)
+  val bytes20 = getData(row).slice(0, 20).toList
+  filename -> Tuple2(mode, bytes20)
--- End diff --

nit: It is useful to leave an inline comment here:)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215216011
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
--- End diff --

Really? What about `option(key: String, value: Boolean): DataFrameReader` 
then? There are more --> 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215200249
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+private[image] class ImageOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
--- End diff --

because `parameters` is `Map[String, String]` type.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215179601
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

Re: @cloud-fan The Scala package doc doesn't work for Java, which requires 
a different format.

Re: @HyukjinKwon It would be nice to have some doc in the site, though I 
didn't find the list of built-in data sources in the doc site. I think it is 
okay to have docs in both locations for IDE users and for people search on the 
web.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215144730
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

Can we just simply remove this and make a followup for the doc in the site 
.. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215140040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

Is this a convention? AFAIK in the scala world we usually put document in 
package object.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215139263
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -567,6 +567,7 @@ object DataSource extends Logging {
 val parquet = classOf[ParquetFileFormat].getCanonicalName
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+val image = "org.apache.spark.ml.source.image.ImageFileFormat"
--- End diff --

We did it for libsvm for historical reasons. Since image is a new source, I 
don't think we have compatibility issues.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215139063
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
+  }
+
+  override def shortName(): String = "image"
+
+  override protected def buildReader(
+  sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
--- End diff --

sample pushdown should be supported by data source v2 in the next release, 
then we can migrate the image source to data source v2 at that time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

for doc.
similar to `LibSVMDataSource`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138931
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify the datasource options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This IMAGE data source does not support "write".
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

why do we need this class?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138889
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -567,6 +567,7 @@ object DataSource extends Logging {
 val parquet = classOf[ParquetFileFormat].getCanonicalName
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+val image = "org.apache.spark.ml.source.image.ImageFileFormat"
--- End diff --

similar to `libsvm`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138862
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
+val images = spark.read.format("image").option("dropImageFailures", 
"true")
+  .load(imagePath + "/cls=multichannel/").collect()
+
+val firstBytes20Map = images.map { rrow =>
+  val row = rrow.getAs[Row]("image")
+  val filename = Paths.get(getOrigin(row)).getFileName().toString()
+  val mode = getMode(row)
+  val bytes20 = getData(row).slice(0, 20).toList
+  filename -> Tuple2(mode, bytes20)
--- End diff --

yea, `(mode, bytes20)` doesn't work, `filename -> (mode, bytes20)` will be 
compiled as `filename.->(mode, bytes20)` and `->` receive 2 arguments and 
compile error occurs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138728
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138711
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
--- End diff --

option API require (k: String, v:String) parameters.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138305
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -29,7 +29,7 @@ package org.apache.spark.ml.source.image
  *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
  *
  * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
- * optionally specify options, for example:
+ * optionally specify the datasource options, for example:
--- End diff --

s/datasource/data source


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138635
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+private[image] class ImageOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
--- End diff --

Why `false` is a String not a boolean?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138476
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -45,6 +45,8 @@ package org.apache.spark.ml.source.image
  * IMAGE data source supports the following options:
  *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
  *
+ * @note This IMAGE data source does not support "write".
--- End diff --

s/"write"/saving images to a file(s)/ ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215138174
  
--- Diff: data/mllib/images/images/license.txt ---
@@ -0,0 +1,13 @@
+The images in the folder "kittens" are under the creative commons CC0 
license, or no rights reserved:
--- End diff --

No worry. Only very few images. And keep old testcase not changed will help 
this PR get merged ASAP.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215135665
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
--- End diff --

The latter "options" is "datasource options", it is the widely used term.
So I prefer to change to "optionally specify the datasource options"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215038606
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
+val images = spark.read.format("image").option("dropImageFailures", 
"true")
+  .load(imagePath + "/cls=multichannel/").collect()
+
+val firstBytes20Map = images.map { rrow =>
+  val row = rrow.getAs[Row]("image")
+  val filename = Paths.get(getOrigin(row)).getFileName().toString()
+  val mode = getMode(row)
+  val bytes20 = getData(row).slice(0, 20).toList
+  filename -> Tuple2(mode, bytes20)
--- End diff --

Why is `Tuple2` required here? Wouldn't `(mode, bytes20)` work here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215037240
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
--- End diff --

New line after `job: Job`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215039097
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -567,6 +567,7 @@ object DataSource extends Logging {
 val parquet = classOf[ParquetFileFormat].getCanonicalName
 val csv = classOf[CSVFileFormat].getCanonicalName
 val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+val image = "org.apache.spark.ml.source.image.ImageFileFormat"
--- End diff --

Why is this needed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215036263
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
--- End diff --

"represents" + "the image". I can see many missing `a`s and `the`s in the 
description :(


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215037968
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
--- End diff --

`true` as a boolean value, please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215036643
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
--- End diff --

`true` as  a boolean value, please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r215000161
  
--- Diff: data/mllib/images/images/license.txt ---
@@ -0,0 +1,13 @@
+The images in the folder "kittens" are under the creative commons CC0 
license, or no rights reserved:
--- End diff --

update to comment above:
I don't think you need duplicate images in the PR - in the old tests you 
can just specify the path up to the cls/date folder. There might be a few minor 
changes to the tests but I think that would be a better strategy than to have 
duplicate images in source.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214981991
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
+  }
+
+  override def shortName(): String = "image"
+
+  override protected def buildReader(
+  sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
--- End diff --

It won't be addressed in this PR. The best way to support it is to allow 
data source handle sampling operation. cc @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214975055
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
+  }
+
+  override def shortName(): String = "image"
+
+  override protected def buildReader(
+  sparkSession: SparkSession,
+  dataSchema: StructType,
+  partitionSchema: StructType,
+  requiredSchema: StructType,
+  filters: Seq[Filter],
+  options: Map[String, String],
+  hadoopConf: Configuration): (PartitionedFile) => 
Iterator[InternalRow] = {
--- End diff --

should the sampling option be ported as well?  It seemed like an important 
option in case users didn't want to load all images.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214973718
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
--- End diff --

tiny nit: remove newline (or make all tests consistent in terms of 
formatting)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214973111
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
+  }
+
+  override def shortName(): String = "image"
+
+  override protected def buildReader(
--- End diff --

hmm, is there any way we could combine the two apis?  I don't like having 
to support two different implementations.  Or, what is the issue that is 
blocking us from combining them?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214970271
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
--- End diff --

tiny nit: although it makes sense, "optionally specify options" is a bit 
confusing, maybe "optionally specify arguments" or just "specify options"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214969396
  
--- Diff: data/mllib/images/images/license.txt ---
@@ -0,0 +1,13 @@
+The images in the folder "kittens" are under the creative commons CC0 
license, or no rights reserved:
--- End diff --

not sure why this was added to images/images folder, also strange that it 
wasn't moved


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214971542
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
--- End diff --

this is very cool!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214969542
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala ---
@@ -28,7 +28,7 @@ import org.apache.spark.sql.types._
 
 class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext {
   // Single column of images named "image"
-  private lazy val imagePath = "../data/mllib/images"
+  private lazy val imagePath = "../data/mllib/images/images"
--- End diff --

"images/images" is confusing. Call it `images/origin` and 
`images/partitioned`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214967994
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
--- End diff --

Should mention it doesn't support write.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214969782
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala
 ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import java.nio.file.Paths
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.functions.{col, substring_index}
+
+class ImageFileFormatSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images/imagesWithPartitions"
+
+  test("image datasource count test") {
+val df1 = spark.read.format("image").load(imagePath)
+assert(df1.count === 9)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"true").load(imagePath)
+assert(df2.count === 8)
+  }
+
+  test("image datasource test: read jpg image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=kittens/date=2018-02/DP153539.jpg")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read png image") {
+val df = spark.read.format("image").load(imagePath + 
"/cls=multichannel/date=2018-01/BGRA.png")
+assert(df.count() === 1)
+  }
+
+  test("image datasource test: read non image") {
+val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt"
+val df = spark.read.format("image").option("dropImageFailures", "true")
+  .load(filePath)
+assert(df.count() === 0)
+
+val df2 = spark.read.format("image").option("dropImageFailures", 
"false")
+  .load(filePath)
+assert(df2.count() === 1)
+val result = df2.head()
+assert(result === invalidImageRow(
+  Paths.get(filePath).toAbsolutePath().normalize().toUri().toString))
+  }
+
+  test("image datasource partition test") {
+val result = spark.read.format("image")
+  .option("dropImageFailures", "true").load(imagePath)
+  .select(substring_index(col("image.origin"), "/", -1).as("origin"), 
col("cls"), col("date"))
+  .collect()
+
+assert(Set(result: _*) === Set(
+  Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"),
+  Row("54893.jpg", "kittens", "2018-02"),
+  Row("DP153539.jpg", "kittens", "2018-02"),
+  Row("DP802813.jpg", "kittens", "2018-02"),
+  Row("BGRA.png", "multichannel", "2018-01"),
+  Row("BGRA_alpha_60.png", "multichannel", "2018-01"),
+  Row("chr30.4.184.jpg", "multichannel", "2018-02"),
+  Row("grayscale.jpg", "multichannel", "2018-02")
+))
+  }
+
+  // Images with the different number of channels
+  test("readImages pixel values test") {
+
+val images = spark.read.format("image").option("dropImageFailures", 
"true")
+  .load(imagePath + "/cls=multichannel/").collect()
+
+val firstBytes20Map = images.map { rrow =>
+  val row = rrow.getAs[Row]("image")
+  val filename = Paths.get(getOrigin(row)).getFileName().toString()
+  val mode = getMode(row)
+  val bytes20 = getData(row).slice(0, 20).toList
+  filename -> Tuple2(mode, bytes20)
+}.toMap
--- End diff --

use Set instead of Map


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214967452
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
--- End diff --

I didn't see a section in the doc that lists all built-in data sources. It 
would be nice if we create a section and link it to this API doc. I think we 
can do it with a follow-up PR. I want to see if we can get this PR merged 
before branch cut:)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214968664
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
--- End diff --

The error message is user-facing and users do not know `prepareWrite`. So 
just say "Write is not supported"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214861273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
--- End diff --

@WeichenXu123, don't we plan to make a documentation in the site?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214860618
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
+@transient private val parameters: CaseInsensitiveMap[String]) extends 
Serializable {
+
+  def this(parameters: Map[String, String]) = 
this(CaseInsensitiveMap(parameters))
+
+  val dropImageFailures = parameters.getOrElse("dropImageFailures", 
"false").toBoolean
+}
+
+private[image] class ImageFileFormat extends FileFormat with 
DataSourceRegister {
+
+  override def inferSchema(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  files: Seq[FileStatus]): Option[StructType] = 
Some(ImageSchema.imageSchema)
+
+  override def prepareWrite(
+  sparkSession: SparkSession,
+  job: Job, options: Map[String, String],
+  dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException(
+  s"prepareWrite is not supported for image data source")
--- End diff --

tiny nit: `s` can be removed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214859773
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -591,6 +592,8 @@ object DataSource extends Logging {
   "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
   "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
   "org.apache.spark.ml.source.libsvm" -> libsvm,
+  "org.apache.spark.ml.source.image.ImageFileFormat.DefaultSource" -> 
image,
--- End diff --

Since this is new datasource, I think we wouldn't probably need this in 
backword compatibility map.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214859244
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+import com.google.common.io.{ByteStreams, Closeables}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+
+import org.apache.spark.ml.image.ImageSchema
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.RowEncoder
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
UnsafeRow}
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, 
OutputWriterFactory, PartitionedFile}
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.SerializableConfiguration
+
+
+private[image] class ImageFileFormatOptions(
--- End diff --

Shall we make another file for this for consistency?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214859128
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.source.image
+
+/**
+ * `image` package implements Spark SQL data source API for loading IMAGE 
data as `DataFrame`.
+ * The loaded `DataFrame` has one `StructType` column: `image`.
+ * The schema of the `image` column is:
+ *  - origin: String (represent the origin of image. If loaded from file, 
then it is file path)
+ *  - height: Int (height of image)
+ *  - width: Int (width of image)
+ *  - nChannels: Int (number of image channels)
+ *  - mode: Int (OpenCV-compatible type)
+ *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise 
BGR in most cases)
+ *
+ * To use IMAGE data source, you need to set "image" as the format in 
`DataFrameReader` and
+ * optionally specify options, for example:
+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions")
+ *
+ *   // Java
+ *   Dataset df = spark.read().format("image")
+ * .option("dropImageFailures", "true")
+ * .load("data/mllib/images/imagesWithPartitions");
+ * }}}
+ *
+ * IMAGE data source supports the following options:
+ *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
+ *
+ * @note This class is public for documentation purpose. Please don't use 
this class directly.
+ * Rather, use the data source API as illustrated above.
+ */
+class ImageDataSource private() {}
--- End diff --

newline at the end - technically this should be caught by scalastyle.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22328#discussion_r214858990
  
--- Diff: 
mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
 ---
@@ -1 +1,2 @@
 org.apache.spark.ml.source.libsvm.LibSVMFileFormat
+org.apache.spark.ml.source.image.ImageFileFormat
--- End diff --

tiny nit: newline at the end - this might be caught by maven's checkstyle 
(https://github.com/apache/spark/pull/21801)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...

2018-09-04 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/22328

[SPARK-22666][ML][SQL] Spark datasource for image format

## What changes were proposed in this pull request?

Implement an image schema datasource.

This image datasource support:
  - partition discovery (loading partitioned images)
  - dropImageFailures (the same behavior with `ImageSchema.readImage`)
  - path wildcard matching (the same behavior with `ImageSchema.readImage`)
  - loading recursively from directory (different from 
`ImageSchema.readImage`, but use such path: `/path/to/dir/**`)

This datasource **NOT** support:
  - specify `numPartitions` (it will be determined by datasource 
automatically)
  - sampling (you can use `df.sample` later but the sampling operator won't 
be pushdown to datasource)

## How was this patch tested?
Unit tests.

## Benchmark
I benchmark and compare the cost time between old `ImageSchema.read` API 
and my image datasource.

**cluster**: 4 nodes, each with 64GB memory, 8 cores CPU
**test dataset**: Flickr8k_Dataset (about 8091 images)

**time cost**:
My image datasource time (automatically generate 258 partitions):  38.04s 
`ImageSchema.read` time (set 16 partitions): 68.4s
`ImageSchema.read` time (set 258 partitions):  90.6s

**time cost when increase image number by double (clone Flickr8k_Dataset 
and loads double number images):
My image datasource time (automatically generate 515 partitions):  95.4s 
`ImageSchema.read` (set 32 partitions): 109s
`ImageSchema.read` (set 515 partitions):  105s

So we can see that my image datasource implementation (this PR) bring some 
performance improvement compared against old`ImageSchema.read` API.











You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark image_datasource

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22328


commit 5b5aee66b2ea819341b624164298f0700ee07ddf
Author: WeichenXu 
Date:   2018-09-04T09:48:50Z

init pr




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org