[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83375/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Merged build finished. Test PASSed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148712557 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- Probably, we should avoid the term singleton. It might be technically inappropriate.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649 **[Test build #83375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83375/testReport)** for PR 19649 at commit [`a3867b7`](https://github.com/apache/spark/commit/a3867b78c1a64f7d3196aaef6ab63db740dcc758). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19650 @srowen @gatorsmile could you please review this?
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148712017 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- I would like to keep it simple for now if the current way does not cause harm to the functionality, if that sounds okay to you and the other reviewers too. Please let me know if I missed anything here.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148711768 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- > Is this a standard way to define singletons in Python? To my knowledge, there are many workarounds that resemble a singleton, but I am pretty sure there is no standard way. We have similar examples of this pattern within Spark: https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/accumulators.py#L104 https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/storagelevel.py#L52-L58 There is also an example of another approach within Spark: https://github.com/apache/spark/blob/17af727e38c3faaeab5b91a8cdab5f2181cf3fc4/python/pyspark/sql/types.py#L96-L106 I am okay with using this approach as well. However, they basically do similar things in our case. This way also requires creating at least a single instance.
The point of this approach, I believe, is to return the same instance from `__init__`, but I think our case should disallow `__init__` itself.

> What happens when this package or module gets reloaded?

I think reloading won't affect the functionality here. It creates the `_ImageSchema` class again, creates an instance of `_ImageSchema`, and monkey-patches `__init__` to disallow creating instances.

```python
>>> from pyspark.ml import image
>>> print image.ImageSchema._imageSchema
None
>>> print image.ImageSchema.imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> print image.ImageSchema._imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> reload(image)
>>> print image.ImageSchema._imageSchema
None
>>> print image.ImageSchema.imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> print image.ImageSchema._imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
```

I think we are fine even if we happen to have multiple instances of `ImageSchema`, because each instance only holds cached attributes.
> Numpy uses a somewhat different approach: https://github.com/numpy/numpy/blob/d75b86c0c49f7eb3ec60564c2e23b3ff237082a2/numpy/_globals.py

I think the Numpy case tries to keep same-object comparison (`a is b`) clean. In our case, I guess we do not actually require that (and this is partly why I used the term "singleton-like").
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19532 **[Test build #83377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83377/testReport)** for PR 19532 at commit [`9670d6f`](https://github.com/apache/spark/commit/9670d6f8b3fed58f556b9050814f7c61bc5d65be).
[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19650 **[Test build #83376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83376/testReport)** for PR 19650 at commit [`0a536fb`](https://github.com/apache/spark/commit/0a536fb0dbaa9e1e929610ecc15ac416253fd823).
[GitHub] spark pull request #19650: [SPARK-22254][core] Fix the arrayMax in BufferHol...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19650 [SPARK-22254][core] Fix the arrayMax in BufferHolder ## What changes were proposed in this pull request? This PR replaces the old maximum array size (`Int.MaxValue`) with the new one (`ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`). This PR also refactors the code that calculates the new array size, to make it easier to understand why we have to use `newSize - 2` when allocating a new array. ## How was this patch tested? Used the existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-22254 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19650.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19650 commit 0a536fb0dbaa9e1e929610ecc15ac416253fd823 Author: Kazuaki Ishizaki Date: 2017-11-03T03:14:00Z initial commit
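JVM arrays cannot be allocated all the way up to `Int.MaxValue` elements, which is why the PR caps buffer growth at `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` (`Integer.MAX_VALUE - 15` in Spark) instead of `Int.MaxValue`. A rough Python sketch of a doubling-with-cap growth calculation — the helper name is illustrative and this is not the exact `BufferHolder` code:

```python
INT_MAX = 2**31 - 1
# Spark's safe upper bound for JVM array lengths (ByteArrayMethods).
MAX_ROUNDED_ARRAY_LENGTH = INT_MAX - 15

def grown_capacity(length, extra):
    """Return a new capacity >= length + extra, doubling when possible
    but never exceeding the maximum allocatable array length."""
    needed = length + extra
    if needed > MAX_ROUNDED_ARRAY_LENGTH:
        raise MemoryError("cannot grow buffer beyond max array length")
    # Double the buffer, but clamp to the maximum array size.
    return min(max(length * 2, needed), MAX_ROUNDED_ARRAY_LENGTH)
```

With the old `Int.MaxValue` bound, a request near the limit could attempt an allocation the JVM refuses; clamping to the rounded maximum avoids that failure mode.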
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83382/testReport)** for PR 19651 at commit [`fdde274`](https://github.com/apache/spark/commit/fdde27416fe54036afbd9b809a363e7871df67cf).
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710840 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. --- End diff -- is `local:/path` referring to a Windows drive/path, or should the actual text `local:/` be there?
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710799 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver. --- End diff -- hmm, is this right `add the jar to the current driver.`?
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710943 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver. +#' +#' @rdname spark.addJar +#' @param path The path of the jar to be added +#' @param addToCurrentClassLoader Whether to add the jar to the current driver class loader. +#' @export +#' @examples +#'\dontrun{ +#' spark.addJar("/path/to/something.jar", TRUE) +#'} +#' @note spark.addJar since 2.3.0 +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) { + normalizedPath <- suppressWarnings(normalizePath(path)) --- End diff -- yea, normalizePath wouldn't handle URLs... https://stat.ethz.ch/R-manual/R-devel/library/base/html/normalizePath.html I think we should require absolute paths in their canonical form here and just pass them through.
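The hazard of applying filesystem path normalization to a URL is easy to demonstrate. Python's `posixpath.normpath` is used here as a stand-in for R's `normalizePath` (the two are not identical, but both collapse redundant slashes), which is enough to show why URLs should bypass normalization:

```python
import posixpath

# Treating a URL as a filesystem path corrupts it: normpath collapses
# the "//" after the scheme into a single "/".
print(posixpath.normpath("https://example.com/jars/app.jar"))
# -> https:/example.com/jars/app.jar  (broken URL)

# A genuine local path, by contrast, is cleaned up as intended.
print(posixpath.normpath("/tmp//jars/../app.jar"))
# -> /tmp/app.jar
```

This is why the suggestion above is to require canonical absolute paths (or URIs) and pass them through untouched rather than normalize blindly.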
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please.
[GitHub] spark pull request #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pus...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19647#discussion_r148710991 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -332,32 +332,18 @@ object LimitPushDown extends Rule[LogicalPlan] { // pushdown Limit. case LocalLimit(exp, Union(children)) => LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _ -// Add extra limits below OUTER JOIN. For LEFT OUTER and FULL OUTER JOIN we push limits to the -// left and right sides, respectively. For FULL OUTER JOIN, we can only push limits to one side -// because we need to ensure that rows from the limited side still have an opportunity to match -// against all candidates from the non-limited side. We also need to ensure that this limit -// pushdown rule will not eventually introduce limits on both sides if it is applied multiple -// times. Therefore: +// Add extra limits below OUTER JOIN. For LEFT OUTER and RIGHT OUTER JOIN we push limits to +// the left and right sides, respectively. It's not safe to push limits below FULL OUTER +// JOIN in the general case without a more invasive rewrite. +// We also need to ensure that this limit pushdown rule will not eventually introduce limits +// on both sides if it is applied multiple times. Therefore: // - If one side is already limited, stack another limit on top if the new limit is smaller. // The redundant limit will be collapsed by the CombineLimits rule. // - If neither side is limited, limit the side that is estimated to be bigger. case LocalLimit(exp, join @ Join(left, right, joinType, _)) => val newJoin = joinType match { case RightOuter => join.copy(right = maybePushLocalLimit(exp, right)) case LeftOuter => join.copy(left = maybePushLocalLimit(exp, left)) -case FullOuter => --- End diff -- Thanks for working on it! We should still keep it. 
Let me fix it based on my original PR: https://github.com/apache/spark/pull/10454
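A toy illustration of why pushing a limit below a FULL OUTER JOIN is unsafe in general: limiting one side can turn a row that would have matched into an unmatched, null-padded row that does not exist in the true join result at all, so any `LIMIT` applied on top may then return invalid rows. This is plain Python over small lists, not the optimizer code:

```python
def full_outer_join(left, right):
    """Full outer join on value equality; None marks the unmatched side."""
    out, matched_right = [], set()
    for l in left:
        hits = [r for r in right if r == l]
        if hits:
            for r in hits:
                out.append((l, r))
                matched_right.add(r)
        else:
            out.append((l, None))
    for r in right:
        if r not in matched_right:
            out.append((None, r))
    return out

left, right = [1, 2], [2]
full = full_outer_join(left, right)        # true result: [(1, None), (2, 2)]
# Simulate "LIMIT 1" pushed below the join on the left side:
pushed = full_outer_join(left[:1], right)  # [(1, None), (None, 2)]
# (None, 2) is not a row of the true result, so the rewritten plan can
# surface a row the original query could never produce.
```

A `LIMIT n` without ordering may return any n rows of the full result, but it must still be a subset of that result; the pushed-down variant violates exactly that.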
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83381/ Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19651 [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on ORC 1.4.1 ## What changes were proposed in this pull request? Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), Apache Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. This PR aims to add a new ORC data source inside `sql/core` and to replace the old ORC data source eventually. This PR resolves the following three issues. - SPARK-20682: Add new ORCFileFormat based on Apache ORC 1.4.1 - SPARK-15474: ORC data source fails to write and read back empty dataframe - SPARK-21791: ORC should support column names with dot ## How was this patch tested? Passes Jenkins with all the existing tests and new tests for SPARK-15474 and SPARK-21791. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19651.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19651 commit fdde27416fe54036afbd9b809a363e7871df67cf Author: Dongjoon Hyun Date: 2017-05-15T02:33:15Z [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on Apache ORC 1.4.1
[GitHub] spark issue #19624: [SPARKR][SPARK-22315] Warn if SparkR package version doe...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19624 Maybe, but it will say `Version mismatch between Spark JVM and SparkR package. JVM version was 2.3.0-SNAPSHOT, while R package version was 2.1.2`. I think it will be clear that the numbers are different. If they are both 2.3.0, it won't show the warning.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148707442 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): --- End diff -- Ooops, I forgot about that. thx!
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148707441 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): +curr_type = 'M8[us]' +has_rec_fix = True +record_type_list.append((str(col_names[i]), curr_type)) +if not has_rec_fix: +record_type_list = None --- End diff -- Yeah, probably a good idea. I'll see if I can clean it up some. Thanks @viirya !
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709143 --- Diff: python/pyspark/sql/session.py --- @@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _getNumpyRecordDtypes(self, rec): +""" +Used when converting a pandas.DataFrame to Spark using to_records(), this will correct +the dtypes of records so they can be properly loaded into Spark. +:param rec: a numpy record to check dtypes +:return corrected dtypes for a numpy.record or None if no correction needed +""" +import numpy as np +cur_dtypes = rec.dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): --- End diff -- There shouldn't be any difference for the most part. I only used `M8` here because when debugging these types, that is what was being output for the record types by `numpy.record.dtype`. Would you prefer `datetime64` if that works?
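The `tolist()` behavior the diff comment relies on can be checked directly: for nanosecond-precision datetimes NumPy returns raw integers (Python's `datetime` cannot represent nanoseconds), while after casting to microsecond precision it returns `datetime.datetime` objects — which is exactly why the patch rewrites `M8[ns]` columns to `M8[us]` before conversion:

```python
import datetime
import numpy as np

# Nanosecond precision: Python datetime cannot represent it, so NumPy
# falls back to raw integers (nanoseconds since the epoch).
ns_arr = np.array(['2017-11-03T12:00:00'], dtype='datetime64[ns]')
print(type(ns_arr.tolist()[0]))   # <class 'int'>

# After casting to microseconds, tolist() yields datetime.datetime objects.
us_arr = ns_arr.astype('datetime64[us]')
print(type(us_arr.tolist()[0]))   # <class 'datetime.datetime'>
```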
[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics
Github user tengpeng commented on the issue: https://github.com/apache/spark/pull/19638 I have used @sethah's approach to address the issues we have. Since we are not adding a new method to the public trait, there is no more binary compatibility issue.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19646 **[Test build #83380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83380/testReport)** for PR 19646 at commit [`3944b5c`](https://github.com/apache/spark/commit/3944b5ca3f2ac588ee4a997ca55c10ed1456c7cf).
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83380/ Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19646 **[Test build #83380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83380/testReport)** for PR 19646 at commit [`3944b5c`](https://github.com/apache/spark/commit/3944b5ca3f2ac588ee4a997ca55c10ed1456c7cf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19646 retest this please
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19643 Merged build finished. Test PASSed.
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19643 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83365/ Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test FAILed.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Merged build finished. Test PASSed.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19532 **[Test build #83377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83377/testReport)** for PR 19532 at commit [`9670d6f`](https://github.com/apache/spark/commit/9670d6f8b3fed58f556b9050814f7c61bc5d65be). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83377/ Test PASSed.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148708752 --- Diff: python/pyspark/sql/session.py --- @@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _getNumpyRecordDtypes(self, rec): +""" +Used when converting a pandas.DataFrame to Spark using to_records(), this will correct +the dtypes of records so they can be properly loaded into Spark. +:param rec: a numpy record to check dtypes +:return corrected dtypes for a numpy.record or None if no correction needed +""" +import numpy as np +cur_dtypes = rec.dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): --- End diff -- Isn't this `datetime64[ns]`? What's the difference between `M8[ns]` and `datetime64[ns]`?
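As to the difference being asked about: `M8` is simply NumPy's type-character shorthand for `datetime64`, so the two spellings construct identical dtype objects:

```python
import numpy as np

# 'M8' is NumPy's array-interface type character for 'datetime64';
# both spellings produce the same dtype, and its canonical name uses
# the long form.
assert np.dtype('M8[ns]') == np.dtype('datetime64[ns]')
print(np.dtype('M8[ns]').name)   # datetime64[ns]
```

This matches the reply above: NumPy's repr/debug output often shows the short `M8` form, but either spelling works in comparisons.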
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83379/ Test FAILed.
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19643

**[Test build #83365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83365/testReport)** for PR 19643 at commit [`b928ab8`](https://github.com/apache/spark/commit/b928ab82c54a72f91c01ba9d3418c80c58ab9da1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19646 retest this please
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19532 Thank you for your review comments. I have reverted the code, so the duration is no longer calculated at runtime in the code; now only the documentation changes remain. Please review again. @srowen @jiangxb1987 @cloud-fan @ajbozarth
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83378/ Test FAILed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test FAILed.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709362

--- Diff: python/pyspark/sql/session.py ---
@@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema

+    def _getNumpyRecordDtypes(self, rec):
+        """
+        Used when converting a pandas.DataFrame to Spark using to_records(), this will correct
+        the dtypes of records so they can be properly loaded into Spark.
+        :param rec: a numpy record to check dtypes
+        :return corrected dtypes for a numpy.record or None if no correction needed
+        """
+        import numpy as np
+        cur_dtypes = rec.dtype
+        col_names = cur_dtypes.names
+        record_type_list = []
+        has_rec_fix = False
+        for i in xrange(len(cur_dtypes)):
+            curr_type = cur_dtypes[i]
+            # If type is a datetime64 timestamp, convert to microseconds
+            # NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs,
+            # this conversion will lead to an output of py datetime objects, see SPARK-22417
+            if curr_type == np.dtype('M8[ns]'):
--- End diff --

Yes, I'd prefer it if that works, otherwise I'd like you to add some comments saying we can use `M8[ns]` instead of `datetime64[ns]`.
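For reference on the question in this thread: `M8` is simply NumPy's type-character spelling of `datetime64`, so the two dtype strings construct equal dtypes. A quick check (plain NumPy, independent of the PR's code):

```python
import numpy as np

# 'M8' is the one-character type code for datetime64; with the same time
# unit, both spellings produce the same dtype object.
a = np.dtype('M8[ns]')
b = np.dtype('datetime64[ns]')

print(a == b)   # True
print(b.kind)   # 'M' (the datetime64 kind code)
```

This is why comparing against `np.dtype('M8[ns]')` behaves identically to comparing against `np.dtype('datetime64[ns]')`, which is what the review asks to either change or document.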
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709757

--- Diff: python/pyspark/sql/tests.py ---
@@ -2592,6 +2592,16 @@ def test_create_dataframe_from_array_of_long(self):
         df = self.spark.createDataFrame(data)
         self.assertEqual(df.first(), Row(longarray=[-9223372036854775808, 0, 9223372036854775807]))

+    @unittest.skipIf(not _have_pandas, "Pandas not installed")
+    def test_create_dataframe_from_pandas_with_timestamp(self):
+        import pandas as pd
+        from datetime import datetime
+        pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)],
+                            "d": [pd.Timestamp.now().date()]})
+        df = self.spark.createDataFrame(pdf)
--- End diff --

What if we specify the schema? For example:

```
df = self.spark.createDataFrame(pdf, "ts timestamp, d date")
```
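The NOTE in the diff earlier in this thread (SPARK-22417) can be reproduced with plain NumPy, no Spark required: at nanosecond resolution the record values come back as raw integers, while after converting to microsecond resolution they come back as Python `datetime` objects. A minimal sketch:

```python
import numpy as np
from datetime import datetime

ts = np.array(['2017-10-31T01:01:01'], dtype='datetime64[ns]')

# At nanosecond resolution, .item() yields a plain int (nanoseconds since
# the epoch), because datetime.datetime cannot represent nanoseconds.
print(type(ts[0].item()))   # <class 'int'>

# After casting to microsecond resolution, .item() yields datetime objects,
# which is what the conversion in the PR relies on.
us = ts.astype('datetime64[us]')
print(type(us[0].item()))   # <class 'datetime.datetime'>
```

The same rule applies to the tuples produced by `np.record.tolist()` on a record array obtained from `pandas.DataFrame.to_records()`, which is the code path under review.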
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709163

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -202,6 +202,15 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
       dataset.as[LabeledPoint], estimator, modelEquals, 42L)
   }

+  test("prediction on single instance") {
+    val trainer = new LinearSVC()
+    val model = trainer.fit(smallBinaryDataset)
+    model.transform(smallBinaryDataset).select("features", "prediction").collect().foreach {
+      case Row(features: Vector, prediction: Double) =>
+        assert(prediction ~== model.predict(features) relTol 1E-5)
--- End diff --

Could you please check for exact equality here (& in other tests)?
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148708939

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     Vector, NaiveBayesModel](model, testDataset)
   }

+  test("prediction on single instance") {
+    val nPoints = 1000
+    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+    val thetaArray = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10)  // label 2
+    ).map(_.map(math.log))
+    val pi = Vectors.dense(piArray)
+    val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+    val testDataset =
+      generateNaiveBayesInput(piArray, thetaArray, nPoints, seed, "multinomial").toDF()
+    val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
+    val model = nb.fit(testDataset)
+
+    validateModelFit(pi, theta, model)
--- End diff --

Do we need lines 184-186? They seem unrelated to what we want to test (that `predict` produces the same result as `transform` on a single instance). Similarly, I don't think we need to create `piArray`, `thetaArray`, `pi`, `theta`, etc; this test should just fit a model on a dataset and compare the fitted model's `predict` and `transform` outputs.
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709125

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     Vector, NaiveBayesModel](model, testDataset)
   }

+  test("prediction on single instance") {
+    val nPoints = 1000
+    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+    val thetaArray = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10)  // label 2
+    ).map(_.map(math.log))
+    val pi = Vectors.dense(piArray)
+    val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+    val testDataset =
--- End diff --

Suggestion: rename to `trainDataset`?
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148706148 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19625: [SPARK-22407][WEB-UI] Add rdd id column on storage page ...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19625 Please upload a screenshot to the PR.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 @sethah The build passed :) and I have made the changes as suggested (setting maxIter and maxDepth). Pinging @MLnick or @jkbradley so we can move ahead with this.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Merged build finished. Test PASSed.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83372/ Test PASSed.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18118

**[Test build #83372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83372/testReport)** for PR 18118 at commit [`0e9507e`](https://github.com/apache/spark/commit/0e9507e13a352b59798993228709bf7654747d0c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19642 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83366/ Test PASSed.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19642 Merged build finished. Test PASSed.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19642

**[Test build #83366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83366/testReport)** for PR 19642 at commit [`2711eaf`](https://github.com/apache/spark/commit/2711eaf25f55c954842ee9498f9706df33d69e1a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r148702816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,456 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, lit, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'keep' (invalid data produces a vector of zeros) or 'error' (throw an error). 
+ * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'keep' (invalid data produces a vector of zeros) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) + + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputColNames = $(inputCols) +val outputColNames = $(outputCols) +val existingFields = schema.fields + +require(inputColNames.length == outputColNames.length, + s"The number of input columns ${inputColNames.length} must be the same as the number of " + +s"output columns ${outputColNames.length}.") + +inputColNames.zip(outputColNames).map { case (inputColName, outputColName) => + require(schema(inputColName).dataType.isInstanceOf[NumericType], +s"Input column must be of type NumericType but got ${schema(inputColName).dataType}") + require(!existingFields.exists(_.name == outputColName), +s"Output column $outputColName already exists.") +} + +// Prepares output columns with proper attributes by examining input columns. 
+val inputFields = $(inputCols).map(schema(_)) + +val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) => + OneHotEncoderCommon.transformOutputColumnSchema( +inputField, $(dropLast), outputColName) +} +StructType(schema.fields ++ outputFields) + } +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to
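The scaladoc quoted above describes the encoding rule that OneHotEncoderEstimator implements. A minimal Python sketch of that rule (a hypothetical `one_hot` helper for illustration, not Spark's implementation): with 5 categories and the default `dropLast`, the output vector has 4 entries, index 2.0 maps to a single one-value, and the last category maps to the all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector, mimicking the
    documented dropLast semantics (sketch only)."""
    # Dropping the last category keeps the encoded columns from always
    # summing to one, which would make them linearly dependent.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[int(index)] = 1.0
    return vec

print(one_hot(2.0, 5))   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))   # [0.0, 0.0, 0.0, 0.0] -- last category dropped
```

With `drop_last=False` the vector simply has one entry per category, matching the `dropLast` parameter described in the diff.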
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649

**[Test build #83375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83375/testReport)** for PR 19649 at commit [`a3867b7`](https://github.com/apache/spark/commit/a3867b78c1a64f7d3196aaef6ab63db740dcc758).
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148702060 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148701873

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -236,12 +252,17 @@ object CrossValidator extends MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
     @Since("1.4.0") override val uid: String,
     @Since("1.2.0") val bestModel: Model[_],
-    @Since("1.5.0") val avgMetrics: Array[Double])
+    @Since("1.5.0") val avgMetrics: Array[Double],
+    @Since("2.3.0") val subModels: Option[Array[Array[Model[_]]]])
--- End diff --

So we also need a `setSubModels` method for setting it.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test PASSed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648

**[Test build #83371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83371/testReport)** for PR 19648 at commit [`5eee170`](https://github.com/apache/spark/commit/5eee170bfa67b7dd75b4a93c30559c354b99b541).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83371/ Test PASSed.
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148701451

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     instr.logParams(numFolds, seed, parallelism)
     logTuningParams(instr)

+    val collectSubModelsParam = $(collectSubModels)
+
+    var subModels: Option[Array[Array[Model[_]]]] = if (collectSubModelsParam) {
--- End diff --

@holdenk @jkbradley I already thought about this issue. The reason I use this approach is:
1) When `$(collectSubModels) == false`, the `modelFutures` and `foldMetricFutures` will be executed in a pipelined way. This ensures each `model` generated in `modelFutures` is released in time, so the maximum memory cost is `numParallelism * sizeof(model)`. If we instead collected `modelFutures`, the memory cost would grow to `$(estimatorParamMaps).length * sizeof(model)`. This is a serious issue that was discussed before.
2) IMO the mutation on L145 won't hurt performance, and it does not need anything like a lock; there is no race condition.
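The memory argument in the comment above can be sketched outside Spark. The following Python illustration (with hypothetical `train`/`evaluate` stand-ins, not Spark's Scala code) shows the pipelined pattern: each task fits a model and immediately reduces it to a metric, so a model never outlives its task and at most `num_parallelism` models exist at once, instead of one per entry in the parameter grid.

```python
from concurrent.futures import ThreadPoolExecutor

def train(params):
    # Stand-in for fitting one model: returns a (potentially large) object.
    return {"params": params, "weights": [0.0] * 4}

def evaluate(model):
    # Stand-in for computing a fold metric from a fitted model.
    return sum(model["weights"]) + model["params"]

def train_and_evaluate(params):
    model = train(params)   # the model lives only inside this task...
    return evaluate(model)  # ...and only the scalar metric escapes it

param_grid = list(range(8))
num_parallelism = 2

with ThreadPoolExecutor(max_workers=num_parallelism) as pool:
    # Pipelined: at most num_parallelism models are in memory at any time.
    metric_futures = [pool.submit(train_and_evaluate, p) for p in param_grid]
    metrics = [f.result() for f in metric_futures]

print(metrics)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

Collecting all fitted models first and evaluating afterwards would instead keep `len(param_grid)` models alive simultaneously, which is the `$(estimatorParamMaps).length * sizeof(model)` cost the comment warns about.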
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test PASSed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648

**[Test build #83368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83368/testReport)** for PR 19648 at commit [`0fd9a5c`](https://github.com/apache/spark/commit/0fd9a5c617053ced4210432f261a0053a04442dd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83368/ Test PASSed.
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19641 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83374/ Test FAILed.
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19641 Merged build finished. Test FAILed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148700390 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19641 retest this please.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148700189 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83373/ Test FAILed.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Build finished. Test FAILed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148699685 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- Is this a standard way to define singletons in Python? I've seen several methods searching online. Numpy uses a somewhat different approach: https://github.com/numpy/numpy/blob/d75b86c0c49f7eb3ec60564c2e23b3ff237082a2/numpy/_globals.py What happens when this package or module gets reloaded? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/19649 [SPARK-22405][SQL] Add more ExternalCatalogEvent ## What changes were proposed in this pull request? We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events, like "CreateDatabaseEvent", for a custom SparkListener to use. But there are still some events missing, like an alter database event and an alter table event. So here I propose to add new ExternalCatalogEvents. ## How was this patch tested? Enriched the current UT and tested on a local cluster. CC @hvanhovell please let me know your comments about the current proposal, thanks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-22405 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19649 commit 5c628be6b6838b224a27e06731f686a5182e1bad Author: jerryshao Date: 2017-11-03T01:48:48Z Add more ExternalCatalogEvent
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18118 **[Test build #83372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83372/testReport)** for PR 18118 at commit [`0e9507e`](https://github.com/apache/spark/commit/0e9507e13a352b59798993228709bf7654747d0c).
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148696592 --- Diff: mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala --- @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.nio.file.Paths +import java.util.Arrays + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.image.ImageSchema._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.Row +import org.apache.spark.sql.types._ + +class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext { + // Single column of images named "image" + private lazy val imagePath = "../data/mllib/images" + + test("Smoke test: create basic ImageSchema dataframe") { +val origin = "path" +val width = 1 +val height = 1 +val nChannels = 3 +val data = Array[Byte](0, 0, 0) +val mode = ocvTypes("CV_8UC3") + +// Internal Row corresponds to image StructType +val rows = Seq(Row(Row(origin, height, width, nChannels, mode, data)), + Row(Row(null, height, width, nChannels, mode, data))) +val rdd = sc.makeRDD(rows) +val df = spark.createDataFrame(rdd, ImageSchema.imageSchema) + +assert(df.count === 2, "incorrect image count") +assert(df.schema("image").dataType == columnSchema, "data do not fit ImageSchema") + } + + test("readImages count test") { +var df = readImages(imagePath, recursive = false) +assert(df.count === 1) + +df = readImages(imagePath, recursive = true, dropImageFailures = false) +assert(df.count === 9) + +df = readImages(imagePath, recursive = true, dropImageFailures = true) +val countTotal = df.count +assert(countTotal === 7) + +df = readImages(imagePath, recursive = true, sampleRatio = 0.5, dropImageFailures = true) --- End diff -- This would be a good reason to have a seed: We can make the test deterministic to avoid flakiness (from occasionally having an actual sampleRatio of 0 or 1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695824 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: --- End diff -- If this whole class is marked Experimental, then it's 
OK not to mark each field/method Experimental.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695893 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. + +.. autoclass:: _ImageSchema + :members: +""" + +from pyspark import SparkContext +from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string +from pyspark.sql import DataFrame, SparkSession +import numpy as np --- End diff -- style: order imports with standard python libraries imported before pyspark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695760 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { --- End diff -- This is why we need first-class support for UDTs! :) If this will stay public, can you please add some Scaladoc to it and also review the fields to tighten the privacy (probably to ```private[ml]```) where reasonable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148698771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148698295 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( --- End diff -- There isn't a great option for Scala- and Java-friendly maps. So far, we tend to prefix with "java" as in "javaOcvTypes" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148694771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. 
+ * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Tell me if this SamplePathFilter has already been discussed; I may have missed it in the many comments above. I'm worried about it being deterministic, but I'm also not that familiar with the Hadoop APIs being used here. * If the DataFrame is reloaded (recomputed), or if a task fails and that partition is recomputed, then will random.nextDouble() really produce the same results? * I'd expect we'd need to set a seed, as @thunterdb suggested. I'm fine with a fixed seed, though it'd be nice to have it configurable in the future. * Even if we set a seed, then is random.nextDouble computed in a fixed order over each partition? We've run into a lot of issues in both RDD and DataFrame sampling methods with non-deterministic results, so I want to be careful here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
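One way to sidestep the determinism worry raised above is to make the keep/drop decision a pure function of the path and a seed, rather than a draw from a shared RNG whose sequence depends on evaluation order. This is a minimal hash-based sketch, not Spark's actual `SamplePathFilter`:

```python
import zlib

def keep_path(path, sample_ratio, seed=0):
    # Hash the (path, seed) pair into [0, 1); the decision depends only on
    # these two inputs, so task retries and partition recomputation always
    # see the exact same sample.
    h = zlib.crc32(("%s:%d" % (path, seed)).encode("utf-8"))  # unsigned 32-bit
    return h / 2.0 ** 32 < sample_ratio

paths = ["images/%03d.jpg" % i for i in range(100)]
sampled = [p for p in paths if keep_path(p, 0.5, seed=42)]

# Re-running the filter reproduces the same subset, regardless of order.
assert sampled == [p for p in paths if keep_path(p, 0.5, seed=42)]
```

The trade-off is that the achieved fraction only approximates `sample_ratio` (hashing gives no exact count), but the sample is stable across recomputations, which is the property the review asks for.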
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148693252 --- Diff: python/pyspark/ml/tests.py --- @@ -1818,6 +1819,24 @@ def tearDown(self): del self.data +class ImageReaderTest(SparkSessionTestCase): + +def test_read_images(self): +data_path = 'python/test_support/image/kittens' --- End diff -- Could you please move the images used in Python tests to the data/mllib/images/ directory too? Python tests already use data in data/mllib/
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695558 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava --- End diff -- This can be package private. (Python ignores package private limitations.) 
Or rename it as javaOcvTypes.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695330 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695505 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( --- End diff -- For legibility, it'd be nice to define the imageFields values here (inline). You could then define imageFields by extracting those values from columnSchema. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
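The refactoring suggested above — write each field name once, inline in the schema, and derive the name list from it — can be sketched in plain Python without PySpark. The `namedtuple` below is only a stand-in for Spark's `StructField`, used so the example runs anywhere:

```python
from collections import namedtuple

# Illustrative stand-in for org.apache.spark.sql.types.StructField.
StructField = namedtuple("StructField", ["name", "dataType", "nullable"])

# Field names written once, inline in the schema definition...
column_schema = [
    StructField("origin", "string", True),
    StructField("height", "int", False),
    StructField("width", "int", False),
    StructField("nChannels", "int", False),
    StructField("mode", "int", False),     # OpenCV-compatible type
    StructField("data", "binary", False),  # row-wise BGR bytes
]

# ...and the name list derived from the schema, never duplicated.
image_fields = [f.name for f in column_schema]
```

With this shape, `imageFields` and `columnSchema` cannot drift apart, because one is computed from the other (in Scala, `columnSchema.fieldNames` gives the same derivation).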
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 Jenkins test this please
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 @sethah It's still failing; I don't think the issue is on my side. Please help
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648 **[Test build #83371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83371/testReport)** for PR 19648 at commit [`5eee170`](https://github.com/apache/spark/commit/5eee170bfa67b7dd75b4a93c30559c354b99b541).
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19648 Jenkins, test this please.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test FAILed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83370/ Test FAILed.
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18538 @jkbradley @mgaido91 I just sent #19648 to move test data to data/mllib, please feel free to review it. Thanks.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148696316 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): --- End diff -- oh, session.py didn't define xrange for Python 3.
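The `xrange` issue flagged above is the classic Python 2/3 gap: Python 3 removed `xrange` because `range` became lazy. A conventional compatibility shim — a sketch of the usual pattern, not necessarily what session.py ends up doing — looks like this:

```python
import sys

# Python 3 removed xrange (range is now lazy), so code written for
# Python 2 that calls xrange raises NameError on Python 3 unless a
# module-level alias like this exists.
if sys.version_info[0] >= 3:
    xrange = range

# Works on both major versions:
squares = [i * i for i in xrange(4)]
```

The alternative, of course, is to call `range` directly, which is fine wherever the iterable is consumed immediately (as in the `for i in xrange(len(cur_dtypes))` loop in the diff).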
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19479 Merged build finished. Test FAILed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648 **[Test build #83368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83368/testReport)** for PR 19648 at commit [`0fd9a5c`](https://github.com/apache/spark/commit/0fd9a5c617053ced4210432f261a0053a04442dd).
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19479 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83369/ Test FAILed.
[GitHub] spark pull request #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvalu...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/19648 [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib. ## What changes were proposed in this pull request? Move ```ClusteringEvaluatorSuite``` test data (iris) to data/mllib, to avoid creating a new folder. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-14516 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19648.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19648 commit 0fd9a5c617053ced4210432f261a0053a04442dd Author: Yanbo Liang Date: 2017-11-03T01:03:23Z Move ClusteringEvaluatorSuite test data to data/mllib.