[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83375/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Merged build finished. Test PASSed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148712557 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- Probably, we should avoid the term singleton. It might be technically inappropriate.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649 **[Test build #83375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83375/testReport)** for PR 19649 at commit [`a3867b7`](https://github.com/apache/spark/commit/a3867b78c1a64f7d3196aaef6ab63db740dcc758). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19650 @srowen @gatorsmile could you please review this?
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148712017 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- I would like to keep it simple for now if the current way does not cause harm to the functionality, if that sounds okay to you and the other reviewers too. Please let me know if I missed anything here.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148711768 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- > Is this a standard way to define singletons in Python? To my knowledge, there are many workarounds that resemble a singleton, but I am pretty sure there is no standard way. We have similar examples of this pattern within Spark: https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/accumulators.py#L104 https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/storagelevel.py#L52-L58 There is also an example of another approach within Spark: https://github.com/apache/spark/blob/17af727e38c3faaeab5b91a8cdab5f2181cf3fc4/python/pyspark/sql/types.py#L96-L106 I am okay with using this approach as well. However, they basically do similar things in our case. This way also requires creating at least a single instance.
The point of this approach, I believe, is to return the same instance from `__init__`, but I think our case should disallow `__init__` itself.

> What happens when this package or module gets reloaded?

I think reloading won't affect the functionality here. It creates the `_ImageSchema` class again, creates an instance of `_ImageSchema`, and monkey-patches `__init__` to disallow creating instances.

```python
>>> from pyspark.ml import image
>>> print image.ImageSchema._imageSchema
None
>>> print image.ImageSchema.imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> print image.ImageSchema._imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> reload(image)
>>> print image.ImageSchema._imageSchema
None
>>> print image.ImageSchema.imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
>>> print image.ImageSchema._imageSchema
StructType(List(StructField(image,StructType(List(StructField(origin,StringType,true),StructField(height,IntegerType,false),StructField(width,IntegerType,false),StructField(nChannels,IntegerType,false),StructField(mode,IntegerType,false),StructField(data,BinaryType,false))),true)))
```

I think we are fine even if we happen to have multiple instances of `ImageSchema`, because each instance only holds cached attributes.
> Numpy uses a somewhat different approach: https://github.com/numpy/numpy/blob/d75b86c0c49f7eb3ec60564c2e23b3ff237082a2/numpy/_globals.py

I think the Numpy case tries to keep same-object comparison (`a is b`) clean. In our case, I guess we do not actually require that (and this is partly why I used the term "singleton-like").
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19532 **[Test build #83377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83377/testReport)** for PR 19532 at commit [`9670d6f`](https://github.com/apache/spark/commit/9670d6f8b3fed58f556b9050814f7c61bc5d65be).
[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19650 **[Test build #83376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83376/testReport)** for PR 19650 at commit [`0a536fb`](https://github.com/apache/spark/commit/0a536fb0dbaa9e1e929610ecc15ac416253fd823).
[GitHub] spark pull request #19650: [SPARK-22254][core] Fix the arrayMax in BufferHol...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19650 [SPARK-22254][core] Fix the arrayMax in BufferHolder ## What changes were proposed in this pull request? This PR replaces the old maximum array size (`Int.MaxValue`) with the new one (`ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`). This PR also refactors the code that calculates the new array size, to make it easier to understand why we have to use `newSize - 2` when allocating a new array. ## How was this patch tested? Used the existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-22254 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19650.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19650 commit 0a536fb0dbaa9e1e929610ecc15ac416253fd823 Author: Kazuaki Ishizaki Date: 2017-11-03T03:14:00Z initial commit
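JVM arrays cannot be allocated all the way up to `Int.MaxValue` elements, which is why the PR caps buffer growth at `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` (`Integer.MAX_VALUE - 15` in Spark) instead of `Int.MaxValue`. A rough Python sketch of a doubling-with-cap growth calculation — the helper name is illustrative and this is not the exact `BufferHolder` code:

```python
INT_MAX = 2**31 - 1
# Spark's safe upper bound for JVM array lengths (ByteArrayMethods).
MAX_ROUNDED_ARRAY_LENGTH = INT_MAX - 15

def grown_capacity(length, extra):
    """Return a new capacity >= length + extra, doubling when possible
    but never exceeding the maximum allocatable array length."""
    needed = length + extra
    if needed > MAX_ROUNDED_ARRAY_LENGTH:
        raise MemoryError("cannot grow buffer beyond max array length")
    # Double the buffer, but clamp to the maximum array size.
    return min(max(length * 2, needed), MAX_ROUNDED_ARRAY_LENGTH)
```

With the old `Int.MaxValue` bound, a request near the limit could attempt an allocation the JVM refuses; clamping to the rounded maximum avoids that failure mode.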
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83382/testReport)** for PR 19651 at commit [`fdde274`](https://github.com/apache/spark/commit/fdde27416fe54036afbd9b809a363e7871df67cf).
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710840 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. --- End diff -- is `local:/path` referring to a Windows drive/path, or should the actual text `local:/` be there?
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710799 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver. --- End diff -- hmm, is this right `add the jar to the current driver.`?
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r148710943 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver. +#' +#' @rdname spark.addJar +#' @param path The path of the jar to be added +#' @param addToCurrentClassLoader Whether to add the jar to the current driver class loader. +#' @export +#' @examples +#'\dontrun{ +#' spark.addJar("/path/to/something.jar", TRUE) +#'} +#' @note spark.addJar since 2.3.0 +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) { + normalizedPath <- suppressWarnings(normalizePath(path)) --- End diff -- yea, normalizePath wouldn't handle URLs... https://stat.ethz.ch/R-manual/R-devel/library/base/html/normalizePath.html I think we should require absolute paths in their canonical form here and just pass them through.
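The hazard of applying filesystem path normalization to a URL is easy to demonstrate. Python's `posixpath.normpath` is used here as a stand-in for R's `normalizePath` (the two are not identical, but both collapse redundant slashes), which is enough to show why URLs should bypass normalization:

```python
import posixpath

# Treating a URL as a filesystem path corrupts it: normpath collapses
# the "//" after the scheme into a single "/".
print(posixpath.normpath("https://example.com/jars/app.jar"))
# -> https:/example.com/jars/app.jar  (broken URL)

# A genuine local path, by contrast, is cleaned up as intended.
print(posixpath.normpath("/tmp//jars/../app.jar"))
# -> /tmp/app.jar
```

This is why the suggestion above is to require canonical absolute paths (or URIs) and pass them through untouched rather than normalize blindly.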
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please.
[GitHub] spark pull request #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pus...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19647#discussion_r148710991 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -332,32 +332,18 @@ object LimitPushDown extends Rule[LogicalPlan] { // pushdown Limit. case LocalLimit(exp, Union(children)) => LocalLimit(exp, Union(children.map(maybePushLocalLimit(exp, _ -// Add extra limits below OUTER JOIN. For LEFT OUTER and FULL OUTER JOIN we push limits to the -// left and right sides, respectively. For FULL OUTER JOIN, we can only push limits to one side -// because we need to ensure that rows from the limited side still have an opportunity to match -// against all candidates from the non-limited side. We also need to ensure that this limit -// pushdown rule will not eventually introduce limits on both sides if it is applied multiple -// times. Therefore: +// Add extra limits below OUTER JOIN. For LEFT OUTER and RIGHT OUTER JOIN we push limits to +// the left and right sides, respectively. It's not safe to push limits below FULL OUTER +// JOIN in the general case without a more invasive rewrite. +// We also need to ensure that this limit pushdown rule will not eventually introduce limits +// on both sides if it is applied multiple times. Therefore: // - If one side is already limited, stack another limit on top if the new limit is smaller. // The redundant limit will be collapsed by the CombineLimits rule. // - If neither side is limited, limit the side that is estimated to be bigger. case LocalLimit(exp, join @ Join(left, right, joinType, _)) => val newJoin = joinType match { case RightOuter => join.copy(right = maybePushLocalLimit(exp, right)) case LeftOuter => join.copy(left = maybePushLocalLimit(exp, left)) -case FullOuter => --- End diff -- Thanks for working on it! We should still keep it. 
Let me fix it based on my original PR: https://github.com/apache/spark/pull/10454
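A toy illustration of why pushing a limit below a FULL OUTER JOIN is unsafe in general: limiting one side can turn a row that would have matched into an unmatched, null-padded row that does not exist in the true join result at all, so any `LIMIT` applied on top may then return invalid rows. This is plain Python over small lists, not the optimizer code:

```python
def full_outer_join(left, right):
    """Full outer join on value equality; None marks the unmatched side."""
    out, matched_right = [], set()
    for l in left:
        hits = [r for r in right if r == l]
        if hits:
            for r in hits:
                out.append((l, r))
                matched_right.add(r)
        else:
            out.append((l, None))
    for r in right:
        if r not in matched_right:
            out.append((None, r))
    return out

left, right = [1, 2], [2]
full = full_outer_join(left, right)        # true result: [(1, None), (2, 2)]
# Simulate "LIMIT 1" pushed below the join on the left side:
pushed = full_outer_join(left[:1], right)  # [(1, None), (None, 2)]
# (None, 2) is not a row of the true result, so the rewritten plan can
# surface a row the original query could never produce.
```

A `LIMIT n` without ordering may return any n rows of the full result, but it must still be a subset of that result; the pushed-down variant violates exactly that.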
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83381/ Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19651 [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on ORC 1.4.1 ## What changes were proposed in this pull request? Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), Apache Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. This PR aims to add a new ORC data source inside `sql/core` and to replace the old ORC data source eventually. This PR resolves the following three issues. - SPARK-20682: Add new ORCFileFormat based on Apache ORC 1.4.1 - SPARK-15474: ORC data source fails to write and read back empty dataframe - SPARK-21791: ORC should support column names with dot ## How was this patch tested? Passes Jenkins with all the existing tests and new tests for SPARK-15474 and SPARK-21791. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19651.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19651 commit fdde27416fe54036afbd9b809a363e7871df67cf Author: Dongjoon Hyun Date: 2017-05-15T02:33:15Z [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on Apache ORC 1.4.1
[GitHub] spark issue #19624: [SPARKR][SPARK-22315] Warn if SparkR package version doe...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19624 Maybe, but it will say `Version mismatch between Spark JVM and SparkR package. JVM version was 2.3.0-SNAPSHOT, while R package version was 2.1.2`. I think it will be clear that the numbers are different. If they are both 2.3.0, it won't show the warning.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148707442 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): --- End diff -- Ooops, I forgot about that. thx!
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148707441 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): +curr_type = 'M8[us]' +has_rec_fix = True +record_type_list.append((str(col_names[i]), curr_type)) +if not has_rec_fix: +record_type_list = None --- End diff -- Yeah, probably a good idea. I'll see if I can clean it up some. Thanks @viirya !
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709143 --- Diff: python/pyspark/sql/session.py --- @@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _getNumpyRecordDtypes(self, rec): +""" +Used when converting a pandas.DataFrame to Spark using to_records(), this will correct +the dtypes of records so they can be properly loaded into Spark. +:param rec: a numpy record to check dtypes +:return corrected dtypes for a numpy.record or None if no correction needed +""" +import numpy as np +cur_dtypes = rec.dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): --- End diff -- There shouldn't be any difference for the most part. I only used `M8` here because when debugging these types, that is what was being output for the record types by `numpy.record.dtype`. Would you prefer `datetime64` if that works?
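The `tolist()` behavior the diff comment relies on can be checked directly: for nanosecond-precision datetimes NumPy returns raw integers (Python's `datetime` cannot represent nanoseconds), while after casting to microsecond precision it returns `datetime.datetime` objects — which is exactly why the patch rewrites `M8[ns]` columns to `M8[us]` before conversion:

```python
import datetime
import numpy as np

# Nanosecond precision: Python datetime cannot represent it, so NumPy
# falls back to raw integers (nanoseconds since the epoch).
ns_arr = np.array(['2017-11-03T12:00:00'], dtype='datetime64[ns]')
print(type(ns_arr.tolist()[0]))   # <class 'int'>

# After casting to microseconds, tolist() yields datetime.datetime objects.
us_arr = ns_arr.astype('datetime64[us]')
print(type(us_arr.tolist()[0]))   # <class 'datetime.datetime'>
```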
[GitHub] spark issue #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMetrics
Github user tengpeng commented on the issue: https://github.com/apache/spark/pull/19638 I have used @sethah's approach to address the issues we have. Since we are not adding a new method to the public trait, there is no more binary compatibility issue.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19646 **[Test build #83380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83380/testReport)** for PR 19646 at commit [`3944b5c`](https://github.com/apache/spark/commit/3944b5ca3f2ac588ee4a997ca55c10ed1456c7cf).
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83380/ Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19646 **[Test build #83380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83380/testReport)** for PR 19646 at commit [`3944b5c`](https://github.com/apache/spark/commit/3944b5ca3f2ac588ee4a997ca55c10ed1456c7cf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19646 retest this please
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19643 Merged build finished. Test PASSed.
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19643 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83365/ Test PASSed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test FAILed.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Merged build finished. Test PASSed.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19532 **[Test build #83377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83377/testReport)** for PR 19532 at commit [`9670d6f`](https://github.com/apache/spark/commit/9670d6f8b3fed58f556b9050814f7c61bc5d65be). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83377/ Test PASSed.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148708752 --- Diff: python/pyspark/sql/session.py --- @@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _getNumpyRecordDtypes(self, rec): +""" +Used when converting a pandas.DataFrame to Spark using to_records(), this will correct +the dtypes of records so they can be properly loaded into Spark. +:param rec: a numpy record to check dtypes +:return corrected dtypes for a numpy.record or None if no correction needed +""" +import numpy as np +cur_dtypes = rec.dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): +curr_type = cur_dtypes[i] +# If type is a datetime64 timestamp, convert to microseconds +# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs, +# this conversion will lead to an output of py datetime objects, see SPARK-22417 +if curr_type == np.dtype('M8[ns]'): --- End diff -- Isn't this `datetime64[ns]`? What's the difference between `M8[ns]` and `datetime64[ns]`?
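As to the difference being asked about: `M8` is simply NumPy's type-character shorthand for `datetime64`, so the two spellings construct identical dtype objects:

```python
import numpy as np

# 'M8' is NumPy's array-interface type character for 'datetime64';
# both spellings produce the same dtype, and its canonical name uses
# the long form.
assert np.dtype('M8[ns]') == np.dtype('datetime64[ns]')
print(np.dtype('M8[ns]').name)   # datetime64[ns]
```

This matches the reply above: NumPy's repr/debug output often shows the short `M8` form, but either spelling works in comparisons.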
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83379/ Test FAILed.
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19643

**[Test build #83365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83365/testReport)** for PR 19643 at commit [`b928ab8`](https://github.com/apache/spark/commit/b928ab82c54a72f91c01ba9d3418c80c58ab9da1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19646 retest this please
[GitHub] spark issue #19532: [CORE]Modify the duration real-time calculation and upda...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19532 Thank you for your review comments. I have reverted the code, so the duration is no longer calculated at runtime in the code; now only the documentation changes remain. Please review again. @srowen @jiangxb1987 @cloud-fan @ajbozarth
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83378/ Test FAILed.
[GitHub] spark issue #19646: [SPARK-22147][PYTHON] Fix for createDataFrame from panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19646 Merged build finished. Test FAILed.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709362

--- Diff: python/pyspark/sql/session.py ---
@@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema):
         data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema

+    def _getNumpyRecordDtypes(self, rec):
+        """
+        Used when converting a pandas.DataFrame to Spark using to_records(), this will correct
+        the dtypes of records so they can be properly loaded into Spark.
+        :param rec: a numpy record to check dtypes
+        :return corrected dtypes for a numpy.record or None if no correction needed
+        """
+        import numpy as np
+        cur_dtypes = rec.dtype
+        col_names = cur_dtypes.names
+        record_type_list = []
+        has_rec_fix = False
+        for i in xrange(len(cur_dtypes)):
+            curr_type = cur_dtypes[i]
+            # If type is a datetime64 timestamp, convert to microseconds
+            # NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs,
+            # this conversion will lead to an output of py datetime objects, see SPARK-22417
+            if curr_type == np.dtype('M8[ns]'):
--- End diff --

Yes, I'd prefer it if that works, otherwise I'd like you to add some comments saying we can use `M8[ns]` instead of `datetime64[ns]`.
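For reference on the question in this thread: `M8` is simply NumPy's type-character spelling of `datetime64`, so the two dtype strings construct equal dtypes. A quick check (plain NumPy, independent of the PR's code):

```python
import numpy as np

# 'M8' is the one-character type code for datetime64; with the same time
# unit, both spellings produce the same dtype object.
a = np.dtype('M8[ns]')
b = np.dtype('datetime64[ns]')

print(a == b)   # True
print(b.kind)   # 'M' (the datetime64 kind code)
```

This is why comparing against `np.dtype('M8[ns]')` behaves identically to comparing against `np.dtype('datetime64[ns]')`, which is what the review asks to either change or document.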
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148709757

--- Diff: python/pyspark/sql/tests.py ---
@@ -2592,6 +2592,16 @@ def test_create_dataframe_from_array_of_long(self):
         df = self.spark.createDataFrame(data)
         self.assertEqual(df.first(), Row(longarray=[-9223372036854775808, 0, 9223372036854775807]))

+    @unittest.skipIf(not _have_pandas, "Pandas not installed")
+    def test_create_dataframe_from_pandas_with_timestamp(self):
+        import pandas as pd
+        from datetime import datetime
+        pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)],
+                            "d": [pd.Timestamp.now().date()]})
+        df = self.spark.createDataFrame(pdf)
--- End diff --

What if we specify the schema? For example:

```
df = self.spark.createDataFrame(pdf, "ts timestamp, d date")
```
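The NOTE in the diff earlier in this thread (SPARK-22417) can be reproduced with plain NumPy, no Spark required: at nanosecond resolution the record values come back as raw integers, while after converting to microsecond resolution they come back as Python `datetime` objects. A minimal sketch:

```python
import numpy as np
from datetime import datetime

ts = np.array(['2017-10-31T01:01:01'], dtype='datetime64[ns]')

# At nanosecond resolution, .item() yields a plain int (nanoseconds since
# the epoch), because datetime.datetime cannot represent nanoseconds.
print(type(ts[0].item()))   # <class 'int'>

# After casting to microsecond resolution, .item() yields datetime objects,
# which is what the conversion in the PR relies on.
us = ts.astype('datetime64[us]')
print(type(us[0].item()))   # <class 'datetime.datetime'>
```

The same rule applies to the tuples produced by `np.record.tolist()` on a record array obtained from `pandas.DataFrame.to_records()`, which is the code path under review.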
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709163

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -202,6 +202,15 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
       dataset.as[LabeledPoint], estimator, modelEquals, 42L)
   }

+  test("prediction on single instance") {
+    val trainer = new LinearSVC()
+    val model = trainer.fit(smallBinaryDataset)
+    model.transform(smallBinaryDataset).select("features", "prediction").collect().foreach {
+      case Row(features: Vector, prediction: Double) =>
+        assert(prediction ~== model.predict(features) relTol 1E-5)
--- End diff --

Could you please check for exact equality here (& in other tests)?
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148708939

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     Vector, NaiveBayesModel](model, testDataset)
   }

+  test("prediction on single instance") {
+    val nPoints = 1000
+    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+    val thetaArray = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10)  // label 2
+    ).map(_.map(math.log))
+    val pi = Vectors.dense(piArray)
+    val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+    val testDataset =
+      generateNaiveBayesInput(piArray, thetaArray, nPoints, seed, "multinomial").toDF()
+    val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
+    val model = nb.fit(testDataset)
+
+    validateModelFit(pi, theta, model)
--- End diff --

Do we need lines 184-186? They seem unrelated to what we want to test (that `predict` produces the same result as `transform` on a single instance). Similarly, I don't think we need to create `piArray`, `thetaArray`, `pi`, `theta`, etc; this test should just fit a model on a dataset and compare the fitted model's `predict` and `transform` outputs.
[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19381#discussion_r148709125

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -165,6 +165,35 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
     Vector, NaiveBayesModel](model, testDataset)
   }

+  test("prediction on single instance") {
+    val nPoints = 1000
+    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
+    val thetaArray = Array(
+      Array(0.70, 0.10, 0.10, 0.10), // label 0
+      Array(0.10, 0.70, 0.10, 0.10), // label 1
+      Array(0.10, 0.10, 0.70, 0.10)  // label 2
+    ).map(_.map(math.log))
+    val pi = Vectors.dense(piArray)
+    val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
+
+    val testDataset =
--- End diff --

Suggestion: rename to `trainDataset`?
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148706148 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19625: [SPARK-22407][WEB-UI] Add rdd id column on storage page ...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19625 Please upload a screenshot to the PR.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 @sethah The build passed :) and I have made the changes as suggested (setting maxIter and maxDepth). Pinging @MLnick or @jkbradley so we can move ahead with this.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Merged build finished. Test PASSed.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18118 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83372/ Test PASSed.
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18118

**[Test build #83372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83372/testReport)** for PR 18118 at commit [`0e9507e`](https://github.com/apache/spark/commit/0e9507e13a352b59798993228709bf7654747d0c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19642 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83366/ Test PASSed.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19642 Merged build finished. Test PASSed.
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19642

**[Test build #83366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83366/testReport)** for PR 19642 at commit [`2711eaf`](https://github.com/apache/spark/commit/2711eaf25f55c954842ee9498f9706df33d69e1a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r148702816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,456 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, lit, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'keep' (invalid data produces a vector of zeros) or 'error' (throw an error). 
+ * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'keep' (invalid data produces a vector of zeros) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) + + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputColNames = $(inputCols) +val outputColNames = $(outputCols) +val existingFields = schema.fields + +require(inputColNames.length == outputColNames.length, + s"The number of input columns ${inputColNames.length} must be the same as the number of " + +s"output columns ${outputColNames.length}.") + +inputColNames.zip(outputColNames).map { case (inputColName, outputColName) => + require(schema(inputColName).dataType.isInstanceOf[NumericType], +s"Input column must be of type NumericType but got ${schema(inputColName).dataType}") + require(!existingFields.exists(_.name == outputColName), +s"Output column $outputColName already exists.") +} + +// Prepares output columns with proper attributes by examining input columns. 
+val inputFields = $(inputCols).map(schema(_)) + +val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) => + OneHotEncoderCommon.transformOutputColumnSchema( +inputField, $(dropLast), outputColName) +} +StructType(schema.fields ++ outputFields) + } +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to
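The scaladoc quoted above describes the encoding rule that OneHotEncoderEstimator implements. A minimal Python sketch of that rule (a hypothetical `one_hot` helper for illustration, not Spark's implementation): with 5 categories and the default `dropLast`, the output vector has 4 entries, index 2.0 maps to a single one-value, and the last category maps to the all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector, mimicking the
    documented dropLast semantics (sketch only)."""
    # Dropping the last category keeps the encoded columns from always
    # summing to one, which would make them linearly dependent.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[int(index)] = 1.0
    return vec

print(one_hot(2.0, 5))   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))   # [0.0, 0.0, 0.0, 0.0] -- last category dropped
```

With `drop_last=False` the vector simply has one entry per category, matching the `dropLast` parameter described in the diff.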
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19649

**[Test build #83375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83375/testReport)** for PR 19649 at commit [`a3867b7`](https://github.com/apache/spark/commit/a3867b78c1a64f7d3196aaef6ab63db740dcc758).
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148702060 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148701873

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -236,12 +252,17 @@ object CrossValidator extends MLReadable[CrossValidator] {
 class CrossValidatorModel private[ml] (
     @Since("1.4.0") override val uid: String,
     @Since("1.2.0") val bestModel: Model[_],
-    @Since("1.5.0") val avgMetrics: Array[Double])
+    @Since("1.5.0") val avgMetrics: Array[Double],
+    @Since("2.3.0") val subModels: Option[Array[Array[Model[_]]]])
--- End diff --

So we also need a `setSubModels` method for setting it.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test PASSed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648

**[Test build #83371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83371/testReport)** for PR 19648 at commit [`5eee170`](https://github.com/apache/spark/commit/5eee170bfa67b7dd75b4a93c30559c354b99b541).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83371/ Test PASSed.
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148701451

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     instr.logParams(numFolds, seed, parallelism)
     logTuningParams(instr)

+    val collectSubModelsParam = $(collectSubModels)
+
+    var subModels: Option[Array[Array[Model[_]]]] = if (collectSubModelsParam) {
--- End diff --

@holdenk @jkbradley I already thought about this issue. The reason I use this approach is:
1) When `$(collectSubModels) == false`, the `modelFutures` and `foldMetricFutures` will be executed in a pipelined way. This ensures each `model` generated in `modelFutures` is released in time, so the maximum memory cost is `numParallelism * sizeof(model)`. If we instead collected `modelFutures`, the memory cost would grow to `$(estimatorParamMaps).length * sizeof(model)`. This is a serious issue that was discussed before.
2) IMO the mutation on L145 won't hurt performance, and it does not need anything like a lock; there is no race condition.
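The memory argument in the comment above can be sketched outside Spark. The following Python illustration (with hypothetical `train`/`evaluate` stand-ins, not Spark's Scala code) shows the pipelined pattern: each task fits a model and immediately reduces it to a metric, so a model never outlives its task and at most `num_parallelism` models exist at once, instead of one per entry in the parameter grid.

```python
from concurrent.futures import ThreadPoolExecutor

def train(params):
    # Stand-in for fitting one model: returns a (potentially large) object.
    return {"params": params, "weights": [0.0] * 4}

def evaluate(model):
    # Stand-in for computing a fold metric from a fitted model.
    return sum(model["weights"]) + model["params"]

def train_and_evaluate(params):
    model = train(params)   # the model lives only inside this task...
    return evaluate(model)  # ...and only the scalar metric escapes it

param_grid = list(range(8))
num_parallelism = 2

with ThreadPoolExecutor(max_workers=num_parallelism) as pool:
    # Pipelined: at most num_parallelism models are in memory at any time.
    metric_futures = [pool.submit(train_and_evaluate, p) for p in param_grid]
    metrics = [f.result() for f in metric_futures]

print(metrics)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

Collecting all fitted models first and evaluating afterwards would instead keep `len(param_grid)` models alive simultaneously, which is the `$(estimatorParamMaps).length * sizeof(model)` cost the comment warns about.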
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test PASSed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648

**[Test build #83368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83368/testReport)** for PR 19648 at commit [`0fd9a5c`](https://github.com/apache/spark/commit/0fd9a5c617053ced4210432f261a0053a04442dd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83368/ Test PASSed.
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19641 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83374/ Test FAILed.
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19641 Merged build finished. Test FAILed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148700390 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19641 retest this please.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148700189 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { +
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83373/ Test FAILed.
[GitHub] spark issue #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19649 Build finished. Test FAILed.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148699685 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- Is this a standard way to define singletons in Python? I've seen several methods searching online. Numpy uses a somewhat different approach: https://github.com/numpy/numpy/blob/d75b86c0c49f7eb3ec60564c2e23b3ff237082a2/numpy/_globals.py What happens when this package or module gets reloaded? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19649: [SPARK-22405][SQL] Add more ExternalCatalogEvent
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/19649 [SPARK-22405][SQL] Add more ExternalCatalogEvent ## What changes were proposed in this pull request? We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events, like "CreateDatabaseEvent", for a custom SparkListener to use. But there are still some events missing, like an alter database event and an alter table event. So here I propose to add new ExternalCatalogEvents. ## How was this patch tested? Enriched the current UT and tested on a local cluster. CC @hvanhovell please let me know your comments about the current proposal, thanks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-22405 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19649 commit 5c628be6b6838b224a27e06731f686a5182e1bad Author: jerryshao Date: 2017-11-03T01:48:48Z Add more ExternalCatalogEvent
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18118 **[Test build #83372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83372/testReport)** for PR 18118 at commit [`0e9507e`](https://github.com/apache/spark/commit/0e9507e13a352b59798993228709bf7654747d0c).
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148696592 --- Diff: mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala --- @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.nio.file.Paths +import java.util.Arrays + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.image.ImageSchema._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.Row +import org.apache.spark.sql.types._ + +class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext { + // Single column of images named "image" + private lazy val imagePath = "../data/mllib/images" + + test("Smoke test: create basic ImageSchema dataframe") { +val origin = "path" +val width = 1 +val height = 1 +val nChannels = 3 +val data = Array[Byte](0, 0, 0) +val mode = ocvTypes("CV_8UC3") + +// Internal Row corresponds to image StructType +val rows = Seq(Row(Row(origin, height, width, nChannels, mode, data)), + Row(Row(null, height, width, nChannels, mode, data))) +val rdd = sc.makeRDD(rows) +val df = spark.createDataFrame(rdd, ImageSchema.imageSchema) + +assert(df.count === 2, "incorrect image count") +assert(df.schema("image").dataType == columnSchema, "data do not fit ImageSchema") + } + + test("readImages count test") { +var df = readImages(imagePath, recursive = false) +assert(df.count === 1) + +df = readImages(imagePath, recursive = true, dropImageFailures = false) +assert(df.count === 9) + +df = readImages(imagePath, recursive = true, dropImageFailures = true) +val countTotal = df.count +assert(countTotal === 7) + +df = readImages(imagePath, recursive = true, sampleRatio = 0.5, dropImageFailures = true) --- End diff -- This would be a good reason to have a seed: We can make the test deterministic to avoid flakiness (from occasionally having an actual sampleRatio of 0 or 1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695824 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: --- End diff -- If this whole class is marked Experimental, then it's 
OK not to mark each field/method Experimental.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695893 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. + +.. autoclass:: _ImageSchema + :members: +""" + +from pyspark import SparkContext +from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string +from pyspark.sql import DataFrame, SparkSession +import numpy as np --- End diff -- style: order imports with standard python libraries imported before pyspark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695760 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { --- End diff -- This is why we need first-class support for UDTs! :) If this will stay public, can you please add some Scaladoc to it and also review the fields to tighten the privacy (probably to ```private[ml]```) where reasonable? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148698771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148698295 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( --- End diff -- There isn't a great option for Scala- and Java-friendly maps. So far, we tend to prefix with "java" as in "javaOcvTypes" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148694771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. 
+ * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- Tell me if this SamplePathFilter has already been discussed; I may have missed it in the many comments above. I'm worried about it being deterministic, but I'm also not that familiar with the Hadoop APIs being used here. * If the DataFrame is reloaded (recomputed), or if a task fails and that partition is recomputed, then will random.nextDouble() really produce the same results? * I'd expect we'd need to set a seed, as @thunterdb suggested. I'm fine with a fixed seed, though it'd be nice to have it configurable in the future. * Even if we set a seed, then is random.nextDouble computed in a fixed order over each partition? We've run into a lot of issues in both RDD and DataFrame sampling methods with non-deterministic results, so I want to be careful here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
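One way to sidestep the determinism worry raised above is to make the keep/drop decision a pure function of the path and a seed, rather than a draw from a shared RNG whose sequence depends on evaluation order. This is a minimal hash-based sketch, not Spark's actual `SamplePathFilter`:

```python
import zlib

def keep_path(path, sample_ratio, seed=0):
    # Hash the (path, seed) pair into [0, 1); the decision depends only on
    # these two inputs, so task retries and partition recomputation always
    # see the exact same sample.
    h = zlib.crc32(("%s:%d" % (path, seed)).encode("utf-8"))  # unsigned 32-bit
    return h / 2.0 ** 32 < sample_ratio

paths = ["images/%03d.jpg" % i for i in range(100)]
sampled = [p for p in paths if keep_path(p, 0.5, seed=42)]

# Re-running the filter reproduces the same subset, regardless of order.
assert sampled == [p for p in paths if keep_path(p, 0.5, seed=42)]
```

The trade-off is that the achieved fraction only approximates `sample_ratio` (hashing gives no exact count), but the sample is stable across recomputations, which is the property the review asks for.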
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148693252 --- Diff: python/pyspark/ml/tests.py --- @@ -1818,6 +1819,24 @@ def tearDown(self): del self.data +class ImageReaderTest(SparkSessionTestCase): + +def test_read_images(self): +data_path = 'python/test_support/image/kittens' --- End diff -- Could you please move the images used in Python tests to the data/mllib/images/ directory too? Python tests already use data in data/mllib/
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695558 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava --- End diff -- This can be package private. (Python ignores package private limitations.) 
Or rename it as javaOcvTypes.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695330 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148695505 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( --- End diff -- For legibility, it'd be nice to define the imageFields values here (inline). You could then define imageFields by extracting those values from columnSchema. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
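The refactoring suggested above — write each field name once, inline in the schema, and derive the name list from it — can be sketched in plain Python without PySpark. The `namedtuple` below is only a stand-in for Spark's `StructField`, used so the example runs anywhere:

```python
from collections import namedtuple

# Illustrative stand-in for org.apache.spark.sql.types.StructField.
StructField = namedtuple("StructField", ["name", "dataType", "nullable"])

# Field names written once, inline in the schema definition...
column_schema = [
    StructField("origin", "string", True),
    StructField("height", "int", False),
    StructField("width", "int", False),
    StructField("nChannels", "int", False),
    StructField("mode", "int", False),     # OpenCV-compatible type
    StructField("data", "binary", False),  # row-wise BGR bytes
]

# ...and the name list derived from the schema, never duplicated.
image_fields = [f.name for f in column_schema]
```

With this shape, `imageFields` and `columnSchema` cannot drift apart, because one is computed from the other (in Scala, `columnSchema.fieldNames` gives the same derivation).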
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 Jenkins test this please
[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...
Github user pralabhkumar commented on the issue: https://github.com/apache/spark/pull/18118 @sethah It's still failing; I don't think the issue is on my side. Please help
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648 **[Test build #83371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83371/testReport)** for PR 19648 at commit [`5eee170`](https://github.com/apache/spark/commit/5eee170bfa67b7dd75b4a93c30559c354b99b541).
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19648 Jenkins, test this please.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Merged build finished. Test FAILed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19648 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83370/ Test FAILed.
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18538 @jkbradley @mgaido91 I just sent #19648 to move test data to data/mllib, please feel free to review it. Thanks.
[GitHub] spark pull request #19646: [SPARK-22147][PYTHON] Fix for createDataFrame fro...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19646#discussion_r148696316 --- Diff: python/pyspark/sql/session.py --- @@ -512,9 +512,39 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +import numpy as np + +# Convert pandas.DataFrame to list of numpy records +np_records = data.to_records(index=False) + +# Check if any columns need to be fixed for Spark to infer properly +record_type_list = None +if schema is None and len(np_records) > 0: +cur_dtypes = np_records[0].dtype +col_names = cur_dtypes.names +record_type_list = [] +has_rec_fix = False +for i in xrange(len(cur_dtypes)): --- End diff -- oh, session.py didn't define xrange for Python 3.
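The `xrange` issue flagged above is the classic Python 2/3 gap: Python 3 removed `xrange` because `range` became lazy. A conventional compatibility shim — a sketch of the usual pattern, not necessarily what session.py ends up doing — looks like this:

```python
import sys

# Python 3 removed xrange (range is now lazy), so code written for
# Python 2 that calls xrange raises NameError on Python 3 unless a
# module-level alias like this exists.
if sys.version_info[0] >= 3:
    xrange = range

# Works on both major versions:
squares = [i * i for i in xrange(4)]
```

The alternative, of course, is to call `range` directly, which is fine wherever the iterable is consumed immediately (as in the `for i in xrange(len(cur_dtypes))` loop in the diff).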
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19479 Merged build finished. Test FAILed.
[GitHub] spark issue #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19648 **[Test build #83368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83368/testReport)** for PR 19648 at commit [`0fd9a5c`](https://github.com/apache/spark/commit/0fd9a5c617053ced4210432f261a0053a04442dd).
[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19479 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83369/ Test FAILed.
[GitHub] spark pull request #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvalu...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/19648 [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvaluatorSuite test data to data/mllib. ## What changes were proposed in this pull request? Move ```ClusteringEvaluatorSuite``` test data (iris) to data/mllib, to avoid creating a new folder. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-14516 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19648.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19648 commit 0fd9a5c617053ced4210432f261a0053a04442dd Author: Yanbo Liang Date: 2017-11-03T01:03:23Z Move ClusteringEvaluatorSuite test data to data/mllib.