[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85495/
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Merged build finished. Test PASSed.
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19977

**[Test build #85495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85495/testReport)** for PR 19977 at commit [`57a9d1e`](https://github.com/apache/spark/commit/57a9d1e9da21d56873c97eac08797499199a0c7b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85502/testReport)** for PR 19683 at commit [`1c6626a`](https://github.com/apache/spark/commit/1c6626acad080404a73519735bc1b3a0fbf6e303).
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
Github user uzadude commented on a diff in the pull request: https://github.com/apache/spark/pull/19683#discussion_r159033662

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -85,11 +84,19 @@ case class GenerateExec(
     val numOutputRows = longMetric("numOutputRows")
     child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
       val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
-      val rows = if (join) {
+      val rows = if (requiredChildOutput.nonEmpty) {
+
+        val pruneChildForResult: InternalRow => InternalRow =
+          if ((child.outputSet -- requiredChildOutput).isEmpty) {
--- End diff --

Wouldn't it always return false? Or should I use `child.output == AttributeSet(requiredChildOutput)`?
[GitHub] spark pull request #20082: [SPARK-22897][CORE]: Expose stageAttemptId in Tas...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20082#discussion_r159033482

--- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
@@ -150,6 +150,13 @@ abstract class TaskContext extends Serializable {
    */
   def stageId(): Int

+  /**
+   * An ID that is unique to the stage attempt that this task belongs to. It represents how many
+   * times the stage has been attempted. The first stage attempt is assigned stageAttemptId = 0,
+   * and the ID increases by one for each subsequent attempt.
+   */
+  def stageAttemptId(): Int
--- End diff --

My concern is that, internally we use `stageAttemptId`, and internally we call `TaskContext.taskAttemptId` `taskId`. However, end users don't know the internal code, and they are more familiar with `TaskContext`. I think the naming should be consistent with the public API `TaskContext`, instead of the internal code.
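For orientation, a minimal sketch of how an end user would read the proposed field from inside a task, assuming the method lands on `TaskContext` as written above (`rdd` stands in for any RDD):

```scala
import org.apache.spark.TaskContext

// Inside a task: stageAttemptId distinguishes reruns of the same stage, e.g.
// for making side effects idempotent across stage retries.
rdd.foreachPartition { _ =>
  val ctx = TaskContext.get()
  println(s"stage=${ctx.stageId()} stageAttempt=${ctx.stageAttemptId()} " +
    s"taskAttempt=${ctx.taskAttemptId()}")
}
```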
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19683#discussion_r159033021

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -85,11 +84,19 @@ case class GenerateExec(
     val numOutputRows = longMetric("numOutputRows")
     child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
       val generatorNullRow = new GenericInternalRow(generator.elementSchema.length)
-      val rows = if (join) {
+      val rows = if (requiredChildOutput.nonEmpty) {
+
+        val pruneChildForResult: InternalRow => InternalRow =
+          if ((child.outputSet -- requiredChildOutput).isEmpty) {
--- End diff --

just `child.output == requiredChildOutput`?
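To make the two candidate conditions concrete (purely illustrative, using the names from the diff):

```scala
// Set containment: true iff every attribute in child.output is also in
// requiredChildOutput; it ignores ordering, so it cannot tell an identical
// layout apart from a reordered one.
val setBasedCheck: Boolean = (child.outputSet -- requiredChildOutput).isEmpty

// Sequence equality: order-sensitive, element by element -- the stricter
// condition under which the pruning projection can safely be skipped.
val seqBasedCheck: Boolean = child.output == requiredChildOutput
```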
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19683#discussion_r159032990

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -47,8 +47,13 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
  * terminate().
  *
  * @param generator the generator expression
- * @param join when true, each output row is implicitly joined with the input tuple that produced
- *             it.
+ * @param requiredChildOutput this parameter starts as Nil and gets filled by the Optimizer.
--- End diff --

we don't need to duplicate the comment here, just say `required attributes from child output`
[GitHub] spark pull request #20062: [SPARK-22892] [SQL] Simplify some estimation logi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20062
[GitHub] spark issue #20062: [SPARK-22892] [SQL] Simplify some estimation logic by us...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20062 thanks, merging to master!
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85501/testReport)** for PR 19683 at commit [`8f06dda`](https://github.com/apache/spark/commit/8f06dda16c692cdc2204eddaee5ae3ba2321258d).
[GitHub] spark pull request #20105: [SPARK-22920][SPARKR] sql functions for current_d...
GitHub user felixcheung reopened a pull request: https://github.com/apache/spark/pull/20105

[SPARK-22920][SPARKR] sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString

## What changes were proposed in this pull request?

Add sql functions

## How was this patch tested?

manual, unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rsqlfuncs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20105.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20105

commit c8e118a4c6e0f9f05d3c48c3d715da82b4ebd334
Author: Felix Cheung
Date: 2017-12-28T11:11:41Z

    ltrim/rtrim/trim with trimString + current_date() + current_timestamp()

commit 284c74a7e74fb24024cf4cc6557b30e1169cf445
Author: Felix Cheung
Date: 2017-12-28T11:12:18Z

    NeedsCompilation in DESCRIPTION

commit 1f2fac3afc376fd61e54cb12b6c34f60f5522280
Author: Felix Cheung
Date: 2017-12-28T21:44:29Z

    fix example
[GitHub] spark pull request #20105: [SPARK-22920][SPARKR] sql functions for current_d...
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/20105
[GitHub] spark pull request #20020: [SPARK-22834][SQL] Make insertion commands have r...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20020
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20020 thanks, merging to master!
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19683 LGTM except 2 comments
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19683#discussion_r159031802

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala ---
@@ -57,20 +62,19 @@ private[execution] sealed case class LazyIterator(func: () => TraversableOnce[In
  */
 case class GenerateExec(
     generator: Generator,
-    join: Boolean,
+    unrequiredChildIndex: Seq[Int],
--- End diff --

The physical plan can just take `requiredChildOutput`, and in the planner we can just do

```
case g @ logical.Generate(...) => GenerateExec(..., g.requiredChildOutput)
```
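Spelled out with plausible (not verbatim) parameter lists — the elided arguments above are not given in this thread, so the names below are illustrative only:

```scala
// In the planning strategy: plan the logical Generate, handing the physical
// operator the pre-computed required child attributes instead of raw indices.
case g @ logical.Generate(generator, _, outer, _, generatorOutput, child) =>
  execution.GenerateExec(
    generator, g.requiredChildOutput, outer, generatorOutput, planLater(child)) :: Nil
```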
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20010 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85497/
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20010 Merged build finished. Test FAILed.
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20010

**[Test build #85497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85497/testReport)** for PR 20010 at commit [`86e1929`](https://github.com/apache/spark/commit/86e1929c490861d9e93ef34abd52c442f99f31a9).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19683: [SPARK-21657][SQL] optimize explode quadratic mem...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19683#discussion_r159031526

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -73,25 +73,32 @@ case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extend
  * their output.
  *
  * @param generator the generator expression
- * @param join when true, each output row is implicitly joined with the input tuple that produced
- *             it.
+ * @param unrequiredChildIndex this parameter starts as Nil and gets filled by the Optimizer.
+ *                             It's used as an optimization for omitting data generation that will
+ *                             be discarded next by a projection.
+ *                             A common use case is when we explode(array(..)) and are interested
+ *                             only in the exploded data and not in the original array. Before this
+ *                             optimization the array got duplicated for each of its elements,
+ *                             causing O(n^2) memory consumption. (see [SPARK-21657])
  * @param outer when true, each input row will be output at least once, even if the output of the
  *              given `generator` is empty.
  * @param qualifier Qualifier for the attributes of generator(UDTF)
  * @param generatorOutput The output schema of the Generator.
  * @param child Children logical plan node
  */
 case class Generate(
-    generator: Generator,
-    join: Boolean,
-    outer: Boolean,
-    qualifier: Option[String],
-    generatorOutput: Seq[Attribute],
-    child: LogicalPlan)
+  generator: Generator,
+  unrequiredChildIndex: Seq[Int],
+  outer: Boolean,
+  qualifier: Option[String],
+  generatorOutput: Seq[Attribute],
+  child: LogicalPlan)
--- End diff --

wrong indentation?
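A hypothetical repro of the quadratic blow-up described in the new doc above (not from the PR; assumes a live `spark` session):

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

// Only the exploded elements are selected; the source array is discarded by
// the projection. Before this optimization, every generated row still carried
// the full 10k-element array -- O(n^2) memory for an n-element array.
val df = Seq(Seq.range(0, 10000)).toDF("arr")
df.select(explode($"arr").as("elem")).count()
```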
[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85498/
[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20113 Merged build finished. Test PASSed.
[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20113

**[Test build #85498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85498/testReport)** for PR 20113 at commit [`408bfed`](https://github.com/apache/spark/commit/408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20109: [SPARK-22891][SQL] Make hive client creation thre...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20109
[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20109 Thanks! Merged to master.
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85500/testReport)** for PR 19683 at commit [`288aa73`](https://github.com/apache/spark/commit/288aa733e2dce341aca1c60d6800935564fb9843).
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20114 How about simply returning `false` from `ArrowVectorAccessor.isNullAt(int rowId)` when `accessor.getValueCount() > 0 && accessor.getValidityBuffer().capacity() == 0` without modifying the buffer?
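Sketched in Scala for brevity (the accessor classes in Spark are Java); `accessor` is the wrapped Arrow vector, and the calls shown are the ones referenced in this thread:

```scala
// Per the Arrow spec, an empty validity buffer on a non-empty vector means
// "all values are non-null" -- so answer false without touching the buffer.
override def isNullAt(rowId: Int): Boolean = {
  if (accessor.getValueCount > 0 && accessor.getValidityBuffer.capacity() == 0) {
    false
  } else {
    accessor.isNull(rowId)
  }
}
```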
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user uzadude commented on the issue: https://github.com/apache/spark/pull/19683 seems reasonable, let's do that.
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20020 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85493/
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20020 Merged build finished. Test PASSed.
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20020

**[Test build #85493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85493/testReport)** for PR 20020 at commit [`18ec016`](https://github.com/apache/spark/commit/18ec01638b9da7f8150e3ea35c2876d6d1f41f3d).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user Tagar commented on the issue: https://github.com/apache/spark/pull/19683

A similar exception to the one in the failing unit tests was fixed in [SPARK-18300](https://issues.apache.org/jira/browse/SPARK-18300):

> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.Attribute

https://github.com/apache/spark/pull/15892

Not sure if this is directly applicable or helpful here, though.
[GitHub] spark issue #19892: [SPARK-22797][PySpark] Bucketizer support multi-column
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 ping @MLnick ?
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114

ping @ueshin @HyukjinKwon

Unfortunately, there was a bug in the Arrow 0.8.0 release on the Java side (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required me to make a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be ok putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.

The issue was that the Arrow spec states that if the validity buffer is empty, then all the values are non-null. In Arrow 0.8.0, the C++/Python side started sending buffers this way, and the Arrow ListVector was not handling it properly, thinking instead that there were no valid values.

The workaround I added here checks whether the ListVector has a value count > 0 and an empty validity buffer. That means all the values are non-null, so it allocates a new validity buffer with all bits set. For the non-UDF paths (toPandas and createDataFrame) this only needs to be done once, but for UDFs each batch read loads new buffers into the Arrow VectorSchemaRoot, so it needs to be checked after each read. The simplest place to put the workaround to cover these cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden.

Let me know what you guys think, thanks!
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85499/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b).
[GitHub] spark issue #20070: SPARK-22896 Improvement in String interpolation
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20070 @srowen please do re-run the build.
[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159028207

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
--- End diff --

That's a good point that we could rename "params" to be clearer. How about "paramMaps"?
[GitHub] spark pull request #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support f...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20114

[SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType

## What changes were proposed in this pull request?

This change adds `ArrayType` support for working with Arrow in pyspark when creating a DataFrame, calling `toPandas()`, and using vectorized `pandas_udf`.

## How was this patch tested?

Added new Python unit tests using Array data.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-ArrayType-support-SPARK-22530

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20114.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20114

commit 50fa54c5b04455729b019c660ab8e86c903bda44
Author: Bryan Cutler
Date: 2017-11-15T23:44:23Z

    wip, toPandas works with pyarrow 0.7.1

commit a149352d0c60882bb6692cd43d2fb60c8dddb07b
Author: Bryan Cutler
Date: 2017-12-01T20:02:16Z

    createDataFrame test now working

commit 36faab4d7a23421968e1885dc6f2f47ac20c0ce0
Author: Bryan Cutler
Date: 2017-12-23T08:21:34Z

    using is_list to check type

commit b0c79f108acf3ca91dd931bb9be45e4bbcf840a6
Author: Bryan Cutler
Date: 2017-12-24T07:06:06Z

    Using a workaround for ListVector validity buffer, ArrowTests passing

commit f1bc9a5d8ba09cf6d702269b2418697184ef5690
Author: Bryan Cutler
Date: 2017-12-29T05:54:44Z

    ArrayType working in vectorized udfs

commit d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b
Author: Bryan Cutler
Date: 2017-12-29T06:04:19Z

    fix import order
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r159028159

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,519 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+    with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+    "How to handle invalid data " +
+    "Options are 'keep' (invalid data presented as an extra categorical feature) " +
+    "or error (throw an error).",
+    ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+    new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(
+      schema: StructType, dropLast: Boolean, keepInvalid: Boolean): StructType = {
+    val inputColNames = $(inputCols)
+    val outputColNames = $(outputCols)
+    val existingFields = schema.fields
+
+    require(inputColNames.length == outputColNames.length,
+      s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+      s"output columns ${outputColNames.length}.")
+
+    // Input columns must be NumericType.
+    inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+    // Prepares output columns with proper attributes by examining input columns.
+    val inputFields = $(inputCols).map(schema(_))
+
+    val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+      OneHotEncoderCommon.transformOutputColumnSchema(
+        inputField, outputColName, dropLast, keepInvalid)
+    }
+    outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+      SchemaUtils.appendColumn(newSchema, outputField)
+    }
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to `[0
[GitHub] spark issue #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel sav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20113 **[Test build #85498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85498/testReport)** for PR 20113 at commit [`408bfed`](https://github.com/apache/spark/commit/408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd).
[GitHub] spark pull request #20113: [SPARK-22905][ML][FollowUp] Fix GaussianMixtureMo...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20113

[SPARK-22905][ML][FollowUp] Fix GaussianMixtureModel save

## What changes were proposed in this pull request?

make sure model data is stored in order. @WeichenXu123

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark gmm_save

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20113.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20113

commit 408bfed88cd237e5adbf42bd5b4fd2ccf875b5bd
Author: Zheng RuiFeng
Date: 2017-12-29T06:01:51Z

    create pr
[GitHub] spark pull request #19843: [SPARK-22644][ML][TEST] Make ML testsuite support...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19843#discussion_r159027996

--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTest.scala ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import java.io.File
+
+import org.scalatest.Suite
+
+import org.apache.spark.SparkContext
+import org.apache.spark.ml.Transformer
+import org.apache.spark.sql.{DataFrame, Encoder, Row}
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.streaming.StreamTest
+import org.apache.spark.sql.test.TestSparkSession
+import org.apache.spark.util.Utils
+
+trait MLTest extends StreamTest with TempDirectory { self: Suite =>
+
+  @transient var sc: SparkContext = _
+  @transient var checkpointDir: String = _
+
+  protected override def createSparkSession: TestSparkSession = {
+    new TestSparkSession(new SparkContext("local[2]", "MLlibUnitTest", sparkConf))
+  }
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    sc = spark.sparkContext
+    checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString
+    sc.setCheckpointDir(checkpointDir)
+  }
+
+  override def afterAll() {
--- End diff --

Actually, it's worse than this. I see a bunch of failures when I run multiple test suites at once, even when doing `sbt clean package` beforehand and without any tests which fail by themselves. Will test on master and complain on the dev list if it's an issue. (No need to respond here)
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159027858

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
@@ -51,31 +48,31 @@ class FeatureHasherSuite extends SparkFunSuite
   }

   test("feature hashing") {
--- End diff --

Rearranged this test so it checks each row independently.
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159027879

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
@@ -37,21 +36,28 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   }

   test("hashingTF") {
--- End diff --

ditto: rearranged to do validity check per-row
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19683 Now I feel it's a little hacky to introduce `Generate.unrequiredChildOutput`, as the attribute may get replaced by something else during optimization. How about `Generate.unrequiredChildIndex`?
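For reference, a sketch of how the index-based variant could look — close to the idea discussed here, though not claimed to be the final patch:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Generator}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

case class Generate(
    generator: Generator,
    unrequiredChildIndex: Seq[Int],
    outer: Boolean,
    qualifier: Option[String],
    generatorOutput: Seq[Attribute],
    child: LogicalPlan) extends UnaryNode {

  // Integer positions are stable across optimizer rewrites that replace
  // Attribute references, unlike a stored Seq[Attribute].
  def requiredChildOutput: Seq[Attribute] =
    child.output.zipWithIndex
      .filterNot { case (_, i) => unrequiredChildIndex.contains(i) }
      .map(_._1)

  override def output: Seq[Attribute] = requiredChildOutput ++ generatorOutput
}
```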
[GitHub] spark pull request #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20110
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20110 Thanks! merging to master.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20110 I confirmed the test passes after the patch in my local environment.
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20010 **[Test build #85497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85497/testReport)** for PR 20010 at commit [`86e1929`](https://github.com/apache/spark/commit/86e1929c490861d9e93ef34abd52c442f99f31a9).
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r159025626

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,519 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+    with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+    "How to handle invalid data " +
+    "Options are 'keep' (invalid data presented as an extra categorical feature) " +
+    "or error (throw an error).",
+    ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+    new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def validateAndTransformSchema(
+      schema: StructType, dropLast: Boolean, keepInvalid: Boolean): StructType = {
+    val inputColNames = $(inputCols)
+    val outputColNames = $(outputCols)
+    val existingFields = schema.fields
+
+    require(inputColNames.length == outputColNames.length,
+      s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+      s"output columns ${outputColNames.length}.")
+
+    // Input columns must be NumericType.
+    inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+    // Prepares output columns with proper attributes by examining input columns.
+    val inputFields = $(inputCols).map(schema(_))
+
+    val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+      OneHotEncoderCommon.transformOutputColumnSchema(
+        inputField, outputColName, dropLast, keepInvalid)
+    }
+    outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+      SchemaUtils.appendColumn(newSchema, outputField)
+    }
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to `[0.0,
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20110 LGTM for the change, but I'm not sure whether the test was indeed triggered or not.
[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85496/
[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20112 Merged build finished. Test PASSed.
[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20112

**[Test build #85496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)** for PR 20112 at commit [`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReadable,`
[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20111 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85494/
[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20111 Merged build finished. Test PASSed.
[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20111

**[Test build #85494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85494/testReport)** for PR 20111 at commit [`12b3dcf`](https://github.com/apache/spark/commit/12b3dcf13f90ea00c2a12ec186a5f3277e812095).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BinarizerSuite extends MLTest with DefaultReadWriteTest `
  * `class BucketedRandomProjectionLSHSuite extends MLTest with DefaultReadWriteTest `
  * `class BucketizerSuite extends MLTest with DefaultReadWriteTest `
  * `class ChiSqSelectorSuite extends MLTest with DefaultReadWriteTest `
  * `class CountVectorizerSuite extends MLTest with DefaultReadWriteTest `
  * `class DCTSuite extends MLTest with DefaultReadWriteTest `
  * `class ElementwiseProductSuite extends MLTest with DefaultReadWriteTest `
  * `class FeatureHasherSuite extends MLTest with DefaultReadWriteTest `
  * `class HashingTFSuite extends MLTest with DefaultReadWriteTest `
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20025 My understanding is that the reflection was used because we might be running against a different version of Hive, so we didn't control what was done inside `super.init`. However, now that we have inlined the Hive code, it is safe to call the `super.init` method. This is a cleaner way to fix the referred bug and other potential bugs, IMO.
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20097 Merged build finished. Test FAILed.
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20097 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85491/
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20097

**[Test build #85491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85491/testReport)** for PR 20097 at commit [`9ffb92c`](https://github.com/apache/spark/commit/9ffb92c28014a0469cb8e3f77bea2d7100a9416f).

* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20112 **[Test build #85496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)** for PR 20112 at commit [`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).
[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...
GitHub user MrBago opened a pull request: https://github.com/apache/spark/pull/20112

[SPARK-22734][ML][PySpark] Added Python API for VectorSizeHint.

(Please fill in changes proposed in this fix)
Python API for VectorSizeHint Transformer.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
doc-tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark vectorSizeHint-PythonAPI

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20112.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20112

commit 83bb7ded0d58d4173671904a452039b57bcbea3d
Author: Bago Amirbekian
Date: 2017-12-29T03:05:53Z

    Added Python API for VectorSizeHint.
[GitHub] spark pull request #20058: [SPARK-22922][ML][PySpark] Pyspark portion of the...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159024163

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
--- End diff --

We couldn't use `fit` because it's going to have the same signature as the existing `fit` method but return a different type (Iterator[(Int, Model)] instead of Seq[Model]). I was trying to be consistent with `Estimator.fit`, which uses the name `params` — a different name than the same argument has in Scala :/. Happy to change it.
[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159023958

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
+        """
+        Fits a model to the input dataset for each param map in params.
+
+        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+        :param params: A Sequence of param maps.
+        :return: A thread safe iterable which contains one model for each param map. Each
+                 call to `next(modelIterator)` will return `(index, model)` where model was fit
+                 using `params[index]`. Params maps may be fit in an order different than their
+                 order in params.
+
+        .. note:: DeveloperApi
+        .. note:: Experimental
+        """
+        estimator = self.copy()
+
+        def fitSingleModel(index):
+            return estimator.fit(dataset, params[index])
+
+        return FitMultipleIterator(fitSingleModel, len(params))
--- End diff --

The idea is you should be able to do something like this:

```
pool = ...
modelIter = estimator.fitMultiple(params)
rng = range(len(params))
for index, model in pool.imap_unordered(lambda _: next(modelIter), rng):
    pass
```

That's pretty much how I've set up cross validator to use it: https://github.com/apache/spark/pull/20058/files/fe3d6bddc3e9e50febf706d7f22007b1e0d58de3#diff-cbc8c36bfdd245e4e4d5bd27f9b95359R292

The reason for setting it up this way is so that, when appropriate, estimators can implement their own optimized `fitMultiple` methods that just need to return an "iterator" (a class with `__iter__` and `__next__`). For example, models that use `maxIter` and `maxDepth` params.
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19977 **[Test build #85495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85495/testReport)** for PR 19977 at commit [`57a9d1e`](https://github.com/apache/spark/commit/57a9d1e9da21d56873c97eac08797499199a0c7b).
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user advancedxy commented on the issue: https://github.com/apache/spark/pull/20082 ping @cloud-fan @jiangxb1987 @zsxwing, I think it's ready for merging.
[GitHub] spark issue #20111: [SPARK-22883][ML][TEST] Streaming tests for spark.ml.fea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20111 **[Test build #85494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85494/testReport)** for PR 20111 at commit [`12b3dcf`](https://github.com/apache/spark/commit/12b3dcf13f90ea00c2a12ec186a5f3277e812095).
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159022777

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
@@ -17,26 +17,23 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.attribute.AttributeGroup
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamsSuite
-import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
 import org.apache.spark.ml.util.TestingUtils._
-import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
 import org.apache.spark.sql.functions.col
 import org.apache.spark.sql.types._

-class FeatureHasherSuite extends SparkFunSuite
-  with MLlibTestSparkContext
-  with DefaultReadWriteTest {
+class FeatureHasherSuite extends MLTest with DefaultReadWriteTest {

   import testImplicits._
   import HashingTFSuite.murmur3FeatureIdx

-  implicit private val vectorEncoder = ExpressionEncoder[Vector]()
+  implicit private val vectorEncoder: ExpressionEncoder[Vector] = ExpressionEncoder[Vector]()
--- End diff --

scala style
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159022766

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ElementwiseProductSuite.scala ---
@@ -17,13 +17,31 @@

 package org.apache.spark.ml.feature

-import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.linalg.Vectors
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.sql.Row

-class ElementwiseProductSuite
-  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class ElementwiseProductSuite extends MLTest with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  test("streaming transform") {
--- End diff --

No existing unit test to use
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159022677

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
@@ -163,18 +162,19 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
       assert(expected.selectedFeatures === actual.selectedFeatures)
     }
   }
-}

-object ChiSqSelectorSuite {
-
-  private def testSelector(selector: ChiSqSelector, dataset: Dataset[_]): ChiSqSelectorModel = {
-    val selectorModel = selector.fit(dataset)
-    selectorModel.transform(dataset).select("filtered", "topFeature").collect()
-      .foreach { case Row(vec1: Vector, vec2: Vector) =>
+  private def testSelector(selector: ChiSqSelector, data: Dataset[_]): ChiSqSelectorModel = {
--- End diff --

Moved from object to class b/c this needed testTransformer from the MLTest mix-in
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20111#discussion_r159022657

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala ---
@@ -98,6 +97,21 @@ class BucketedRandomProjectionLSHSuite
     MLTestingUtils.checkCopyAndUids(brp, brpModel)
   }

+  test("BucketedRandomProjectionLSH: streaming transform") {
--- End diff --

No existing test to use.
[GitHub] spark pull request #20111: [SPARK-22883][ML][TEST] Streaming tests for spark...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/20111

[SPARK-22883][ML][TEST] Streaming tests for spark.ml.feature, from A to H

## What changes were proposed in this pull request?

Adds structured streaming tests using testTransformer for these suites:
* BinarizerSuite
* BucketedRandomProjectionLSHSuite
* BucketizerSuite
* ChiSqSelectorSuite
* CountVectorizerSuite
* DCTSuite.scala
* ElementwiseProductSuite
* FeatureHasherSuite
* HashingTFSuite

## How was this patch tested?

It tests itself because it is a bunch of tests!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark SPARK-22883-streaming-featureAM

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20111.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20111

commit 12b3dcf13f90ea00c2a12ec186a5f3277e812095
Author: Joseph K. Bradley
Date: 2017-12-29T03:31:17Z

    added streaming tests for first quarter of spark.ml.feature
[GitHub] spark issue #20010: [SPARK-22826][SQL] findWiderTypeForTwo Fails over Struct...
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/20010

This doesn't seem like a regular test error, does it? @bdrillard, maybe you can push a commit and trigger the tests again.
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20020 **[Test build #85493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85493/testReport)** for PR 20020 at commit [`18ec016`](https://github.com/apache/spark/commit/18ec01638b9da7f8150e3ea35c2876d6d1f41f3d).
[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20109 Merged build finished. Test PASSed.
[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20109 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85490/ Test PASSed.
[GitHub] spark issue #20109: [SPARK-22891][SQL] Make hive client creation thread safe
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20109 **[Test build #85490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85490/testReport)** for PR 20109 at commit [`163d344`](https://github.com/apache/spark/commit/163d3443681af2c5ff246ecc546355934c0f6dbb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159021297

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
--- End diff --

Check out the discussion on the JIRA and the linked design doc. Basically, we need the same argument types but different return types from what the current fit() method provides.
[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159021231

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +74,24 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
+        """
+        Fits a model to the input dataset for each param map in params.
+
+        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+        :param params: A list/tuple of param maps.
--- End diff --

Is there another Sequence type this could be, other than list or tuple?
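For context on how such a list or tuple of param maps is typically produced, here is a hedged sketch: `ParamGridBuilder.build()` returns a plain Python list of param maps (dicts keyed by `Param` objects), which is the natural input to the `fitMultiple` method proposed in this PR. The DataFrame `train_df` is an assumed fixture, not something defined in the thread.

```python
# Sketch only: assumes an active SparkSession and a DataFrame `train_df`
# with "features" and "label" columns.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LogisticRegression()
# build() returns a list of param maps (dicts keyed by Param objects).
param_maps = ParamGridBuilder().addGrid(lr.maxIter, [10, 20]).build()

# fitMultiple yields (index, model) pairs, possibly out of params order,
# so collect them back into position by index.
models = [None] * len(param_maps)
for index, model in lr.fitMultiple(train_df, param_maps):
    models[index] = model
```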
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20025

@rxin Could you please review this? Thanks. In my opinion, we can create a new or follow-up PR if a refactor is necessary. This PR fixes the bug that the session timeout checker currently does not work.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20110 Merged build finished. Test PASSed.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20110 **[Test build #85492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85492/testReport)** for PR 20110 at commit [`6b73dd8`](https://github.com/apache/spark/commit/6b73dd8b2d47f8ae218bbab4eeb696684cdac138).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20110 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85492/ Test PASSed.
[GitHub] spark issue #19979: [SPARK-22881][ML][TEST] ML regression package testsuite ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19979

@jkbradley

> When there has been a shuffle, it is likely the Rows will not follow a fixed order.

Agreed. But we can still guarantee a fixed order from the last shuffle position onward in the physical plan's RDD lineage. For models that work like a `map` transformation, the output row order should be exactly the same as the input row order.

> test statistics (such as min/max) on global transformer output

This is also used in some tests, such as the "predictRaw and predictProbability" test case in `DecisionTreeClassifierSuite`.

> For comparing results with expected values, I much prefer for those values to be in a column in the original input dataset.

Agreed.
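As a hedged illustration of that last point (the column names, `df`, and `model` are hypothetical stand-ins for a test DataFrame and any fitted transformer): carrying the expected values as a column of the input dataset makes the assertion independent of output row order.

```python
# Sketch: the "expected" column travels with each row through the
# transformer, so no row-order assumption is needed for the comparison.
rows = model.transform(df).select("prediction", "expected").collect()
for row in rows:
    assert row.prediction == row.expected
```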
[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159020468

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
+        """
+        Fits a model to the input dataset for each param map in params.
+
+        :param dataset: input dataset, which is an instance of :py:class:`pyspark.sql.DataFrame`.
+        :param params: A Sequence of param maps.
+        :return: A thread safe iterable which contains one model for each param map. Each
+                 call to `next(modelIterator)` will return `(index, model)` where model was fit
+                 using `params[index]`. Params maps may be fit in an order different than their
+                 order in params.
+
+        .. note:: DeveloperApi
+        .. note:: Experimental
+        """
+        estimator = self.copy()
+
+        def fitSingleModel(index):
+            return estimator.fit(dataset, params[index])
+
+        return FitMultipleIterator(fitSingleModel, len(params))
--- End diff --

So what's the benefit of `FitMultipleIterator` vs. using `imap_unordered`?
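For reference, here is a minimal sketch of what an iterator along the lines of `FitMultipleIterator` could look like, to contrast with `multiprocessing.pool.imap_unordered`: it needs no pool of its own, so the caller decides how many threads pull from it concurrently, and each `next()` hands back the index together with the model. This is an illustration under those assumptions, not necessarily the exact implementation in the PR.

```python
import threading


class FitMultipleIterator(object):
    """Thread-safe iterator yielding (index, model) pairs.

    fitSingleModel(index) is assumed to fit and return the model for
    params[index]; numModels is len(params).
    """

    def __init__(self, fitSingleModel, numModels):
        self.fitSingleModel = fitSingleModel
        self.numModels = numModels
        self.counter = 0
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only handing out the next index is locked; the (slow) model fit
        # runs outside the lock, so several threads can fit in parallel.
        with self.lock:
            index = self.counter
            if index >= self.numModels:
                raise StopIteration("No models remaining.")
            self.counter += 1
        return index, self.fitSingleModel(index)

    next = __next__  # Python 2 compatibility
```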
[GitHub] spark pull request #20058: [SPARK-22126][ML][PySpark] Pyspark portion of the...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/20058#discussion_r159020312

--- Diff: python/pyspark/ml/base.py ---
@@ -47,6 +86,28 @@ def _fit(self, dataset):
         """
         raise NotImplementedError()

+    @since("2.3.0")
+    def fitMultiple(self, dataset, params):
--- End diff --

So in Scala Spark we use the `fit` function rather than separate functions. Also, the `params` name is different from the Scala one. Any reason for the difference?
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/20025

@liufengdb I think the class `SessionManager.java` was originally merged from Hive, and in Spark we redesigned it by adding `SparkSQLSessionManager.scala`, with no effect on `SessionManager.java`:

    val sparkSqlSessionManager = new SparkSQLSessionManager(hiveServer, sqlContext)
    setSuperField(this, "sessionManager", sparkSqlSessionManager)
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 ah, ok. good catch. I'll fix soon.
[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20107 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85485/ Test PASSed.
[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20107 Merged build finished. Test PASSed.
[GitHub] spark issue #20107: [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20107 **[Test build #85485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85485/testReport)** for PR 20107 at commit [`a335000`](https://github.com/apache/spark/commit/a335000475f71eff0055ccee91e9d486f50288fd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19535#discussion_r159020029

--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
     :param bodyDecoder: A function used to decode body (default is utf8_decoder)
     :return: A DStream object

-    .. note:: Deprecated in 2.3.0
+    .. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0.
+        See SPARK-22142.
     """
+    warnings.warn(
--- End diff --

Sure. I took a quick look, and I think this code path is actually not being tested, which seems to be why it slipped through. I will double-check and take a closer look tonight (KST). I have seen a few mistakes like this so far, and I am working on Python test coverage, BTW - https://issues.apache.org/jira/browse/SPARK-7721. Anyway, it was my stupid mistake. Thanks.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20110 Thank you! Let's also check the build result to make sure `pyspark.streaming.tests.FlumePollingStreamTests` is indeed triggered (I hit this issue while running this test).
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20110 cc @ueshin too.
[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19535#discussion_r159019845

--- Diff: python/pyspark/streaming/flume.py ---
@@ -54,8 +54,13 @@ def createStream(ssc, hostname, port,
     :param bodyDecoder: A function used to decode body (default is utf8_decoder)
     :return: A DStream object

-    .. note:: Deprecated in 2.3.0
+    .. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0.
+        See SPARK-22142.
     """
+    warnings.warn(
--- End diff --

Thank you :) It will be good to also check why the master build does not fail, since Python should complain about this.
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20110 **[Test build #85492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85492/testReport)** for PR 20110 at commit [`6b73dd8`](https://github.com/apache/spark/commit/6b73dd8b2d47f8ae218bbab4eeb696684cdac138).
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20110 cc @yhuai. Thank you for catching this.
[GitHub] spark pull request #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20110

[SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnings namespace in flume.py

## What changes were proposed in this pull request?

This PR explicitly imports the missing `warnings` in `flume.py`.

## How was this patch tested?

Manually tested.

```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
Traceback (most recent call last):
  File "", line 1, in
  File "/.../spark/python/pyspark/streaming/flume.py", line 60, in createStream
    warnings.warn(
NameError: global name 'warnings' is not defined
```

```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
/.../spark/python/pyspark/streaming/flume.py:65: DeprecationWarning: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. See SPARK-22142.
  DeprecationWarning)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22313-followup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20110.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20110

commit 6b73dd8b2d47f8ae218bbab4eeb696684cdac138
Author: hyukjinkwon
Date: 2017-12-29T02:27:15Z

    Explicitly import warnings in flume.py
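A hedged sketch of a regression test that would have caught the missing import — the test name and the `ssc` StreamingContext fixture are hypothetical, and any exception from the unreachable Flume endpoint is deliberately swallowed, since only the warning matters:

```python
import warnings


def test_flume_create_stream_warns(ssc):
    """createStream should emit a DeprecationWarning, not raise NameError."""
    from pyspark.streaming import flume
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            flume.FlumeUtils.createStream(ssc, "localhost", 12345)
        except Exception:
            pass  # connection errors are irrelevant; only the warning matters
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

On the buggy version, the `NameError` fires before the warning is recorded, so `caught` stays empty and the assertion fails as intended.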