[GitHub] spark issue #17525: [SPARK-20209][SS] Execute next trigger immediately if pr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17525 **[Test build #75504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75504/testReport)** for PR 17525 at commit [`50f0195`](https://github.com/apache/spark/commit/50f0195a4eee34db813c9040437de95796c577cc). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17455: [Spark-20044][Web UI] Support Spark UI behind fro...
Github user okoethibm commented on a diff in the pull request: https://github.com/apache/spark/pull/17455#discussion_r109588861

--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -132,7 +132,13 @@ private[deploy] class Master(
     webUi.bind()
     masterWebUiUrl = "http://" + masterPublicAddress + ":" + webUi.boundPort
     if (reverseProxy) {
-      masterWebUiUrl = conf.get("spark.ui.reverseProxyUrl", masterWebUiUrl)
+      conf.getOption("spark.ui.reverseProxyUrl") map { reverseProxyUrl =>
+        val proxyUrlNoSlash = reverseProxyUrl.stripSuffix("/")
+        System.setProperty("spark.ui.proxyBase", proxyUrlNoSlash)
+        // If the master URL has a path component, it must end with a slash.
+        // Otherwise the browser generates incorrect relative links
+        masterWebUiUrl = proxyUrlNoSlash + "/"

--- End diff --

If we have a front-end reverse proxy path like mydomain.com:80/path/to/spark, then the spark.ui.proxyBase property (the prefix for URL generation) *must not* include a trailing slash, given the way it is used in UiUtils, e.g. prependBaseUri("/static/bootstrap.min.css"). However, the explicit URL pointing to the master UI page (e.g. the back-link from workers to the master, which masterWebUiUrl feeds into) *must* include a trailing slash if it has a path component, because the master UI page contains relative links like "app?...". Without a path component, the trailing slash does not matter for resolving these links, but with a path component they must resolve to mydomain.com:80/path/to/spark/app (*not* mydomain.com:80/path/to/app), so the base URL must have a trailing slash. The code is intended to work regardless of whether spark.ui.reverseProxyUrl was specified with or without a trailing slash, so the safe way to ensure a single trailing slash is to first strip an optional slash and then add one. Your suggestion would double the slash if one is specified in the config.

If there's a clean way to move the stripSuffix handling into the config itself, that would make the code prettier, though.
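The slash-normalization rule described above can be sketched illustratively in Python (the function name and shape are mine, not the PR's Scala):

```python
def normalize_reverse_proxy_url(reverse_proxy_url):
    """Return (proxy_base, master_web_ui_url) for a configured reverse
    proxy URL, tolerating an optional trailing slash in the config value.

    proxy_base must NOT end in a slash: it is prepended to absolute paths
    like "/static/bootstrap.min.css". The master UI URL MUST end in exactly
    one slash so relative links like "app?..." resolve under the proxy path
    (mydomain.com/path/to/spark/app, not mydomain.com/path/to/app).
    """
    no_slash = reverse_proxy_url.rstrip("/")  # strip any trailing slash first
    return no_slash, no_slash + "/"
```

Both ".../spark" and ".../spark/" in the config yield the same pair, which is the point of stripping before appending.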
[GitHub] spark issue #17526: [SPARKR][DOC] update doc for fpgrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17526 **[Test build #75503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75503/testReport)** for PR 17526 at commit [`e4e03ea`](https://github.com/apache/spark/commit/e4e03eaf98581da92cdf29d93c602384ad82ad36).
[GitHub] spark pull request #17526: [SPARKR][DOC] update doc for fpgrowth
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/17526

[SPARKR][DOC] update doc for fpgrowth

## What changes were proposed in this pull request?

minor update

@zero323

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rfpgrowthfollowup

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17526.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17526

commit 5fee9e70b0ca31c5a4e55b66f908fa56b205ead5
Author: Felix Cheung
Date: 2017-04-04T06:53:38Z

    update doc
[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...
Github user umehrot2 commented on the issue: https://github.com/apache/spark/pull/17445 Jenkins test this please.
[GitHub] spark pull request #17525: [SPARK-20209][SS] Execute next trigger immediatel...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/17525

[SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval

## What changes were proposed in this pull request?

For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then it will wait for 9 minutes before starting the next batch. This does not make sense. The processing-time-based trigger policy should be to process batches as fast as possible, but no faster than one per trigger interval. If batches already take longer than the trigger interval, there is no point in waiting an extra trigger interval. In this PR, I modified the ProcessingTimeExecutor to do so.

## How was this patch tested?

Added new unit tests to comprehensively test this behavior.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-20209

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17525

commit 50f0195a4eee34db813c9040437de95796c577cc
Author: Tathagata Das
Date: 2017-04-04T06:48:00Z

    Removed delay from trigger executor
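The intended policy can be sketched as a tiny helper (illustrative Python, not the actual ProcessingTimeExecutor code):

```python
def wait_before_next_batch(batch_duration, trigger_interval):
    """How long to sleep before starting the next batch (same time unit for
    both arguments): the remainder of the trigger interval, or nothing at
    all if the previous batch already overran the interval."""
    return max(0, trigger_interval - batch_duration)
```

With a 10-minute interval, an 11-minute batch yields a wait of 0 (start the next batch immediately), while a 4-minute batch yields a 6-minute wait, keeping the rate at no more than one batch per interval.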
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17170
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17170 merged to master. @zero323 could you follow up with vignettes and programming guide update please - we need them for the 2.2.0 release.
[GitHub] spark issue #17524: [SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Test Cases...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17524 **[Test build #75502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75502/testReport)** for PR 17524 at commit [`427741f`](https://github.com/apache/spark/commit/427741f548ff4469d62906546655f7ec96564ced).
[GitHub] spark pull request #17455: [Spark-20044][Web UI] Support Spark UI behind fro...
Github user okoethibm commented on a diff in the pull request: https://github.com/apache/spark/pull/17455#discussion_r109586326

--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala ---
@@ -157,7 +157,9 @@ private[deploy] class ExecutorRunner(
     // Add webUI log urls
     val baseUrl = if (conf.getBoolean("spark.ui.reverseProxy", false)) {
-      s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
+      // TODO get from master?

--- End diff --

Oops, that was a leftover from testing. In fact, the code is simpler when we consistently get the reverse proxy URL from the config along with the reverse proxy flag, requiring both settings to be set consistently on all nodes. I briefly considered a communication extension to send the master (reverse proxy) URL to the executors, but felt it didn't really help.
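For illustration, resolving the log URL base purely from local config might look like the following Python sketch (the config keys follow the discussion; the helper and its direct-link fallback branch are hypothetical):

```python
def executor_log_base_url(conf, worker_id, worker_host, worker_port,
                          app_id, exec_id):
    """Build the executor log URL base. When spark.ui.reverseProxy is
    enabled, route through the proxy path prefixed with the locally
    configured spark.ui.reverseProxyUrl; otherwise link straight to the
    worker's web UI. All names here are illustrative, not Spark's code."""
    if conf.get("spark.ui.reverseProxy") == "true":
        proxy_base = conf.get("spark.ui.reverseProxyUrl", "").rstrip("/")
        prefix = f"{proxy_base}/proxy/{worker_id}"
    else:
        prefix = f"http://{worker_host}:{worker_port}"
    return f"{prefix}/logPage/?appId={app_id}&executorId={exec_id}&logType="
```

Because only the config is consulted, every node must carry the same two reverse-proxy settings, which is the consistency requirement mentioned above.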
[GitHub] spark pull request #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Comma...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17394
[GitHub] spark issue #17524: [SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Test Cases...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17524 **[Test build #75501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75501/testReport)** for PR 17524 at commit [`c102187`](https://github.com/apache/spark/commit/c1021871bdd000e87ff0906af434bceac3129b2b).
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17394 Thanks! Merging to master.
[GitHub] spark issue #17524: [SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Test Cases...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17524 **[Test build #75500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75500/testReport)** for PR 17524 at commit [`8382228`](https://github.com/apache/spark/commit/83822289d790a7ebedf8634df6bbdf9cebeb5057).
[GitHub] spark pull request #17524: [SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Tes...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/17524

[SPARK-19235] [SQL] [TEST] [FOLLOW-UP] Enable Test Cases in DDLSuite with Hive Metastore

### What changes were proposed in this pull request?

This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
- Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
- Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
- Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.

### How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark cleanupDDLSuite

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17524.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17524

commit 83822289d790a7ebedf8634df6bbdf9cebeb5057
Author: Xiao Li
Date: 2017-04-04T06:17:06Z

    fix
[GitHub] spark issue #17480: [SPARK-20079][Core][yarn] Re registration of AM hangs sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17480 **[Test build #75499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75499/testReport)** for PR 17480 at commit [`f54c9ae`](https://github.com/apache/spark/commit/f54c9ae77bfdd3756e120f764aa443500ad6fcf8).
[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17469#discussion_r109575589

--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
     desc = _unary_op("desc", "Returns a sort expression based on the"
                      " descending order of the given column name.")

-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is not null.")
+_isNull_doc = ''' True if the current expression is null. Often combined with
+    :func:`DataFrame.filter` to select rows with null values.
+
+    >>> df2.collect()
+    [Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]
+    >>> df2.filter( df2.height.isNull ).collect()
+    [Row(name=u'Alice', height=None)]
+    '''
+_isNotNull_doc = ''' True if the current expression is null. Often combined with

--- End diff --

^ cc @holdenk
[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/17480#discussion_r109575470

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,9 @@ private[spark] class ExecutorAllocationManager(
    * yarn-client mode when AM re-registers after a failure.
    */
  def reset(): Unit = synchronized {
-    initializing = true
+    if (maxNumExecutorsNeeded() == 0) {

--- End diff --

Done.
[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17469#discussion_r109575023

--- Diff: python/pyspark/sql/column.py ---
@@ -250,11 +250,39 @@ def __iter__(self):
     raise TypeError("Column is not iterable")

 # string methods
+_rlike_doc = """ Return a Boolean :class:`Column` based on a regex match.\n

--- End diff --

Could you maybe give a shot with this patch - https://github.com/map222/spark/compare/patterson-documentation...HyukjinKwon:rlike-docstring.patch ? I double checked it produces

![2017-04-04 1 23 30](https://cloud.githubusercontent.com/assets/6477701/24641412/84765e9c-193a-11e7-85d5-9745ea151c12.png)
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17251 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75498/ Test PASSed.
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17251 Merged build finished. Test PASSed.
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17251 **[Test build #75498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75498/testReport)** for PR 17251 at commit [`2150ce5`](https://github.com/apache/spark/commit/2150ce552a7a02d656329761e04a7fcb38e5e648).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17469 It might be better to run `./dev/lint-python` locally if possible. It will catch more of the minor nits ahead of time.
[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17469#discussion_r109574284

--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
     desc = _unary_op("desc", "Returns a sort expression based on the"
                      " descending order of the given column name.")

-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is not null.")
+_isNull_doc = ''' True if the current expression is null. Often combined with

--- End diff --

I just found a good reference in PEP 8:

> For triple-quoted strings, always use double quote characters to be consistent with the docstring convention in PEP 257

https://www.python.org/dev/peps/pep-0008/#string-quotes
[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17469#discussion_r109574278

--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
     desc = _unary_op("desc", "Returns a sort expression based on the"
                      " descending order of the given column name.")

-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is not null.")
+_isNull_doc = ''' True if the current expression is null. Often combined with
+    :func:`DataFrame.filter` to select rows with null values.
+
+    >>> df2.collect()
+    [Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]
+    >>> df2.filter( df2.height.isNull ).collect()
+    [Row(name=u'Alice', height=None)]
+    '''
+_isNotNull_doc = ''' True if the current expression is null. Often combined with

--- End diff --

Up to my knowledge, both docstring styles comply with PEP 8:

```
"""
...
"""
```

or

```
"""
...
"""
```

but for this case, it seems a separate variable. Personally, I prefer

```python
_isNull_doc = """ True if the current expression is null. Often combined with
    :func:`DataFrame.filter` to select rows with null values.

    >>> df2.collect()
    [Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]
    >>> df2.filter( df2.height.isNull ).collect()
    [Row(name=u'Alice', height=None)]
    """
```

but I could not find a formal reference to support this idea (in the case that it is a separate variable) and I am not supposed to decide this. So, I am fine.
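The pattern under discussion, keeping a docstring in a module-level variable and attaching it to the member, can be sketched like this (plain Python with made-up names, not PySpark's `_unary_op` machinery), using the double-quoted style PEP 257 recommends:

```python
# Hypothetical module-level docstring variable, shared between the function
# definition and any generated wrappers.
_is_null_doc = """True if the value is None. Often combined with a filter
    to select the null entries.

    >>> is_null(None)
    True
    >>> is_null(3)
    False
    """

def is_null(value):
    return value is None

# Attach the shared docstring so help() and doctest can see it.
is_null.__doc__ = _is_null_doc
```

Assigning `__doc__` after the definition keeps the docstring text reusable, which is the main reason to hold it in a separate variable at all.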
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17394 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75497/ Test PASSed.
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17394 Merged build finished. Test PASSed.
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17394 **[Test build #75497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75497/testReport)** for PR 17394 at commit [`862a4d7`](https://github.com/apache/spark/commit/862a4d7a61e48ff7b0e1d52ea0416bc57a4d6a33).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17459 @johnc1231 The prototype I did: https://github.com/apache/spark/compare/master...viirya:general-toblockmatrix?expand=1 Maybe you can take a look and see if it is useful to you.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17494 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75496/ Test PASSed.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17494 Merged build finished. Test PASSed.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17494 **[Test build #75496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75496/testReport)** for PR 17494 at commit [`fbcc1fe`](https://github.com/apache/spark/commit/fbcc1fe1c8e2652dc54c2ebfacce01a3f69449a2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17505: [SPARK-20187][SQL] Replace loadTable with moveFil...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17505#discussion_r109567793
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -694,12 +694,25 @@ private[hive] class HiveClientImpl(
       tableName: String,
       replace: Boolean,
       isSrcLocal: Boolean): Unit = withHiveState {
-    shim.loadTable(
-      client,
-      new Path(loadPath),
-      tableName,
-      replace,
-      isSrcLocal)
+    val tbl = client.getTable(tableName)
+    val fs = tbl.getDataLocation.getFileSystem(conf)
+    if (replace) {
--- End diff --
[`loadTable` calls `replaceFiles` when `replace` is true](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1640), instead of calling `Hive.copyFiles`. `replaceFiles` is based on calls to `moveFile`.
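The replace-vs-copy distinction discussed above can be illustrated with a small sketch. This is plain Python, not Hive's or Spark's actual API: function and argument names are illustrative assumptions. With `replace=True` the destination's existing files are dropped first (analogous to the `replaceFiles` path); otherwise new files are added alongside (analogous to `Hive.copyFiles`).

```python
import os
import shutil

def load_files(src_dir, dest_dir, replace):
    """Hypothetical sketch of load-table semantics (names are illustrative,
    not Hive's API): 'replace' swaps out the destination's contents, while a
    plain load only adds files alongside what is already there."""
    os.makedirs(dest_dir, exist_ok=True)
    if replace:
        # replaceFiles-like path: clear the existing destination files first
        for name in os.listdir(dest_dir):
            os.remove(os.path.join(dest_dir, name))
    # moveFile/copyFiles-like path: bring the new files in
    for name in os.listdir(src_dir):
        shutil.copy(os.path.join(src_dir, name), os.path.join(dest_dir, name))
```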
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Merged build finished. Test PASSed.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75494/ Test PASSed.
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17394 LGTM
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520 **[Test build #75494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75494/testReport)** for PR 17520 at commit [`0bab4fd`](https://github.com/apache/spark/commit/0bab4fd335279accca5e90ed4ecdb1d7ea99383e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...
Github user johnc1231 commented on the issue: https://github.com/apache/spark/pull/17459 Alright, I agree with this. We'll switch between Dense and Sparse matrix backings based on the type of the first vector in the iterator. I'd be happy to take on making these adjustments.
[GitHub] spark issue #17506: [SPARK-20189][DStream] Fix spark kinesis testcases to re...
Github user yssharma commented on the issue: https://github.com/apache/spark/pull/17506 The Scala style check probably fails because of the double-spaced lines. But that's how the existing code was, so I thought of keeping it that way.
[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17459
> I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, creating a mix of Dense and Sparse matrices in a BlockMatrix, but I still think it's best to be explicit. Also, we currently have the description of toBlockMatrix promising to make a BlockMatrix backed by instances of SparseMatrix, so we have made promises to users about the composition of the BlockMatrix before.

I don't mean we don't care about it. I meant there is no guarantee that a `BlockMatrix` is composed purely of `DenseMatrix` or purely of `SparseMatrix`. It could be a mix of them. Thus, we can have a `toBlockMatrix` which creates a `BlockMatrix` that is a mix of `DenseMatrix` and `SparseMatrix`. A block in a `BlockMatrix` can be a `DenseMatrix` or a `SparseMatrix`, depending on the ratio of values in the block. Yes, it is like the "decent approximation" you talked about. For an `IndexedRowMatrix` composed entirely of `DenseVector`s, this `toBlockMatrix` definitely returns a `BlockMatrix` backed by `DenseMatrix`. For other cases, `DenseMatrix` might not be the best choice for all blocks in the `BlockMatrix`, as many blocks will be sparse. About the promise that `toBlockMatrix` makes a `BlockMatrix` backed by instances of `SparseMatrix`: as I said, it is not explicitly bound at the API level. I think it is not a big problem.
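The per-block choice viirya describes — pick each block's representation from how sparse that particular block is — can be sketched in a few lines. This is plain Python for illustration; the 0.5 threshold and function name are assumptions, not Spark's actual heuristic.

```python
def choose_block_backing(block, dense_threshold=0.5):
    """Pick a backing for one block from its nonzero ratio.

    `block` is a list of rows (lists of floats). The 0.5 cutoff is an
    illustrative assumption, not Spark's heuristic: mostly-nonzero blocks
    get a dense backing, mostly-zero blocks a sparse one.
    """
    total = len(block) * len(block[0])
    nnz = sum(1 for row in block for v in row if v != 0.0)
    return "dense" if nnz / total > dense_threshold else "sparse"
```

A `BlockMatrix` built this way would mix representations: a block of random doubles comes out dense, while a block with a single nonzero comes out sparse.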
[GitHub] spark issue #17467: [SPARK-20140][DStream] Remove hardcoded kinesis retry wa...
Github user yssharma commented on the issue: https://github.com/apache/spark/pull/17467 @srowen - Could I get some love here as well. Thanks
[GitHub] spark issue #15332: [SPARK-10364][SQL] Support Parquet logical type TIMESTAM...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/15332 Thanks a lot @ueshin @viirya @gatorsmile
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17251 **[Test build #75498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75498/testReport)** for PR 17251 at commit [`2150ce5`](https://github.com/apache/spark/commit/2150ce552a7a02d656329761e04a7fcb38e5e648).
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17251 Retest this please
[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...
Github user johnc1231 commented on the issue: https://github.com/apache/spark/pull/17459 @viirya I think we definitely care about giving users the ability to make either dense or sparse Block matrices. I made a 100k by 10k IndexedRowMatrix of random doubles, then converted it to a BlockMatrix to multiply it by its transpose. With the current toBlockMatrix implementation, that took 252 seconds on 128 cores. With my implementation, that took 35 seconds. The backing of a BlockMatrix matters a lot, and we need to let users be explicit about it. I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, but I still think it's best to be explicit.
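The "decide from the first vector" approximation mentioned above can be sketched as follows. This is plain Python for illustration: modeling a dense vector as a list and a sparse vector as an index-to-value dict is an assumption, as is the function name.

```python
def backing_from_rows(rows):
    """Pick a single matrix backing for a whole partition by inspecting only
    the first row vector in the iterator. Dense vectors are modeled as lists,
    sparse vectors as index->value dicts (illustrative assumption)."""
    first = next(iter(rows))
    return "dense" if isinstance(first, list) else "sparse"
```

As the discussion notes, this is only a heuristic: a mixed collection of dense and sparse rows is classified entirely by whichever kind happens to come first.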
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17494 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75495/ Test PASSed.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17494 Merged build finished. Test PASSed.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17494 **[Test build #75495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75495/testReport)** for PR 17494 at commit [`8936880`](https://github.com/apache/spark/commit/8936880bafd8a8520011e663c0edc3b428b9160f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17523: [SPARK-20064][PySpark]
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17523 (it would be nicer if the title is fixed to indicate what it proposes in short)
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user Downchuck commented on the issue: https://github.com/apache/spark/pull/16347 Is there anyone on the Spark team taking this up? This bug is painful; it's stranded a hundred TB of data I've stacked up, and I'm really trying to avoid more manual work. "INSERT OVERWRITE TABLE ... DISTRIBUTE BY ... SORT BY" is how I live my life these days.
[GitHub] spark pull request #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Comma...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17394#discussion_r109558929
--- Diff: sql/core/src/test/resources/sql-tests/results/describe.sql.out ---
@@ -1,205 +1,259 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 14
+-- Number of queries: 31


 -- !query 0
-CREATE TABLE t (a STRING, b INT, c STRING, d STRING) USING parquet PARTITIONED BY (c, d) COMMENT 'table_comment'
+CREATE TABLE t (a STRING, b INT, c STRING, d STRING) USING parquet
+  PARTITIONED BY (c, d) CLUSTERED BY (a) SORTED BY (b ASC) INTO 2 BUCKETS
+  COMMENT 'table_comment'
 -- !query 0 schema
 struct<>
 -- !query 0 output


 -- !query 1
-ALTER TABLE t ADD PARTITION (c='Us', d=1)
+CREATE TEMPORARY VIEW temp_v AS SELECT * FROM t
 -- !query 1 schema
 struct<>
 -- !query 1 output


 -- !query 2
-DESCRIBE t
+CREATE TEMPORARY VIEW temp_Data_Source_View
+  USING org.apache.spark.sql.sources.DDLScanSource
+  OPTIONS (
+    From '1',
+    To '10',
+    Table 'test1')
 -- !query 2 schema
-struct
+struct<>
 -- !query 2 output
-# Partition Information
+
+
+-- !query 3
+CREATE VIEW v AS SELECT * FROM t
+-- !query 3 schema
+struct<>
+-- !query 3 output
+
+
+-- !query 4
+ALTER TABLE t ADD PARTITION (c='Us', d=1)
+-- !query 4 schema
+struct<>
+-- !query 4 output
+
+
+-- !query 5
+DESCRIBE t
+-- !query 5 schema
+struct
+-- !query 5 output
 # col_name    data_type    comment
 a             string
 b             int
 c             string
-c             string
 d             string
+# Partition Information
+# col_name    data_type    comment
+c             string
 d             string


--- !query 3
-DESC t
--- !query 3 schema
+-- !query 6
+DESC default.t
+-- !query 6 schema
 struct
--- !query 3 output
-# Partition Information
+-- !query 6 output
 # col_name    data_type    comment
 a             string
 b             int
 c             string
-c             string
 d             string
+# Partition Information
+# col_name    data_type    comment
+c             string
 d             string


--- !query 4
+-- !query 7
 DESC TABLE t
--- !query 4 schema
+-- !query 7 schema
 struct
--- !query 4 output
-# Partition Information
+-- !query 7 output
 # col_name    data_type    comment
 a             string
 b             int
 c             string
-c             string
 d             string
+# Partition Information
+# col_name    data_type    comment
+c             string
 d             string


--- !query 5
+-- !query 8
 DESC FORMATTED t
--- !query 5 schema
+-- !query 8 schema
 struct
--- !query 5 output
-# Detailed Table Information
-# Partition Information
-# Storage Information
+-- !query 8 output
 # col_name    data_type    comment
-Comment: table_comment
[GitHub] spark issue #17394: [SPARK-20067] [SQL] Unify and Clean Up Desc Commands Usi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17394 **[Test build #75497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75497/testReport)** for PR 17394 at commit [`862a4d7`](https://github.com/apache/spark/commit/862a4d7a61e48ff7b0e1d52ea0416bc57a4d6a33).
[GitHub] spark pull request #15332: [SPARK-10364][SQL] Support Parquet logical type T...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15332
[GitHub] spark issue #15332: [SPARK-10364][SQL] Support Parquet logical type TIMESTAM...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/15332 Thanks! Merging to master.
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17494 **[Test build #75496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75496/testReport)** for PR 17494 at commit [`fbcc1fe`](https://github.com/apache/spark/commit/fbcc1fe1c8e2652dc54c2ebfacce01a3f69449a2).
[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17494 **[Test build #75495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75495/testReport)** for PR 17494 at commit [`8936880`](https://github.com/apache/spark/commit/8936880bafd8a8520011e663c0edc3b428b9160f).
[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17494#discussion_r109557018
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
@@ -56,7 +56,7 @@ object Correlation {
  * Here is how to access the correlation coefficient:
  * {{{
  *   val data: Dataset[Vector] = ...
- *   val Row(coeff: Matrix) = Statistics.corr(data, "value").head
+ *   val Row(coeff: Matrix) = Correlation.corr(data, "value").head
  *   // coeff now contains the Pearson correlation matrix.
  * }}}
  *
--- End diff --
oh, right. fixed. :-)
[GitHub] spark issue #17512: [SPARK-20196][PYTHON][SQL] update doc for catalog functi...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17512 will update after #17518 + changes to R doc too
[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17494#discussion_r109556837
--- Diff: python/pyspark/ml/stat.py ---
@@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
         return _java2py(sc, javaTestObj.test(*args))


+class Correlation(object):
+    """
+    .. note:: Experimental
+
+    Compute the correlation matrix for the input dataset of Vectors using the specified method.
+    Methods currently supported: `pearson` (default), `spearman`.
--- End diff --
Sounds good. Fixed.
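For reference, the per-pair quantity that the `pearson` method of `Correlation.corr` computes is the standard Pearson coefficient. A minimal pure-Python sketch (no Spark required; the function name is ours, not PySpark's):

```python
import math

def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length
    numeric sequences: covariance divided by the product of the
    standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly linearly related sequences give +1 or -1; the `spearman` method applies the same formula to the ranks of the values instead of the values themselves.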
[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16906 +1 on that, we do have the log on the R side.
[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17459 I've done some prototyping locally to generalize this change to `SparseMatrix`. During that, I had a thought: do we have the restriction that all matrices in a `BlockMatrix` need to be the same kind of Matrix (i.e., `DenseMatrix` or `SparseMatrix`)? Actually we can easily have only one `toBlockMatrix` method which creates a `BlockMatrix` including both `DenseMatrix` and `SparseMatrix` blocks, depending on whether each block is sparse or not. From the external view of this API, we don't have an explicit difference between `SparseMatrix`-backed and `DenseMatrix`-backed `BlockMatrix`s. We don't have subclasses for it, nor any property that can be used to know about it. Doesn't that mean we don't really care about it?
[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17415
[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17415 Thanks, Merging to master.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520 **[Test build #75494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75494/testReport)** for PR 17520 at commit [`0bab4fd`](https://github.com/apache/spark/commit/0bab4fd335279accca5e90ed4ecdb1d7ea99383e).
[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17415#discussion_r109554814
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -550,6 +565,220 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
     Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression containing two columns.
+   * In SQL queries, we also see predicate expressions involving two columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator, including =, <=>, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update ColumnStat of the given columns
+   *               for subsequent conditions
+   * @return an optional double value to show the percentage of rows meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+      op: BinaryComparison,
+      attrLeft: Attribute,
+      attrRight: Attribute,
+      update: Boolean): Option[Double] = {
+
+    if (!colStatsMap.contains(attrLeft)) {
+      logDebug("[CBO] No statistics for " + attrLeft)
+      return None
+    }
+    if (!colStatsMap.contains(attrRight)) {
+      logDebug("[CBO] No statistics for " + attrRight)
+      return None
+    }
+
+    attrLeft.dataType match {
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attrLeft)
+        return None
+      case _ =>
+    }
+
+    val colStatLeft = colStatsMap(attrLeft)
+    val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, attrLeft.dataType)
+      .asInstanceOf[NumericRange]
+    val maxLeft = BigDecimal(statsRangeLeft.max)
+    val minLeft = BigDecimal(statsRangeLeft.min)
+
+    val colStatRight = colStatsMap(attrRight)
+    val statsRangeRight = Range(colStatRight.min, colStatRight.max, attrRight.dataType)
+      .asInstanceOf[NumericRange]
+    val maxRight = BigDecimal(statsRangeRight.max)
+    val minRight = BigDecimal(statsRangeRight.min)
+
+    // determine the overlapping degree between predicate range and column's range
+    val allNotNull = (colStatLeft.nullCount == 0) && (colStatRight.nullCount == 0)
+    val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+      // Left < Right or Left <= Right
+      // - no overlap:
+      //      minRight    maxRight    minLeft    maxLeft
+      // --------+-----------+-----------+----------+------->
+      // - complete overlap: (If null values exists, we set it to partial overlap.)
+      //      minLeft     maxLeft     minRight   maxRight
+      // --------+-----------+-----------+----------+------->
+      case _: LessThan =>
+        (minLeft >= maxRight, (maxLeft < minRight) && allNotNull)
+      case _: LessThanOrEqual =>
+        (minLeft > maxRight, (maxLeft <= minRight) && allNotNull)
+
+      // Left > Right or Left >= Right
+      // - no overlap:
+      //      minLeft     maxLeft     minRight   maxRight
+      // --------+-----------+-----------+----------+------->
+      // - complete overlap: (If null values exists, we set it to partial overlap.)
+      //      minRight    maxRight    minLeft    maxLeft
+      // --------+-----------+-----------+----------+------->
+      case _: GreaterThan =>
+        (maxLeft <= minRight, (minLeft > maxRight) && allNotNull)
+      case _: GreaterThanOrEqual =>
+        (maxLeft < minRight, (minLeft >= maxRight) && allNotNull)
+
+      // Left = Right or Left <=> Right
+      // - no overlap:
+      //      minLeft     maxLeft     minRight   maxRight
+      // --------+-----------+-----------+----------+------->
+      //      minRight    maxRight    minLeft    maxLeft
+      // --------+-----------+-----------+----------+------->
+      // - complete overlap:
+      //      minLeft   maxLeft
+      //      minRight  maxRight
+      // --------+----------+------->
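The case analysis quoted in the diff above can be sketched outside Spark. Below is a minimal Python sketch (the function name and the 0.0/0.5/1.0 selectivities are illustrative choices, not Spark's API) of how the `Left < Right` ranges classify into no/complete/partial overlap:

```python
def overlap_for_less_than(min_l, max_l, min_r, max_r, all_not_null=True):
    """Classify range overlap for the predicate `left < right`.

    Mirrors the case analysis in the quoted diff: if the smallest left value
    is already >= the largest right value, the predicate never holds; if the
    largest left value is < the smallest right value (and neither column has
    nulls), it always holds; otherwise the overlap is partial.
    """
    no_overlap = min_l >= max_r
    complete_overlap = max_l < min_r and all_not_null
    if no_overlap:
        return 0.0   # selectivity: no rows can match
    if complete_overlap:
        return 1.0   # selectivity: all rows match
    return 0.5       # partial overlap: fall back to a default selectivity

print(overlap_for_less_than(1, 10, 20, 30))   # 1.0: every left value < every right value
print(overlap_for_less_than(20, 30, 1, 10))   # 0.0: no left value can be < a right value
```

Note that when nulls may be present the sketch degrades complete overlap to partial, matching the "If null values exists, we set it to partial overlap" comment in the diff.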
[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17415 LGTM
[GitHub] spark pull request #17487: [Spark-20145] Fix range case insensitive bug in S...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17487
[GitHub] spark issue #17487: [Spark-20145] Fix range case insensitive bug in SQL
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17487 Merging in master.
[GitHub] spark pull request #17505: [SPARK-20187][SQL] Replace loadTable with moveFil...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17505#discussion_r109553390
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
@@ -242,6 +251,16 @@ private[client] class Shim_v0_12 extends Shim with Logging {
       JInteger.TYPE,
       JBoolean.TYPE,
       JBoolean.TYPE)
 
+  private lazy val moveFileMethod =
+    findMethod(
+      classOf[Hive],
+      "moveFile",
--- End diff --

does this exist in all the versions Spark supports?
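rxin's question concerns `findMethod`, which resolves a Hive method reflectively so one shim can cope with API differences across Hive versions. A hedged Python sketch of the same feature-detection idea (the class and method names below are hypothetical, not Spark's `HiveShim`):

```python
def find_method(obj, name, fallback_names=()):
    """Return the first callable found among `name` and `fallback_names`.

    Sketch of version shimming: try the newer API name first, older names
    as fallbacks, and raise a clear error if no supported variant exists.
    """
    for candidate in (name, *fallback_names):
        method = getattr(obj, candidate, None)
        if callable(method):
            return method
    raise AttributeError(
        f"none of {(name, *fallback_names)} found on {type(obj).__name__}")

class OldHiveClient:
    # Pretend an older client only exposes renameFile, not moveFile.
    def renameFile(self, src, dst):
        return f"moved {src} -> {dst}"

mover = find_method(OldHiveClient(), "moveFile", fallback_names=("renameFile",))
print(mover("/tmp/a", "/tmp/b"))  # moved /tmp/a -> /tmp/b
```

The shim approach keeps version-specific lookups in one place, which is exactly why a reviewer asks whether the looked-up method exists across all supported versions.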
[GitHub] spark pull request #17112: [WIP] Measurement for SPARK-16929.
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/17112
[GitHub] spark pull request #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17336#discussion_r109548396
--- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala ---
@@ -85,38 +85,58 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
     assert(prediction.select("prediction").where("id=3").first().getSeq[String](0).isEmpty)
   }
 
+  test("FPGrowth prediction should not contain duplicates") {
+    // This should generate rule 1 -> 3, 2 -> 3
+    val dataset = spark.createDataFrame(Seq(
+      Array("1", "3"),
+      Array("2", "3")
+    ).map(Tuple1(_))).toDF("items")
+    val model = new FPGrowth().fit(dataset)
+
+    val prediction = model.transform(
+      spark.createDataFrame(Seq(Tuple1(Array("1", "2")))).toDF("items")
+    ).first().getAs[Seq[String]]("prediction")
+
+    assert(prediction === Seq("3"))
+  }
+
+  test("FPGrowthModel setMinConfidence should affect rules generation and transform") {
+    val model = new FPGrowth().setMinSupport(0.1).setMinConfidence(0.1).fit(dataset)
+    val oldRulesNum = model.associationRules.count()
+    val oldPredict = model.transform(dataset)
+
+    model.setMinConfidence(0.8765)
+    assert(oldRulesNum > model.associationRules.count())
+    assert(!model.transform(dataset).collect().toSet.equals(oldPredict.collect().toSet))
+
+    // association rules should stay the same for same minConfidence
+    model.setMinConfidence(0.1)
+    assert(oldRulesNum === model.associationRules.count())
+    assert(model.transform(dataset).collect().toSet.equals(oldPredict.collect().toSet))
+  }
+
   test("FPGrowth parameter check") {
     val fpGrowth = new FPGrowth().setMinSupport(0.4567)
     val model = fpGrowth.fit(dataset)
       .setMinConfidence(0.5678)
     assert(fpGrowth.getMinSupport === 0.4567)
     assert(model.getMinConfidence === 0.5678)
+    MLTestingUtils.checkCopy(model)
   }
 
   test("read/write") {
     def checkModelData(model: FPGrowthModel, model2: FPGrowthModel): Unit = {
-      assert(model.freqItemsets.sort("items").collect() ===
-        model2.freqItemsets.sort("items").collect())
+      assert(model.freqItemsets.collect().toSet.equals(
+        model2.freqItemsets.collect().toSet))
+      assert(model.associationRules.collect().toSet.equals(
+        model2.associationRules.collect().toSet))
+      assert(model.setMinConfidence(0.9).associationRules.collect().toSet.equals(
+        model2.setMinConfidence(0.9).associationRules.collect().toSet))
     }
     val fPGrowth = new FPGrowth()
     testEstimatorAndModelReadWrite(fPGrowth, dataset, FPGrowthSuite.allParamSettings,
       FPGrowthSuite.allParamSettings, checkModelData)
   }
-
-  test("FPGrowth prediction should not contain duplicates") {
--- End diff --

For the future, I'd prefer not to move stuff around unless it's necessary since it makes the diff larger. No need to revert this, though, since I already checked it.
[GitHub] spark pull request #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17336#discussion_r109548283
--- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala ---
@@ -85,38 +85,58 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
     assert(prediction.select("prediction").where("id=3").first().getSeq[String](0).isEmpty)
   }
 
+  test("FPGrowth prediction should not contain duplicates") {
+    // This should generate rule 1 -> 3, 2 -> 3
+    val dataset = spark.createDataFrame(Seq(
+      Array("1", "3"),
+      Array("2", "3")
+    ).map(Tuple1(_))).toDF("items")
+    val model = new FPGrowth().fit(dataset)
+
+    val prediction = model.transform(
+      spark.createDataFrame(Seq(Tuple1(Array("1", "2")))).toDF("items")
+    ).first().getAs[Seq[String]]("prediction")
+
+    assert(prediction === Seq("3"))
+  }
+
+  test("FPGrowthModel setMinConfidence should affect rules generation and transform") {
+    val model = new FPGrowth().setMinSupport(0.1).setMinConfidence(0.1).fit(dataset)
+    val oldRulesNum = model.associationRules.count()
+    val oldPredict = model.transform(dataset)
+
+    model.setMinConfidence(0.8765)
+    assert(oldRulesNum > model.associationRules.count())
+    assert(!model.transform(dataset).collect().toSet.equals(oldPredict.collect().toSet))
+
+    // association rules should stay the same for same minConfidence
+    model.setMinConfidence(0.1)
+    assert(oldRulesNum === model.associationRules.count())
+    assert(model.transform(dataset).collect().toSet.equals(oldPredict.collect().toSet))
+  }
+
   test("FPGrowth parameter check") {
     val fpGrowth = new FPGrowth().setMinSupport(0.4567)
     val model = fpGrowth.fit(dataset)
       .setMinConfidence(0.5678)
     assert(fpGrowth.getMinSupport === 0.4567)
     assert(model.getMinConfidence === 0.5678)
+    MLTestingUtils.checkCopy(model)
   }
 
   test("read/write") {
     def checkModelData(model: FPGrowthModel, model2: FPGrowthModel): Unit = {
-      assert(model.freqItemsets.sort("items").collect() ===
-        model2.freqItemsets.sort("items").collect())
+      assert(model.freqItemsets.collect().toSet.equals(
+        model2.freqItemsets.collect().toSet))
+      assert(model.associationRules.collect().toSet.equals(
--- End diff --

No need to add these 2 since they are values computed from the model data. Checking freqItemsets is sufficient.
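jkbradley's point that checking freqItemsets suffices rests on association rules being fully derived from the itemset supports: the confidence of a rule `A -> B` is support(A ∪ B) / support(A), so equal itemsets imply equal rules. A small illustrative Python sketch of that derivation (not Spark ML's implementation):

```python
from itertools import combinations

def association_rules(freq, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from itemset supports.

    `freq` maps frozenset itemsets to their support. For each itemset of
    size >= 2, every proper subset is tried as an antecedent and kept if
    its confidence = support(itemset) / support(antecedent) is high enough.
    """
    rules = []
    for itemset, support in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / freq[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Mirrors the quoted test's data: rule 1 -> 3 has confidence 1.0, rule 3 -> 1 only 0.5.
freq = {frozenset({"1"}): 0.5, frozenset({"3"}): 1.0, frozenset({"1", "3"}): 0.5}
print(association_rules(freq, 0.8))
```

Raising `min_confidence` can only shrink the rule list, which is what the quoted `setMinConfidence(0.8765)` assertions check.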
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/15821 Thanks for the review @viirya, I'm working on an update but want to be sure the python tests for arrow get run before I push.
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r109547685
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2747,6 +2747,17 @@ class Dataset[T] private[sql](
     }
   }
 
+  /**
+   * Collect a Dataset as ArrowPayload byte arrays and serve to PySpark.
+   */
+  private[sql] def collectAsArrowToPython(): Int = {
+    val payloadRdd = toArrowPayloadBytes()
+    val payloadByteArrays = payloadRdd.collect()
--- End diff --

The conversion going on in `table.to_pandas()` is working on an already loaded table, but the Arrow Readers can read multiple batches of data and output a single table. The issue is that the pyspark serializers expect the data to be "framed" with the length, so I cannot send that directly to the Arrow Reader. Even with `toLocalIteratorAndServer` I would have to read each batch of data on the driver, then combine. It would be possible to rewrite the "framed" stream into another stream without the lengths, where it can then be read into a single table - but I'm not sure if that added complexity is worth it.
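The "framing" described above is a length prefix written before each batch so the receiving side knows where one payload ends and the next begins. A minimal Python sketch of that idea (illustrative only, not pyspark's serializer or Arrow's actual stream format):

```python
import struct
from io import BytesIO

def write_framed(stream, payloads):
    """Length-prefix each payload with a 4-byte big-endian int, so a reader
    can split the stream back into the original batches."""
    for payload in payloads:
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)

def read_framed(stream):
    """Yield payloads back out of a length-prefixed stream until EOF."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">i", header)
        yield stream.read(length)

buf = BytesIO()
write_framed(buf, [b"batch-1", b"batch-22"])
buf.seek(0)
print(list(read_framed(buf)))  # [b'batch-1', b'batch-22']
```

The tension in the comment is that an Arrow reader expecting a continuous stream cannot consume such framed data directly; stripping the length prefixes means rewriting the stream, which is the added complexity being weighed.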
[GitHub] spark issue #17499: [SPARK-20161][CORE] Default log4j properties file should...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17499 Maybe Hive can do it in Hive.
[GitHub] spark issue #17521: [SPARK-20204][SQL] separate SQLConf into catalyst confs ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17521 To be clear, I don't think we should have two separate places to define config entries. If this is what the pr is doing, I strongly veto.
[GitHub] spark issue #17522: [SPARK-18278] [Scheduler] Documentation to point to Kube...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/17522 @mridulm, I understand your concern here. This is however an effort from the Kubernetes community (https://github.com/kubernetes/kubernetes/issues/34377), so the eventuality of a different parallel effort is unlikely. @rxin thanks for reviewing. I've updated the wording as @markhamstra just suggested. Do we want more clarification about the level of commitment, or does this look ok?
[GitHub] spark issue #17522: [SPARK-18278] [Scheduler] Documentation to point to Kube...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17522 Seems fine to me, since the number of external resource managers are small. We should definitely make it clear there is no firm commitment currently to merge this into Spark though.
[GitHub] spark issue #17521: [SPARK-20204][SQL] separate SQLConf into catalyst confs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17521 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75492/ Test FAILed.
[GitHub] spark issue #17521: [SPARK-20204][SQL] separate SQLConf into catalyst confs ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17521 Merged build finished. Test FAILed.
[GitHub] spark issue #17521: [SPARK-20204][SQL] separate SQLConf into catalyst confs ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17521 **[Test build #75492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75492/testReport)** for PR 17521 at commit [`32aaf63`](https://github.com/apache/spark/commit/32aaf6390f7897cb2b109341d62280fbe08c9336).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16906 I think this looks reasonable, although it would maybe make sense to add a warning if the user has explicitly requested hive support and we are falling through to non-hive support (e.g. in the except side of the try block).
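The warning holdenk suggests would live in the `except` branch of the fallback. A hedged Python sketch of that shape (the session-building logic is simulated with plain strings here, not pyspark's actual code):

```python
import warnings

def build_session(enable_hive):
    """Sketch of the suggested fallback: if Hive support was explicitly
    requested but cannot be initialized, warn instead of failing silently,
    then return a session without Hive support."""
    try:
        if enable_hive:
            # Simulate Hive initialization failing (e.g. classes missing).
            raise ImportError("Hive classes not found")
        return "session-without-hive"
    except ImportError as err:
        warnings.warn(
            f"Hive support was requested but is unavailable ({err}); "
            "falling back to a session without Hive support.")
        return "session-without-hive"
```

In real code the `try` block would attempt the Hive-enabled session; the point is only that the silent fall-through gains a visible warning.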
[GitHub] spark issue #17375: [SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collecti...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17375 Anaconda default to 3.6 definitely makes this make more sense, thanks @zero323 I had forgotten that. I'll give @davies until next week to say anything about this but otherwise I think the set of backports for this issue make sense.
[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/17494#discussion_r109538706
--- Diff: python/pyspark/ml/stat.py ---
@@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
         return _java2py(sc, javaTestObj.test(*args))
 
+class Correlation(object):
+    """
+    .. note:: Experimental
+
+    Compute the correlation matrix for the input dataset of Vectors using the specified method.
+    Methods currently supported: `pearson` (default), `spearman`.
--- End diff --

So the Scala documentation had a warning about caching being suggested when using Spearman, would it make sense to copy this warning over as well?
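For context on the Spearman method mentioned above: Spearman correlation is the Pearson correlation of the ranks, which is why ranking requires an extra pass over the data (and why caching the input is suggested). An illustrative pure-Python sketch, ignoring tied values:

```python
def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks.

    Illustrative only: ties are not averaged, and no caching concerns
    apply since the data is an in-memory list.
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotone (but non-linear) relationship gives Spearman correlation 1.0.
print(spearman([1, 2, 3, 4], [10, 100, 1000, 10000]))  # 1.0
```

In a distributed setting, the rank computation re-reads the whole dataset, hence the caching warning in the Scala docs being discussed.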
[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/17494#discussion_r109538556
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
@@ -56,7 +56,7 @@ object Correlation {
    * Here is how to access the correlation coefficient:
    * {{{
    *   val data: Dataset[Vector] = ...
-   *   val Row(coeff: Matrix) = Statistics.corr(data, "value").head
+   *   val Row(coeff: Matrix) = Correlation.corr(data, "value").head
    *   // coeff now contains the Pearson correlation matrix.
    * }}}
    *
--- End diff --

Also since we are here as well, there is a reference to input RDD up above in the docstring.
[GitHub] spark issue #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace improvement...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16793 Let me try and take a look tonight. It seems like there are some small formatting issues still at a quick glance but I feel like this should be close.
[GitHub] spark issue #17328: [SPARK-19975][Python][SQL] Add map_keys and map_values f...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17328 jenkins, ok to test. Does someone on the SQL side have a chance to look at this to say if it's something they want added to the DataFrame API? Maybe @marmbrus ? I'm a little hesitant with adding it to functions in this way since the `map_values` has a different meaning than `mapValues` in RDD land and it seems like that could cause some confusion.
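The naming concern is concrete: SQL's `map_values` *extracts* the values of a map column, while RDD `mapValues` *transforms* the values of a pair RDD. A plain-Python illustration of the two meanings (no Spark involved; the variables just mimic the shapes):

```python
# A row with a map-typed column "m", as SQL map_keys/map_values would see it.
row = {"m": {"a": 1, "b": 2}}

map_keys = sorted(row["m"].keys())      # analogous to SQL map_keys  -> ['a', 'b']
map_values = sorted(row["m"].values())  # analogous to SQL map_values -> [1, 2]

# A pair RDD, as RDD mapValues would see it: values are TRANSFORMED, not extracted.
pairs = [("a", 1), ("b", 2)]
rdd_map_values = [(k, v * 10) for k, v in pairs]  # mapValues(lambda v: v * 10)

print(map_keys, map_values, rdd_map_values)
```

Same name fragment, two unrelated operations, which is the confusion being flagged.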
[GitHub] spark issue #17523: [SPARK-20064][PySpark]
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17523 Thanks for doing this @setjet & welcome to the Spark project :) This change looks good pending jenkins, if everything passes I'll merge it tonight. For others looking at this PR wondering: make-distribution writes its own version number when building, but this version number is used for development builds, so keeping it up-to-date is useful.
[GitHub] spark issue #17523: [SPARK-20064][PySpark]
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17523 Jenkins OK to test.
[GitHub] spark issue #17508: [SPARK-20191][yarn] Crate wrapper for RackResolver so te...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17508 @srowen @tgravescs
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75493/ Test FAILed.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Merged build finished. Test FAILed.
[GitHub] spark issue #17522: [SPARK-18278] [Scheduler] Documentation to point to Kube...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17522 I don't think we should be pointing to third party projects in spark documentation - for example, it might be possible that some other effort gets merged in instead of the above. If/when it does eventually get merged, we can add the appropriate cluster manager entry for it - until then, there are other means of evangelizing user participation.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520 **[Test build #75493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75493/testReport)** for PR 17520 at commit [`4aaab02`](https://github.com/apache/spark/commit/4aaab02b6fa384c51aef8484255f7a51097842be).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17523: [SPARK-20064][PySpark]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17523 Can one of the admins verify this patch?
[GitHub] spark issue #17422: [SPARK-20087] Attach accumulators / metrics to 'TaskKill...
Github user noodle-fb commented on the issue: https://github.com/apache/spark/pull/17422 @JoshRosen ping? not sure how to github correctly
[GitHub] spark pull request #17523: [SPARK-20064][PySpark]
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/17523 [SPARK-20064][PySpark] ## What changes were proposed in this pull request? The PySpark version in version.py was lagging behind. Versioning is in line with PEP 440: https://www.python.org/dev/peps/pep-0440/ ## How was this patch tested? Simply rebuilt the project with the existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark SPARK-20064 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17523 commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d Author: Ruben Janssen Date: 2016-07-16T15:03:19Z added a python example for chisq selector in mllib commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b Author: Ruben Janssen Date: 2016-08-01T18:14:01Z updated documentation to refer to the example commit 035aeb63ef8e8f2af8f7ed838d434a069392c336 Author: Ruben Janssen Date: 2016-10-16T15:00:44Z updated with changes suggested by sethah commit f49e6aea59994c471ea0270b41d5237a1f2a6a47 Author: Ruben Janssen Date: 2016-10-16T15:09:46Z oops forgot to revert back local changes commit a45ff2fa5e5a3633d3de24c5c2f91d59824b0fc8 Author: setjet Date: 2017-04-03T19:18:42Z Merge remote-tracking branch 'upstream/master' commit 8363e28e2d400c599052120153fc08eff8253cd5 Author: setjet Date: 2017-04-03T19:53:02Z increased pyspark version commit 881470d87d499c16cfbf6ea0a265369d60ba8f80 Author: setjet Date: 2017-04-03T21:25:37Z Revert "oops forgot to revert back local changes" This reverts commit f49e6aea59994c471ea0270b41d5237a1f2a6a47. commit 09171936d5d1e9293fee6d28c44d74441a4920ab Author: setjet Date: 2017-04-03T21:26:03Z Revert "updated with changes suggested by sethah" This reverts commit 035aeb63ef8e8f2af8f7ed838d434a069392c336.
commit c15654aa242d486b5eeb7e22e79915a165f6bb99 Author: setjet Date: 2017-04-03T21:26:30Z Revert "updated documentation to refer to the example" This reverts commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b. commit 47e4ab2cf8794718d68b5007f4980aae175eb94e Author: setjet Date: 2017-04-03T21:26:39Z Revert "added a python example for chisq selector in mllib" This reverts commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d.
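As context for the PEP 440 point in the PR description above: PEP 440 defines the canonical public-version format for Python packages (e.g. `2.2.0`, `2.2.0rc1`, `2.2.0.dev0`), which Java-style strings like `2.2.0-SNAPSHOT` do not satisfy. The sketch below is a simplified, stdlib-only check for illustration; it covers only a reduced subset of the full PEP 440 grammar and is not the validation logic used in Spark's version.py.

```python
import re

# Reduced PEP 440 sketch: release segment plus optional pre-, post-,
# and dev-release suffixes. Not the complete PEP 440 grammar.
_PEP440_RE = re.compile(
    r"^\d+(\.\d+)*"      # release segment, e.g. 2.2.0
    r"((a|b|rc)\d+)?"    # optional pre-release, e.g. rc1
    r"(\.post\d+)?"      # optional post-release, e.g. .post1
    r"(\.dev\d+)?$"      # optional dev release, e.g. .dev0
)

def is_pep440(version: str) -> bool:
    """Return True if `version` matches the simplified PEP 440 pattern."""
    return _PEP440_RE.match(version) is not None
```

For example, `is_pep440("2.2.0.dev0")` and `is_pep440("2.2.0rc1")` are true, while the Maven-style `is_pep440("2.2.0-SNAPSHOT")` is false, which is why a separate Python version string is kept in version.py.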
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/17520 cc: @hvanhovell
[GitHub] spark issue #17087: [SPARK-19372][SQL] Fix throwing a Java exception at df.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75487/ Test PASSed.
[GitHub] spark issue #17087: [SPARK-19372][SQL] Fix throwing a Java exception at df.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17087 Merged build finished. Test PASSed.