[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9405#issuecomment-153017266

**[Test build #44805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44805/consoleFull)** for PR 9405 at commit [`78f1e95`](https://github.com/apache/spark/commit/78f1e959f6e9ad1272aef4e6beeac715c37a777b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Corr(`
  * `case class Corr(left: Expression, right: Expression)`

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9405#issuecomment-153017384 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44805/ Test PASSed.
[GitHub] spark pull request: [SPARK-10786][SQL]Take the whole statement to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8895
[GitHub] spark pull request: SPARK-11344: Made ApplicationDescription and D...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/9299#issuecomment-153020674 Looking good to me. Let me leave it a bit for other input.
[GitHub] spark pull request: [SPARK-11450][SQL] Add Unsafe Row processing t...
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/9414 [SPARK-11450][SQL] Add Unsafe Row processing to Expand

This PR enables the Expand operator to process and produce Unsafe Rows.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-11450
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9414.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9414

commit 312885ccda170b34477f26abaf51be3c4a28141a
Author: Herman van Hovell
Date: 2015-11-02T14:13:46Z
Add unsafe row processing to Expand
[GitHub] spark pull request: [SPARK-10533] [SQL] handle scientific notation...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9085#discussion_r43633782

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala ---
@@ -102,8 +106,12 @@ class SqlLexical extends StdLexical {
   override lazy val token: Parser[Token] =
-    ( identChar ~ (identChar | digit).* ^^
-      { case first ~ rest => processIdent((first :: rest).mkString) }
+    ( rep1(digit) ~ ('.' ~> digit.*).? ~ (exp ~> sign.? ~ rep1(digit)) ^^ {
+      case i ~ None ~ (sig ~ rest) =>
+        DecimalLit(i.mkString + "e" + sig.getOrElse("") + rest.mkString)
--- End diff --

Nit: `sig.getOrElse("")` can be simplified to `sig.mkString`.
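As a standalone illustration of the nit (not code from the PR): in Scala, an `Option` converts implicitly to an `Iterable` of zero or one elements, so `mkString` yields `""` for `None` and the value itself for `Some`, exactly matching `getOrElse("")`:

```scala
object OptionMkStringDemo extends App {
  val some: Option[String] = Some("+")
  val none: Option[String] = None

  // mkString joins the option's zero-or-one elements with no separator,
  // so it behaves identically to getOrElse("").
  assert(some.mkString == some.getOrElse(""))  // both "+"
  assert(none.mkString == none.getOrElse(""))  // both ""
  println("sig.mkString is equivalent to sig.getOrElse(\"\")")
}
```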
[GitHub] spark pull request: Spark-6373 Add SSL/TLS for the Netty based Blo...
GitHub user turp1twin opened a pull request: https://github.com/apache/spark/pull/9416 Spark-6373 Add SSL/TLS for the Netty based BlockTransferService

Sorry if this pull request is premature, but I have received very little feedback, so I am going ahead and creating it. I am still open to comments/feedback and can continue to make changes if necessary. Here are some comments about my implementation...

*Configuration:* I added a new SSLOptions member variable to SecurityManager.scala, specifically for configuring SSL for the Block Transfer Service:

```scala
val btsSSLOptions = SSLOptions.parse(sparkConf, "spark.ssl.bts", Some(defaultSSLOptions))
```

I expanded the SSLOptions case class to capture additional SSL related parameters:

```scala
private[spark] case class SSLOptions(
    enabled: Boolean = false,
    keyStore: Option[File] = None,
    keyStorePassword: Option[String] = None,
    privateKey: Option[File] = None,
    keyPassword: Option[String] = None,
    certChain: Option[File] = None,
    trustStore: Option[File] = None,
    trustStorePassword: Option[String] = None,
    trustStoreReloadingEnabled: Boolean = false,
    trustStoreReloadInterval: Int = 1,
    openSslEnabled: Boolean = false,
    protocol: Option[String] = None,
    enabledAlgorithms: Set[String] = Set.empty)
```

I added the ability to provide a standard Java keystore and truststore, as was possible with the existing file server and Akka SSL configurations available in SecurityManager.scala. When using a keystore/truststore, I also added the ability to enable truststore reloading (Hadoop encrypted shuffle allows for this). In addition, I added the ability to specify an X.509 certificate chain in PEM format and a PKCS#8 private key file in PEM format. If all four parameters are provided (keyStore, trustStore, privateKey, certChain), then the privateKey and certChain parameters will be used.
In TransportConf.java I added two additional configuration parameters:

```java
public int sslShuffleChunkSize() { return conf.getInt("spark.shuffle.io.ssl.chunkSize", 60 * 1024); }
public boolean sslShuffleEnabled() { return conf.getBoolean("spark.ssl.bts.enabled", false); }
```

For the _"spark.shuffle.io.ssl.chunkSize"_ config param I set the default to the same size used in Hadoop's encrypted shuffle implementation.

*Implementation:* For this implementation, I opted to disrupt as little code as possible, meaning I wanted to avoid any major refactoring... Basically the TransportContext class handles the SSL setup internally based on settings in the passed TransportConf. This way none of the method signatures (i.e., createServer, etc.) had to change. I opted not to use the TransportClientBootstrap/TransportServerBootstrap interfaces, as they were not a good fit. The TransportClientBootstrap is called too late, as the client Netty pipeline for SSL needs to be set up earlier in the connection process. The TransportServerBootstrap could have been used, but IMO it would have been a bit hacky, as the doBootstrap method takes an RpcHandler and returns one, which in the case of SSL bootstrapping is not needed. Also, using only the TransportServerBootstrap and not the TransportClientBootstrap would have made its usage seem inconsistent. Anyway, these are just some initial comments about the implementation. Definitely looking for feedback... If someone has a better alternative I am all for it; I just wanted to get something working with minimally invasive changes to the codebase... This is a pretty important feature for my company, as we are in the healthcare space and require HIPAA compliance (data encrypted at rest and in transit). Thanks!
Jeff

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/turp1twin/spark SPARK-6373
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9416.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9416

commit fd2980ab8cc1fc5b4626bb7a0d1e94128ca3874d
Author: turp1twin
Date: 2015-10-31T20:26:14Z
Merged ssl-shuffle-latest

commit a7f915aecea4492d9a41b7310eb465cb32d7ef14
Author: turp1twin
Date: 2015-11-01T20:50:09Z
Added new SSL Netty Shuffle test and SSL YarnShuffleService test, cleaned up merge issue in TransportServer.
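To illustrate the pipeline-ordering constraint discussed above (why a bootstrap step that runs after the connection is established is "too late" for SSL), here is a minimal, hypothetical Netty sketch; the class and parameter names are illustrative only and are not the PR's code. The `SslHandler` must be the first handler in the channel pipeline so the TLS handshake wraps every byte the later handlers see:

```scala
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.{SslContext, SslContextBuilder}

// Hypothetical initializer: a real deployment would configure key material
// (keystore or PEM key/cert chain) instead of the JDK defaults used here.
class SslChannelInitializer(enableSsl: Boolean) extends ChannelInitializer[SocketChannel] {
  private lazy val sslCtx: SslContext = SslContextBuilder.forClient().build()

  override def initChannel(ch: SocketChannel): Unit = {
    if (enableSsl) {
      // SSL must be installed before any codec or handler that reads bytes,
      // which is why it has to happen at pipeline-construction time rather
      // than in a post-connection bootstrap step.
      ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()))
    }
    // ... remaining transport handlers (frame decoder, message handler, ...) go here
  }
}
```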
[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9417#issuecomment-153059875 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9405#issuecomment-153017380 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153050470

**[Test build #44813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)** for PR 9415 at commit [`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7753#issuecomment-153050399 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153050637 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/ Test PASSed.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153050635 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/9417 [SPARK-11449][Core] PortableDataStream should be a factory

`PortableDataStream` maintains some internal state. This makes it tricky to reuse a stream (one needs to call `close` on both the `PortableDataStream` and the `InputStream` it produces). This PR removes all state from `PortableDataStream` and effectively turns it into an `InputStream`/`Array[Byte]` factory. This makes the user responsible for managing the `InputStream` it returns. cc @srowen

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-11449
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9417.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9417

commit 91b8c6cc86c83df161e176c7a4efbc3dd439d037
Author: Herman van Hovell
Date: 2015-11-02T15:42:44Z
Removed state from PortableDataStream
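The factory shape proposed above can be sketched in miniature (this is an illustrative standalone example, not the PR's code): a stateless factory hands out a fresh `InputStream` on every call, so each stream's lifecycle belongs entirely to the caller and the factory itself never needs a `close`.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Illustrative stateless factory: every open() returns an independent stream.
class ByteStreamFactory(bytes: Array[Byte]) {
  def open(): InputStream = new ByteArrayInputStream(bytes)
  def toArray(): Array[Byte] = bytes.clone()
}

object FactoryDemo extends App {
  val factory = new ByteStreamFactory("hello".getBytes("UTF-8"))

  // The caller owns the stream it opened and is responsible for closing it.
  val in = factory.open()
  try {
    assert(in.read() == 'h'.toInt)
  } finally in.close()

  // Closing one stream does not affect the factory; it can keep producing.
  assert(factory.open().read() == 'h'.toInt)
}
```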
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153012486 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10786][SQL]Take the whole statement to ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/8895#issuecomment-153014703 LGTM. Thanks for fixing this! Merging to master.
[GitHub] spark pull request: [SPARK-11450][SQL] Add Unsafe Row processing t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9414#issuecomment-153028650 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/9413#issuecomment-153031528

In the current implementation we provide `Std. Error` for the `coefficients` except the `intercept`, because we use an optimized method to calculate the `intercept`. If we want to calculate `Std. Error` for the `intercept`, we need to concat the `aBar` array with `aaBar.values`, like:

```scala
val newAtA = Array.concat(aaBar.values, summary.aBar.toArray, Array(1.0))
val newAtB = Array.concat(abBar.values, Array(bBar))
val xWithIntercept = CholeskyDecomposition.solve(newAtA, newAtB)
val newAtAi = CholeskyDecomposition.inverse(newAtA, summary.k)
```

I'm afraid this will cause performance degradation, so I propose outputting `Std. Error` only for the `coefficients`. Maybe we should discuss this here, or figure out a better way to output `Std. Error` for the `intercept`. @mengxr
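As background for why the inverse of the normal-equations matrix appears in the snippet above: for ordinary least squares, the standard error of coefficient j is sqrt(sigma^2 * [(A^T A)^-1]_jj), a general OLS fact rather than anything specific to this PR. A minimal sketch for a single-feature model with intercept, using an explicit 2x2 inverse (toy data, illustrative only):

```scala
object StdErrorSketch extends App {
  // Toy data: y is roughly 2x + 1 with a little noise.
  val x = Array(1.0, 2.0, 3.0, 4.0)
  val y = Array(3.1, 4.9, 7.2, 8.8)
  val n = x.length

  // Design matrix columns are [x, 1]; build AtA (2x2, row-major) and Atb directly.
  val atA = Array(
    x.map(v => v * v).sum, x.sum,
    x.sum,                 n.toDouble)
  val atB = Array(x.zip(y).map { case (a, b) => a * b }.sum, y.sum)

  // Solve the normal equations via the explicit 2x2 inverse.
  val det = atA(0) * atA(3) - atA(1) * atA(2)
  val inv = Array(atA(3) / det, -atA(1) / det, -atA(2) / det, atA(0) / det)
  val slope = inv(0) * atB(0) + inv(1) * atB(1)
  val intercept = inv(2) * atB(0) + inv(3) * atB(1)

  // Residual variance with n - 2 degrees of freedom, then standard errors
  // from the diagonal of sigma^2 * (AtA)^-1.
  val residuals = x.zip(y).map { case (a, b) => b - (slope * a + intercept) }
  val sigma2 = residuals.map(r => r * r).sum / (n - 2)
  val seSlope = math.sqrt(sigma2 * inv(0))
  val seIntercept = math.sqrt(sigma2 * inv(3))

  println(f"slope=$slope%.3f +/- $seSlope%.3f, intercept=$intercept%.3f +/- $seIntercept%.3f")
}
```

This also shows the cost concern: getting a standard error for the intercept requires the inverse of the full augmented matrix, not just the solve.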
[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/9398#issuecomment-153050315 retest this please
[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7753#issuecomment-153050402 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44815/ Test FAILed.
[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7753#issuecomment-153050394

**[Test build #44815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44815/consoleFull)** for PR 7753 at commit [`a8fcf74`](https://github.com/apache/spark/commit/a8fcf74752e2bfed697280d00383b137b1994ae7).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class ExecutorMetrics extends Serializable`
  * `case class TransportMetrics(`
  * `class MemoryListener extends SparkListener`
  * `class MemoryUIInfo`
  * `class TransportMemSize`
  * `case class MemTime(memorySize: Long = 0L, timeStamp: Long = 0L)`
  * `case class Corr(`
  * `case class Corr(left: Expression, right: Expression)`
[GitHub] spark pull request: [SPARK-11455][SQL] fix case sensitivity of par...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9410#issuecomment-153030063 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44806/ Test PASSed.
[GitHub] spark pull request: [SPARK-10533] [SQL] handle scientific notation...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9085#discussion_r43633899

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -169,6 +169,26 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     checkAnswer(
       testData.filter("key > 90"),
       testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+    checkAnswer(
+      testData.filter("key > 9.0e1"),
+      testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+    checkAnswer(
+      testData.filter("key > .9e+2"),
+      testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+    checkAnswer(
+      testData.filter("key > 0.9e+2"),
+      testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+    checkAnswer(
+      testData.filter("key > 900e-1"),
+      testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+    checkAnswer(
+      testData.filter("key > 900.0E-1"),
+      testData.collect().filter(_.getInt(0) > 90).toSeq)
--- End diff --

Could you please add a case for literals like `9.e+1`? This should be accepted according to the defined parser rules.
[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7753#issuecomment-153048340 Merged build triggered.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153050846 **[Test build #44814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44814/consoleFull)** for PR 9399 at commit [`7c17dd1`](https://github.com/apache/spark/commit/7c17dd1adf1d6134a67a07a73f3ffe56d713b6c9).
[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/9398#issuecomment-153057423 Thanks @andrewor14 , I'll look into this soon!
[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9413#issuecomment-153029589 **[Test build #44812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44812/consoleFull)** for PR 9413 at commit [`655fb43`](https://github.com/apache/spark/commit/655fb436950e44e1783a2bc3767e40a0295ce83f).
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153040844 Merged build triggered.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153043608 **[Test build #44813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)** for PR 9415 at commit [`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).
[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/9417#issuecomment-153059923 @kmader what do you think of this one?
[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9408#issuecomment-153013113 **[Test build #44804 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44804/consoleFull)** for PR 9408 at commit [`b1512b0`](https://github.com/apache/spark/commit/b1512b0bcb80d5621f43954989403a85fdab0960).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Corr(`
  * `case class Corr(left: Expression, right: Expression)`
  * `case class RepartitionByExpression(`
  * `logInfo(s"Hive class not found $e")`
  * `logDebug("Hive class not found", e)`
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153012488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153013074 **[Test build #44810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/consoleFull)** for PR 9412 at commit [`7e07efe`](https://github.com/apache/spark/commit/7e07efe701cf9dffaaf8411b108bdd2b3ca99f91).
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153012481 **[Test build #44811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/consoleFull)** for PR 9399 at commit [`b658aaa`](https://github.com/apache/spark/commit/b658aaa8f5221ba55c1acbba21e496cc40ff6f45).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/9415 [SPARK-11458][SQL] add word count example for Dataset
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/cloud-fan/spark wordcount
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9415.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9415
commit 788c1a1675b1470d09175cdf31f2131e8d4767ac
Author: Wenchen Fan
Date: 2015-11-02T14:40:52Z
    word count example for Dataset
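The PR above adds a word-count example for the new Dataset API. The underlying transformation (split lines into words, group by word, count each group) is independent of Spark; a minimal plain-Python sketch of the same logic, using hypothetical sample input rather than anything from the PR, is:

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across an iterable of text lines.

    Mirrors the flatMap(split) -> groupBy(word) -> count shape of the
    Dataset example, without any Spark dependency.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

# Hypothetical sample input, not taken from the PR itself.
print(word_count(["hello spark", "hello dataset api"]))
# {'hello': 2, 'spark': 1, 'dataset': 1, 'api': 1}
```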
[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/9413#discussion_r43634982
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
```scala
@@ -471,6 +484,59 @@ class LinearRegressionSummary private[regression] (
     predictions.select(t(col(predictionCol), col(labelCol)).as("residuals"))
   }

+  lazy val numInstances: Long = predictions.count()
+
+  lazy val dfe = if (model.getFitIntercept) {
+    numInstances - model.weights.size - 1
+  } else {
+    numInstances - model.weights.size
+  }
+
+  lazy val devianceResiduals: Array[Double] = {
+    val weighted = if (model.getWeightCol.isEmpty) lit(1.0) else sqrt(col(model.getWeightCol))
+    val dr = predictions.select(col(model.getLabelCol).minus(col(model.getPredictionCol))
+      .multiply(weighted).as("weightedResiduals"))
+      .select(min(col("weightedResiduals")).as("min"), max(col("weightedResiduals")).as("max"))
+      .take(1)(0)
+    Array(dr.getDouble(0), dr.getDouble(1))
+  }
```
--- End diff -- DataFrame currently does not provide an interface to calculate percentiles (only the Hive UDAF), so here we only provide the max and min of the deviance residuals. [SPARK-9299](https://issues.apache.org/jira/browse/SPARK-9299) works on providing `percentile` and `percentile_approx` aggregate functions; once it is resolved we can provide deviance residuals at the 0.25, 0.5, and 0.75 quantiles.
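The diff under discussion computes weighted residuals as (label − prediction) · sqrt(weight) and then keeps only their min and max, deferring quantiles to SPARK-9299. A sketch of that computation in plain Python, assuming simple parallel lists instead of a DataFrame, is:

```python
import math

def deviance_residual_extremes(labels, predictions, weights=None):
    """Return (min, max) of weighted residuals (label - prediction) * sqrt(w).

    weights=None treats every weight as 1.0, matching the lit(1.0)
    fallback in the Scala diff above.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    residuals = [(y - p) * math.sqrt(w)
                 for y, p, w in zip(labels, predictions, weights)]
    return min(residuals), max(residuals)

lo, hi = deviance_residual_extremes([3.0, 1.0, 2.0], [2.5, 1.5, 2.0])
print(lo, hi)  # -0.5 0.5
```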
[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9411#issuecomment-153028631 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44809/
[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9411#issuecomment-153028628 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153028669 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/
[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9411#issuecomment-153028484 **[Test build #44809 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44809/consoleFull)** for PR 9411 at commit [`058d824`](https://github.com/apache/spark/commit/058d8245467aa36225789b9d0cef8a95e3d970d5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153028668 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9415#issuecomment-153040919 Merged build started.
[GitHub] spark pull request: Spark-6373 Add SSL/TLS for the Netty based Blo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9416#issuecomment-153058457 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9417#issuecomment-153060623 **[Test build #1967 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1967/consoleFull)** for PR 9417 at commit [`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).
[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9408#issuecomment-153013228 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9408#issuecomment-153013229 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44804/
[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/9413 [SPARK-9836] [ML] Provide R-like summary statistics for OLS via normal equation solver
https://issues.apache.org/jira/browse/SPARK-9836
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/yanboliang/spark spark-9836
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9413.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9413
commit 655fb436950e44e1783a2bc3767e40a0295ce83f
Author: Yanbo Liang
Date: 2015-11-02T14:07:56Z
    Provide R-like summary statistics for OLS via normal equation solver
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153028495 **[Test build #44810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/consoleFull)** for PR 9412 at commit [`7e07efe`](https://github.com/apache/spark/commit/7e07efe701cf9dffaaf8411b108bdd2b3ca99f91).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11455][SQL] fix case sensitivity of par...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9410#issuecomment-153029895 **[Test build #44806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44806/consoleFull)** for PR 9410 at commit [`0f552c4`](https://github.com/apache/spark/commit/0f552c42926bfbd368eb9e19cc93ed625cd67d7f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11311] [SQL] spark cannot describe temp...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9277#issuecomment-153044589 LGTM, merging to master. Thanks!
[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7753#issuecomment-153048362 Merged build started.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153048329 Merged build triggered.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153048363 Merged build started.
[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9398#issuecomment-153051220 Merged build started.
[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9398#issuecomment-153051172 Merged build triggered.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153011716 Merged build started.
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153011717 Merged build started.
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/9412 [SPARK-11457][Streaming][YARN] Fix incorrect AM proxy filter conf recovery from checkpoint
Currently the YARN AM proxy filter configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to some unwanted behaviors:
1. Wrong RM address if the RM is redeployed after a failure.
2. Wrong proxyBase, since the app id is updated and the old app id used for proxyBase is stale.
So instead of recovering them from the checkpoint file, these configurations should be reloaded each time the app starts. This problem only exists in YARN cluster mode; in YARN client mode these configurations are updated with the RPC message `AddWebUIFilter`.
Please help to review @tdas @harishreedharan @vanzin, thanks a lot.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/jerryshao/apache-spark SPARK-11457
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9412.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9412
commit 7e07efe701cf9dffaaf8411b108bdd2b3ca99f91
Author: jerryshao
Date: 2015-11-02T12:30:35Z
    Fix Spark Streaming checkpoint with Yarn-cluster configuration recovery issue
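The fix described above boils down to not trusting deployment-specific keys from the checkpoint: values that depend on the current YARN deployment (RM address, proxy base) must come from the live environment, not the snapshot. A hedged sketch of that merge policy, with invented key names that are not the actual Spark configuration keys, is:

```python
# Keys that must always reflect the *current* cluster, never the checkpoint.
# These names are illustrative only, not real Spark/YARN configuration keys.
RELOAD_KEYS = {"yarn.web-proxy.address", "yarn.app.proxy-base"}

def restore_conf(checkpointed, current):
    """Rebuild app config from a checkpoint, reloading deployment-specific keys."""
    conf = dict(checkpointed)
    for key in RELOAD_KEYS:
        conf.pop(key, None)           # drop the stale checkpointed value
        if key in current:
            conf[key] = current[key]  # take the freshly provided one
    return conf

restored = restore_conf(
    {"app.name": "job", "yarn.app.proxy-base": "/proxy/app_old"},
    {"yarn.app.proxy-base": "/proxy/app_new"},
)
print(restored)  # {'app.name': 'job', 'yarn.app.proxy-base': '/proxy/app_new'}
```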
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153011694 Merged build triggered.
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153011679 Merged build triggered.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153012036 **[Test build #44811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/consoleFull)** for PR 9399 at commit [`b658aaa`](https://github.com/apache/spark/commit/b658aaa8f5221ba55c1acbba21e496cc40ff6f45).
[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/8643#discussion_r43657655
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala ---
```scala
/* Apache License, Version 2.0 header omitted for brevity */
package org.apache.spark.mllib.api.python

import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.mllib.clustering.LDAModel
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.{DataFrame, SQLContext}

/**
 * Wrapper around LDAModel to provide helper methods in Python
 */
private[python] class LDAModelWrapper(model: LDAModel) {

  def topicsMatrix(): Matrix = model.topicsMatrix

  def vocabSize(): Int = model.vocabSize

  def describeTopics(jsc: JavaSparkContext): DataFrame = describeTopics(this.model.vocabSize, jsc)

  def describeTopics(maxTermsPerTopic: Int, jsc: JavaSparkContext): DataFrame = {
    // Since the return value of `describeTopics` is a little complicated,
    // it is converted into `Row` to take advantage of DataFrame serialization.
    val sqlContext = new SQLContext(jsc.sc)
    val topics = model.describeTopics(maxTermsPerTopic)
    sqlContext.createDataFrame(topics).toDF("terms", "termWeights")
  }
}
```
--- End diff -- Serializing a DataFrame will trigger a Spark job; we could still use Pickle to serialize them without a DataFrame, via `PythonMLLibAPI.dumps()`.
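The review comment suggests skipping the DataFrame round-trip (which triggers a Spark job) and shipping the topic descriptions to Python via plain pickle serialization instead. The idea, sketched with Python's stdlib `pickle` on a hypothetical list of (term indices, term weights) pairs rather than a real `describeTopics` result, is:

```python
import pickle

# Hypothetical describeTopics output: one (term indices, term weights) pair per topic.
topics = [([0, 3, 7], [0.5, 0.3, 0.2]),
          ([1, 2], [0.6, 0.4])]

# Serializing the plain structure needs no distributed computation,
# unlike collecting the same data through a DataFrame.
payload = pickle.dumps(topics)
restored = pickle.loads(payload)
print(restored == topics)  # True
```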
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9404#issuecomment-153099394 ok to test
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9404#issuecomment-153099409 test this please
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9404#issuecomment-153099382 add to whitelist
[GitHub] spark pull request: [SPARK-9162][SQL] Implement code generation fo...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/9270#discussion_r43658099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala --- @@ -959,6 +963,122 @@ case class ScalaUDF( } } + // Generate code used to convert the arguments to Scala types for user-defined functions + private[this] def genCodeForConverter(ctx: CodeGenContext, index: Int): String = { +val converterClassName = classOf[Any => Any].getName +val typeConvertersClassName = CatalystTypeConverters.getClass.getName + ".MODULE$" +val expressionClassName = classOf[Expression].getName +val scalaUDFClassName = classOf[ScalaUDF].getName + +val converterTerm = ctx.freshName("converter" + index.toString) --- End diff -- Why `index` here? All names returned by `freshName` are already unique.
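The review point can be illustrated with a minimal Python sketch of such a generator (a hypothetical stand-in for `CodeGenContext.freshName`, not Spark's code): a single shared counter already guarantees uniqueness, so appending `index` to the prefix is redundant:

```python
import itertools

class FreshNameGenerator:
    # Hypothetical sketch: one counter makes every generated identifier
    # unique regardless of prefix, so callers never need to mix in an
    # index themselves.
    def __init__(self):
        self._counter = itertools.count()

    def fresh_name(self, prefix):
        return f"{prefix}{next(self._counter)}"

ctx = FreshNameGenerator()
names = [ctx.fresh_name("converter") for _ in range(3)]
```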
[GitHub] spark pull request: SPARK-11371 Make "mean" an alias for "avg" ope...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9332#issuecomment-153103900 @ted-yu Can you modify a test to use this alias?
[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/9412#issuecomment-153115388 LGTM as far as I understand this code.
[GitHub] spark pull request: [SPARK-10997] [core] Add "client mode" to nett...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/9210#issuecomment-153116416 Merging this to master.
[GitHub] spark pull request: [SPARK-10622] [core] [yarn] Differentiate dead...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/8887#issuecomment-153117330 Hi @kayousterhout , are you ok with my explanation above? I'd like to get this in soon.
[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9380#discussion_r43662570 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala --- @@ -1135,7 +992,76 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w moments(4) = buffer.getDouble(fourthMomentOffset) } -getStatistic(n, mean, moments) +if (n == 0.0) null +else if (n == 1.0) 0.0 --- End diff -- I don't believe we want this behavior, since these edge cases should be handled in each `getStatistic` implementation. As established in the [previous PR](https://github.com/apache/spark/pull/9003), `Skewness` and `Kurtosis` should yield `Double.NaN` when `n == 1.0`, but other functions like `VariancePop` should yield 0.0.
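The edge-case convention described above — `null` for `n == 0`, `0.0` for population variance at `n == 1`, `Double.NaN` for skewness at `n == 1` — can be sketched in plain Python. This is an illustration of the agreed behavior, not Spark's `ImperativeAggregate` code:

```python
import math

def variance_pop(xs):
    # n == 0 -> None (SQL null); n == 1 -> 0.0 by definition.
    n = len(xs)
    if n == 0:
        return None
    if n == 1:
        return 0.0
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def skewness(xs):
    # n == 0 -> None; n == 1 (or zero variance) -> NaN, per the earlier PR.
    n = len(xs)
    if n == 0:
        return None
    if n == 1:
        return float("nan")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    if m2 == 0.0:
        return float("nan")
    return m3 / m2 ** 1.5
```

The key point of the comment is that these branches differ per statistic, so a shared `n == 1.0 => 0.0` shortcut in the aggregate base class would be wrong for skewness and kurtosis.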
[GitHub] spark pull request: [SPARK-10622] [core] [yarn] Differentiate dead...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8887#issuecomment-153117844 Merged build started.
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/9404#discussion_r43662693 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala --- @@ -353,4 +354,44 @@ class CachedTableSuite extends QueryTest with SharedSQLContext { assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) assert(sparkPlan.collect { case e: PhysicalRDD => e }.size === 0) } + + /** + * Verifies that the plan for `df` contains `expected` number of Exchange operators. + */ + private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = { +assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.size == expected) + } + + test("A cached table preserves the partitioning and ordering of its cached SparkPlan") { +val table3x = testData.unionAll(testData).unionAll(testData) +table3x.registerTempTable("testData3x") + +sql("SELECT key, value FROM testData3x ORDER BY key").registerTempTable("orderedTable") +sqlContext.cacheTable("orderedTable") +assertCached(sqlContext.table("orderedTable")) +// Should not have an exchange as the query is already sorted on the group by key. +verifyNumExchanges(sql("SELECT key, count(*) FROM orderedTable GROUP BY key"), 0) +checkAnswer( + sql("SELECT key, count(*) FROM orderedTable GROUP BY key ORDER BY key"), + sql("SELECT key, count(*) FROM testData3x GROUP BY key ORDER BY key").collect()) +sqlContext.uncacheTable("orderedTable") + +// Set up two tables distributed in the same way. +testData.distributeBy(Column("key") :: Nil, 5).registerTempTable("t1") +testData2.distributeBy(Column("a") :: Nil, 5).registerTempTable("t2") +sqlContext.cacheTable("t1") +sqlContext.cacheTable("t2") + +// Joining them should result in no exchanges. +verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t2 t2 ON t1.key = t2.a"), 0) --- End diff -- ah, it seems partitioning the data into `5` partitions does the trick here (the default parallelism is set to 5 in our tests).
If you change it to something like `10`, this test will fail... Unfortunately, we do not have the concept of an equivalence class right now. So, at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L229-L240, the `allCompatible` method does not really do what we want here (btw, `allCompatible` tries to make sure that the partitioning schemes of all children are compatible with each other, i.e. that they partition the data with the same partitioner and the same number of partitions).
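The `allCompatible` idea explained above — no Exchange only when all children share the same partitioner and the same number of partitions — can be sketched in Python. This is a hypothetical model of the check, not the actual `Exchange.scala` logic:

```python
def all_compatible(children):
    # Children avoid a shuffle only if every one of them is partitioned
    # with the same partitioner and into the same number of partitions.
    first = children[0]
    return all(
        c["partitioner"] == first["partitioner"]
        and c["num_partitions"] == first["num_partitions"]
        for c in children
    )

t1 = {"partitioner": "hash", "num_partitions": 5}
t2 = {"partitioner": "hash", "num_partitions": 5}
t2_repartitioned = {"partitioner": "hash", "num_partitions": 10}
```

With both sides hash-distributed into 5 partitions the join needs no Exchange; bump one side to 10 partitions and the check fails, which mirrors why the test above only passes at the default parallelism of 5.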
[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153119257 Merged build started.
[GitHub] spark pull request: [SPARK-10997] [core] Add "client mode" to nett...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/9210#issuecomment-153119171 Just took another look. LGTM
[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153121113 **[Test build #44820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)** for PR 9383 at commit [`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84).
[GitHub] spark pull request: [SPARK-10827] [CORE] AppClient should not use ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9317#issuecomment-153121675 Merged build started.
[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153121656 **[Test build #44820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)** for PR 9383 at commit [`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153121664 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/ Test FAILed.
[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9415#discussion_r43648991 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// scalastyle:off println +package org.apache.spark.examples.sql + +import org.apache.spark.sql.{Dataset, SQLContext} +import org.apache.spark.{SparkContext, SparkConf} + +object DatasetWordCount { + def main(args: Array[String]): Unit = { +val sparkConf = new SparkConf().setAppName("DatasetWordCount") +val sc = new SparkContext(sparkConf) +val sqlContext = new SQLContext(sc) + +// Importing the SQL context gives access to all the SQL functions and implicit conversions. 
+import sqlContext.implicits._ + +val lines: Dataset[String] = Seq("hello world", "say hello to the world").toDS() +val words: Dataset[(String, Int)] = lines.flatMap(_.split(" ")).map(word => word -> 1) +val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups { + case (word, iter) => Iterator(word -> iter.length) +} + +counts.foreach { case (word, count) => println(s"$word: $count") } --- End diff -- We should `collect()` here so this doesn't run on the executors.
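The flatMap-then-count pipeline in the example above can be mirrored in a few lines of plain Python, which also makes the expected result concrete (the review point stands: in Spark, the counts should be brought back to the driver with `collect()` before printing):

```python
from collections import Counter

lines = ["hello world", "say hello to the world"]
# flatMap(_.split(" ")) then per-word counting, mirroring
# groupBy(_._1).mapGroups { (word, iter) => word -> iter.length }
counts = Counter(word for line in lines for word in line.split(" "))
```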
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153095366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44814/ Test PASSed.
[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9399#issuecomment-153095363 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/8643#discussion_r43658405 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.mllib.api.python + +import org.apache.spark.SparkContext +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.mllib.clustering.LDAModel +import org.apache.spark.mllib.linalg.Matrix +import org.apache.spark.sql.{DataFrame, SQLContext} + +/** + * Wrapper around LDAModel to provide helper methods in Python + */ +private[python] class LDAModelWrapper(model: LDAModel) { + + def topicsMatrix(): Matrix = model.topicsMatrix + + def vocabSize(): Int = model.vocabSize + + def describeTopics(jsc: JavaSparkContext): DataFrame = describeTopics(this.model.vocabSize, jsc) + + def describeTopics(maxTermsPerTopic: Int, jsc: JavaSparkContext): DataFrame = { +// Since the return value of `describeTopics` is a little complicated, +// it is converted into `Row` to take advantage of DataFrame serialization. 
+val sqlContext = new SQLContext(jsc.sc) +val topics = model.describeTopics(maxTermsPerTopic) +sqlContext.createDataFrame(topics).toDF("terms", "termWeights") --- End diff -- @davies thanks for the comment. Should we rather use `PythonMLlibAPI.dumps()` than Java `Any` types like below? https://github.com/yu-iskw/spark/commit/e1c66d050f7c4edbe1bf4e3b57b145cc62c23630#diff-71f42172be0b5fc14827b7bb31f4e80bR34
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9404#issuecomment-153100427 Merged build triggered.
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9404#issuecomment-153100484 Merged build started.
[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9392#issuecomment-153104431 **[Test build #1969 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1969/consoleFull)** for PR 9392 at commit [`a7c395f`](https://github.com/apache/spark/commit/a7c395f9fe7bf43b1e63af060b425fa6047b25f9).
[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/9392#issuecomment-153104500 LGTM
[GitHub] spark pull request: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9276#issuecomment-153106533 Merged build started.
[GitHub] spark pull request: [SPARK-9817][YARN] Improve the locality calcul...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/8100#issuecomment-153115647 Merging to master.
[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation
Github user davies commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153118810 After some benchmarking, I realized that using the hashcode as the sort prefix in timsort causes a regression in both timsort and Snappy compression (especially for aggregation after a join, where the order of records becomes random). I will revert that part. benchmark code: ``` sqlContext.setConf("spark.sql.shuffle.partitions", "1") N = 1<<25 M = 1<<20 df = sqlContext.range(N).selectExpr("id", "repeat(id, 2) as s") df.show() df2 = df.select(df.id.alias('id2'), df.s.alias('s2')) j = df.join(df2, df.id==df2.id2).groupBy(df.s).max("id", "id2") n = j.count() ``` Another interesting finding is that Snappy slows down spilling by 50% of end-to-end time; LZ4 is faster than Snappy, but still 10% slower than no compression. Should we use `false` as the default value for `spark.shuffle.spill.compress`? (PS: tested on a Mac with an SSD; it may not hold on a spinning disk)
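The compression regression described above has a simple standalone illustration: the same records compress worse once their order is randomized, because the compressor's back-references lose locality. Python's zlib stands in for Snappy/LZ4 here — an illustrative sketch, not the Spark benchmark:

```python
import random
import zlib

# The same 5000 records, laid out once sorted by key and once shuffled
# (hash-prefix ordering makes the shuffle output effectively random).
records = [f"key{i:06d}".encode() for i in range(5000)]
sorted_blob = b"".join(records)

shuffled = records[:]
random.seed(0)
random.shuffle(shuffled)
shuffled_blob = b"".join(shuffled)

sorted_size = len(zlib.compress(sorted_blob))
shuffled_size = len(zlib.compress(shuffled_blob))
```

Adjacent sorted keys share long prefixes, so the sorted layout yields cheap, regular matches; the shuffled layout forces the compressor to encode mostly literals.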
[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/9214#discussion_r43664109 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java --- @@ -121,13 +125,22 @@ public BypassMergeSortShuffleWriter( } @Override - public void write(Iterator<Product2<K, V>> records) throws IOException { + public Seq<...> write(Iterator<Product2<K, V>> records) throws IOException { assert (partitionWriters == null); +final File indexFile = shuffleBlockResolver.getIndexFile(shuffleId, mapId); +final File dataFile = shuffleBlockResolver.getDataFile(shuffleId, mapId); if (!records.hasNext()) { partitionLengths = new long[numPartitions]; - shuffleBlockResolver.writeIndexFile(shuffleId, mapId, partitionLengths); + final File tmpIndexFile = shuffleBlockResolver.writeIndexFile(shuffleId, mapId, partitionLengths); mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths); - return; + // create empty data file so we always commit same set of shuffle output files, even if + // data is non-deterministic + final File tmpDataFile = blockManager.diskBlockManager().createTempShuffleBlock()._2(); + tmpDataFile.createNewFile(); --- End diff -- Check the return value? This method doesn't throw on error.
[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/9214#discussion_r43664703 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java --- @@ -121,13 +125,22 @@ public BypassMergeSortShuffleWriter( } @Override - public void write(Iterator<Product2<K, V>> records) throws IOException { + public Seq<...> write(Iterator<Product2<K, V>> records) throws IOException { --- End diff -- Could you add a javadoc explaining what the return value is? It's particularly cryptic because it uses tuples; maybe it would be better to create a helper type where the fields have proper names.
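vanzin's suggestion — a helper type with named fields instead of an anonymous tuple — looks like this in a Python sketch (`ShuffleOutput` and its fields are hypothetical names, not from the PR):

```python
from typing import NamedTuple

class ShuffleOutput(NamedTuple):
    # Hypothetical helper type replacing a bare (block_id, file) tuple:
    # the field names document what the writer actually returns.
    block_id: str
    file_path: str

out = ShuffleOutput(block_id="temp_shuffle_0", file_path="/tmp/shuffle_0.data")
```

`out.block_id` reads unambiguously where `out[0]` would force the reader back to the javadoc; the named type costs nothing at the call site since it is still tuple-compatible.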
[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9392
[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/9214#discussion_r43664936 --- Diff: core/src/main/scala/org/apache/spark/shuffle/FileShuffleBlockResolver.scala --- @@ -132,6 +134,15 @@ private[spark] class FileShuffleBlockResolver(conf: SparkConf) logWarning(s"Error deleting ${file.getPath()}") } } +for (mapId <- state.completedMapTasks.asScala) { + val mapStatusFile = + blockManager.diskBlockManager.getFile(ShuffleMapStatusBlockId(shuffleId, mapId)) + if (mapStatusFile.exists()) { +if (!mapStatusFile.delete()) { --- End diff -- nit: you could merge the two `if`s.
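The nit can be illustrated in a Python sketch: the nested `if exists { if !delete }` collapses into one condition, since the only warning case is "file exists but could not be deleted" (the helper names are hypothetical, not Spark code):

```python
from pathlib import Path
import tempfile

def _try_delete(p: Path) -> bool:
    try:
        p.unlink()
        return True
    except OSError:
        return False

def delete_with_warning(p: Path) -> bool:
    # Merged form of the two nested ifs: one condition, one warning path.
    if p.exists() and not _try_delete(p):
        print(f"Error deleting {p}")
        return False
    return True

f = Path(tempfile.mkdtemp()) / "mapstatus"
f.write_text("x")
deleted = delete_with_warning(f)
```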
[GitHub] spark pull request: [SPARK-11438] [SQL] Allow users to define nond...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/9393#discussion_r43657874

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -191,4 +193,86 @@ class UDFSuite extends QueryTest with SharedSQLContext {
     // pass a decimal to intExpected.
     assert(sql("SELECT intExpected(1.0)").head().getInt(0) === 1)
   }
+
+  private def checkNumUDFs(df: DataFrame, expectedNumUDFs: Int): Unit = {
+    val udfs = df.queryExecution.optimizedPlan.collect {
+      case p: logical.Project => p.projectList.flatMap {
+        case e => e.collect {
+          case udf: ScalaUDF => udf
+        }
+      }
+    }.flatMap(functions => functions)
+    assert(udfs.length === expectedNumUDFs)
+  }
+
+  test("nondeterministic udf: using UDFRegistration") {
+    import org.apache.spark.sql.functions._
+
+    val deterministicUDF = sqlContext.udf.register("plusOne1", (x: Int) => x + 1)
+    val nondeterministicUDF = deterministicUDF.nonDeterministic
+    sqlContext.udf.register("plusOne2", nondeterministicUDF)
+
+    {
+      val df = sql("SELECT 1 as a")
+        .select(col("a"), deterministicUDF(col("a")).as("b"))
+        .select(col("a"), col("b"), deterministicUDF(col("b")).as("c"))
+      checkNumUDFs(df, 3)
--- End diff --

The default value of `foldable` is false, which is why we see three expressions here.
[GitHub] spark pull request: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9276#issuecomment-153106415

Merged build triggered.
[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/9404#discussion_r43660128

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -353,4 +354,44 @@ class CachedTableSuite extends QueryTest with SharedSQLContext {
     assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3)
     assert(sparkPlan.collect { case e: PhysicalRDD => e }.size === 0)
   }
+
+  /**
+   * Verifies that the plan for `df` contains `expected` number of Exchange operators.
+   */
+  private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+    assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.size == expected)
+  }
+
+  test("A cached table preserves the partitioning and ordering of its cached SparkPlan") {
+    val table3x = testData.unionAll(testData).unionAll(testData)
+    table3x.registerTempTable("testData3x")
+
+    sql("SELECT key, value FROM testData3x ORDER BY key").registerTempTable("orderedTable")
+    sqlContext.cacheTable("orderedTable")
+    assertCached(sqlContext.table("orderedTable"))
+    // Should not have an exchange as the query is already sorted on the group by key.
+    verifyNumExchanges(sql("SELECT key, count(*) FROM orderedTable GROUP BY key"), 0)
+    checkAnswer(
+      sql("SELECT key, count(*) FROM orderedTable GROUP BY key ORDER BY key"),
+      sql("SELECT key, count(*) FROM testData3x GROUP BY key ORDER BY key").collect())
+    sqlContext.uncacheTable("orderedTable")
+
+    // Set up two tables distributed in the same way.
+    testData.distributeBy(Column("key") :: Nil, 5).registerTempTable("t1")
+    testData2.distributeBy(Column("a") :: Nil, 5).registerTempTable("t2")
+    sqlContext.cacheTable("t1")
+    sqlContext.cacheTable("t2")
+
+    // Joining them should result in no exchanges.
+    verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t2 t2 ON t1.key = t2.a"), 0)
+
+    // Grouping on the partition key should result in no exchanges
+    verifyNumExchanges(sql("SELECT count(*) FROM t1 GROUP BY key"), 0)
+
+    // TODO: this is an issue with self joins. The number of exchanges should be 0.
+    verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t1 t2 on t1.key = t2.key"), 1)
+
+    sqlContext.uncacheTable("t1")
+    sqlContext.uncacheTable("t2")
+  }
--- End diff --

How about we also check the results?
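Checking the results as well as the exchange count, as yhuai suggests, would go through `checkAnswer`, which compares result rows ignoring order. A simplified, Spark-free stand-in for that comparison (class and method names hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RowCompareSketch {
    // Order-insensitive comparison of two result sets, in the spirit of
    // Spark's QueryTest.checkAnswer. Simplified stand-in, not Spark's actual
    // code: rows are canonicalized to strings and sorted before comparing.
    static boolean sameRows(List<List<Object>> a, List<List<Object>> b) {
        List<String> sa = new ArrayList<>();
        List<String> sb = new ArrayList<>();
        for (List<Object> row : a) sa.add(row.toString());
        for (List<Object> row : b) sb.add(row.toString());
        sa.sort(Comparator.naturalOrder());
        sb.sort(Comparator.naturalOrder());
        return sa.equals(sb);
    }
}
```

Order-insensitivity matters here because the join output order is not guaranteed, while the GROUP BY / ORDER BY queries in the test pin an order explicitly.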