[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9405#issuecomment-153017266
  
**[Test build #44805 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44805/consoleFull)**
 for PR 9405 at commit 
[`78f1e95`](https://github.com/apache/spark/commit/78f1e959f6e9ad1272aef4e6beeac715c37a777b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Corr(`
   * `case class Corr(left: Expression, right: Expression)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9405#issuecomment-153017384
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44805/
Test PASSed.





[GitHub] spark pull request: [SPARK-10786][SQL]Take the whole statement to ...

2015-11-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8895





[GitHub] spark pull request: SPARK-11344: Made ApplicationDescription and D...

2015-11-02 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9299#issuecomment-153020674
  
Looking good to me. Let me leave it a bit for other input.





[GitHub] spark pull request: [SPARK-11450][SQL] Add Unsafe Row processing t...

2015-11-02 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/9414

[SPARK-11450][SQL] Add Unsafe Row processing to Expand

This PR enables the Expand operator to process and produce Unsafe Rows.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-11450

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9414


commit 312885ccda170b34477f26abaf51be3c4a28141a
Author: Herman van Hovell 
Date:   2015-11-02T14:13:46Z

Add unsafe row processing to Expand







[GitHub] spark pull request: [SPARK-10533] [SQL] handle scientific notation...

2015-11-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9085#discussion_r43633782
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala
 ---
@@ -102,8 +106,12 @@ class SqlLexical extends StdLexical {
   }
 
   override lazy val token: Parser[Token] =
-( identChar ~ (identChar | digit).* ^^
-  { case first ~ rest => processIdent((first :: rest).mkString) }
+( rep1(digit) ~ ('.' ~> digit.*).? ~ (exp ~> sign.? ~ rep1(digit)) ^^ {
+case i ~ None ~ (sig ~ rest) =>
+  DecimalLit(i.mkString + "e" + sig.getOrElse("") + rest.mkString)
--- End diff --

Nit: `sig.getOrElse("")` can be simplified to `sig.mkString`.
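For readers unfamiliar with the idiom: on an `Option`, `mkString` concatenates the zero-or-one contained elements into a string, so it collapses `None` to `""`, which is why it can stand in for `getOrElse("")` when building the literal text. A minimal standalone sketch (plain Scala, independent of the parser above):

```scala
object MkStringDemo extends App {
  // An optional sign character, as produced by a `sign.?` parser rule.
  val plus: Option[Char] = Some('+')
  val none: Option[Char] = None

  // Option.mkString renders the element if present, or "" if absent...
  assert(plus.mkString == "+")
  assert(none.mkString == "")

  // ...so `sig.getOrElse("")` and `sig.mkString` build the same string here.
  val digits = List('9', '0')
  println("1" + "e" + plus.mkString + digits.mkString) // prints 1e+90
}
```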





[GitHub] spark pull request: Spark-6373 Add SSL/TLS for the Netty based Blo...

2015-11-02 Thread turp1twin
GitHub user turp1twin opened a pull request:

https://github.com/apache/spark/pull/9416

Spark-6373 Add SSL/TLS for the Netty based BlockTransferService

Sorry if this pull request is premature, but I have received very little 
feedback, so I am going ahead and creating it. I am still open to 
comments/feedback and can continue to make changes if necessary. Here are some 
comments about my implementation...

*Configuration:*

I added a new SSLOptions member variable to SecurityManager.scala, 
specifically for configuring SSL for the Block Transfer Service:
{code:title=SecurityManager.scala|linenumbers=false|language=scala}
val btsSSLOptions = SSLOptions.parse(sparkConf, "spark.ssl.bts", 
Some(defaultSSLOptions))
{code}

I expanded the SSLOptions case class to capture additional SSL related 
parameters:
{code:title=SecurityManager.scala|linenumbers=false|language=scala}
private[spark] case class SSLOptions(
  enabled: Boolean = false,
  keyStore: Option[File] = None,
  keyStorePassword: Option[String] = None,
  privateKey: Option[File] = None,
  keyPassword: Option[String] = None,
  certChain: Option[File] = None,
  trustStore: Option[File] = None,
  trustStorePassword: Option[String] = None,
  trustStoreReloadingEnabled: Boolean = false,
  trustStoreReloadInterval: Int = 1,
  openSslEnabled: Boolean = false,
  protocol: Option[String] = None,
  enabledAlgorithms: Set[String] = Set.empty)
{code}

I added the ability to provide a standard Java keystore and truststore, as 
was possible with the existing file server and Akka SSL configurations 
available in SecurityManager.scala. When using a keystore/truststore, I also 
added the ability to enable truststore reloading (Hadoop's encrypted shuffle 
allows for this). In addition, I added the ability to specify an X.509 
certificate chain in PEM format and a PKCS#8 private key file in PEM format. If 
all four parameters are provided (keyStore, trustStore, privateKey, certChain), 
the privateKey and certChain parameters take precedence.

In TransportConf.java I added two additional configuration parameters:
{code:title=TransportConf.java|linenumbers=false|language=java}
  public int sslShuffleChunkSize() {
return conf.getInt("spark.shuffle.io.ssl.chunkSize", 60 * 1024);
  }

  public boolean sslShuffleEnabled() {
return conf.getBoolean("spark.ssl.bts.enabled", false);
  }
{code}

For the _"spark.shuffle.io.ssl.chunkSize"_ config param I set the default 
to the same size used in Hadoop's encrypted shuffle implementation.
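Putting the pieces above together, a hypothetical spark-defaults.conf fragment might look like the following. Only `spark.ssl.bts.enabled` and `spark.shuffle.io.ssl.chunkSize` appear verbatim in this patch; the exact spelling of the remaining `spark.ssl.bts.*` keys is an assumption derived from the SSLOptions case class fields listed above:

{code:title=spark-defaults.conf|linenumbers=false}
# Assumed key names, derived from the SSLOptions fields (keyStore,
# keyStorePassword, trustStore, trustStorePassword, ...)
spark.ssl.bts.enabled              true
spark.ssl.bts.keyStore             /etc/spark/ssl/keystore.jks
spark.ssl.bts.keyStorePassword     changeit
spark.ssl.bts.trustStore           /etc/spark/ssl/truststore.jks
spark.ssl.bts.trustStorePassword   changeit

# Chunk size in bytes; the patch defaults this to 60 * 1024 = 61440,
# matching Hadoop's encrypted shuffle
spark.shuffle.io.ssl.chunkSize     61440
{code}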

*Implementation:*

For this implementation, I opted to disrupt as little code as possible, 
meaning I wanted to avoid any major refactoring... Basically the 
TransportContext class handles the SSL setup internally based on settings in 
the passed TransportConf. This way none of the method signatures (i.e., 
createServer, etc.) had to change. I opted not to use the 
TransportClientBootstrap/TransportServerBootstrap interfaces as they were not a 
good fit. Basically the TransportClientBootstrap is called too late, as the 
client Netty pipeline for SSL needs to be set up earlier in the connection 
process. The TransportServerBootstrap could have been used, but IMO it would 
have been a bit hacky, as the doBootstrap method takes an RpcHandler and returns 
one, which is not needed in the case of SSL bootstrapping. Also, using only the 
TransportServerBootstrap and not the TransportClientBootstrap would have made 
its usage seem inconsistent.

Anyway, these are just some initial comments about the implementation. I am 
definitely looking for feedback... If someone has a better alternative I am all 
for it; I just wanted to get something working with minimally invasive changes to 
the codebase... This is a pretty important feature for my company, as we are in 
the healthcare space and are required to maintain HIPAA compliance (data encrypted 
at rest and in transit). Thanks!

Jeff


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/turp1twin/spark SPARK-6373

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9416.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9416


commit fd2980ab8cc1fc5b4626bb7a0d1e94128ca3874d
Author: turp1twin 
Date:   2015-10-31T20:26:14Z

Merged ssl-shuffle-latest

commit a7f915aecea4492d9a41b7310eb465cb32d7ef14
Author: turp1twin 
Date:   2015-11-01T20:50:09Z

Added new SSL Netty Shuffle test and SSL YarnShuffleService test, cleaned 
up merge issue in TransportServer.




[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9417#issuecomment-153059875
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-11448][SQL] Skip caching part-files in ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9405#issuecomment-153017380
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153050470
  
**[Test build #44813 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)**
 for PR 9415 at commit 
[`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-153050399
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153050637
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/
Test PASSed.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153050635
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

2015-11-02 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/9417

[SPARK-11449][Core] PortableDataStream should be a factory

```PortableDataStream``` maintains some internal state. This makes it 
tricky to reuse a stream (one needs to call ```close``` on both the 
```PortableDataStream``` and the ```InputStream``` it produces).

This PR removes all state from ```PortableDataStream``` and effectively 
turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the 
user responsible for managing the ```InputStream``` it returns.

cc @srowen 
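As a hedged sketch of what the factory-style contract implies for callers (illustrative only; it assumes a live `SparkContext` and a `PortableDataStream.open()` factory method, and is not runnable standalone):

```scala
// Illustrative only: with a stateless PortableDataStream, each call to the
// factory produces a fresh InputStream that the *caller* must close.
sc.binaryFiles("hdfs://path/to/files").mapValues { pds =>
  val in = pds.open()   // factory call: returns a fresh InputStream
  try {
    in.read()           // consume the stream as needed
  } finally {
    in.close()          // caller-managed lifecycle; no close() on pds itself
  }
}
```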

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-11449

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9417


commit 91b8c6cc86c83df161e176c7a4efbc3dd439d037
Author: Herman van Hovell 
Date:   2015-11-02T15:42:44Z

Removed state from PortableDataStream







[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153012486
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-10786][SQL]Take the whole statement to ...

2015-11-02 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/8895#issuecomment-153014703
  
LGTM. Thanks for fixing this! Merging to master.





[GitHub] spark pull request: [SPARK-11450][SQL] Add Unsafe Row processing t...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9414#issuecomment-153028650
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...

2015-11-02 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/9413#issuecomment-153031528
  
In the current implementation we provide ```Std. Error``` for the 
```coefficients``` except the ```intercept```, because we use an optimized 
method to calculate the ```intercept```. If we want to calculate ```Std. Error``` 
for the ```intercept```, we need to concat the ```aBar``` array with ```aaBar.values``` 
like
```scala
val newAtA = Array.concat(aaBar.values, summary.aBar.toArray, Array(1.0))
val newAtB = Array.concat(abBar.values, Array(bBar))

val xWithIntercept = CholeskyDecomposition.solve(newAtA, newAtB)
val newAtAi = CholeskyDecomposition.inverse(newAtA, summary.k)
```
I'm afraid this would cause performance degradation, so I propose outputting 
```Std. Error``` only for the ```coefficients```. Maybe we should discuss this here, or 
figure out a better way to output ```Std. Error``` for the ```intercept```. @mengxr






[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...

2015-11-02 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/9398#issuecomment-153050315
  
retest this please





[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-153050402
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44815/
Test FAILed.





[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-153050394
  
**[Test build #44815 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44815/consoleFull)**
 for PR 7753 at commit 
[`a8fcf74`](https://github.com/apache/spark/commit/a8fcf74752e2bfed697280d00383b137b1994ae7).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class ExecutorMetrics extends Serializable `
   * `case class TransportMetrics(`
   * `class MemoryListener extends SparkListener `
   * `class MemoryUIInfo `
   * `class TransportMemSize `
   * `case class MemTime(memorySize: Long = 0L, timeStamp: Long = 0L)`
   * `case class Corr(`
   * `case class Corr(left: Expression, right: Expression)`





[GitHub] spark pull request: [SPARK-11455][SQL] fix case sensitivity of par...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9410#issuecomment-153030063
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44806/
Test PASSed.






[GitHub] spark pull request: [SPARK-10533] [SQL] handle scientific notation...

2015-11-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9085#discussion_r43633899
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -169,6 +169,26 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(
   testData.filter("key > 90"),
   testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+checkAnswer(
+  testData.filter("key > 9.0e1"),
+  testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+checkAnswer(
+  testData.filter("key > .9e+2"),
+  testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+checkAnswer(
+  testData.filter("key > 0.9e+2"),
+  testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+checkAnswer(
+  testData.filter("key > 900e-1"),
+  testData.collect().filter(_.getInt(0) > 90).toSeq)
+
+checkAnswer(
+  testData.filter("key > 900.0E-1"),
+  testData.collect().filter(_.getInt(0) > 90).toSeq)
--- End diff --

Could you please add a case for literals like `9.e+1`? This should be 
accepted according to the defined parser rules.





[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-153048340
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153050846
  
**[Test build #44814 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44814/consoleFull)**
 for PR 9399 at commit 
[`7c17dd1`](https://github.com/apache/spark/commit/7c17dd1adf1d6134a67a07a73f3ffe56d713b6c9).





[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...

2015-11-02 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/9398#issuecomment-153057423
  
Thanks @andrewor14 , I'll look into this soon!





[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9413#issuecomment-153029589
  
**[Test build #44812 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44812/consoleFull)**
 for PR 9413 at commit 
[`655fb43`](https://github.com/apache/spark/commit/655fb436950e44e1783a2bc3767e40a0295ce83f).





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153040844
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153043608
  
**[Test build #44813 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44813/consoleFull)**
 for PR 9415 at commit 
[`788c1a1`](https://github.com/apache/spark/commit/788c1a1675b1470d09175cdf31f2131e8d4767ac).





[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

2015-11-02 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9417#issuecomment-153059923
  
@kmader what do you think of this one?





[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9408#issuecomment-153013113
  
**[Test build #44804 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44804/consoleFull)**
 for PR 9408 at commit 
[`b1512b0`](https://github.com/apache/spark/commit/b1512b0bcb80d5621f43954989403a85fdab0960).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Corr(`
   * `case class Corr(left: Expression, right: Expression)`
   * `case class RepartitionByExpression(`
   * `logInfo(s"Hive class not found $e")`
   * `logDebug("Hive class not found", e)`





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153012488
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/
Test FAILed.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153013074
  
**[Test build #44810 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/consoleFull)**
 for PR 9412 at commit 
[`7e07efe`](https://github.com/apache/spark/commit/7e07efe701cf9dffaaf8411b108bdd2b3ca99f91).





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153012481
  
**[Test build #44811 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/consoleFull)**
 for PR 9399 at commit 
[`b658aaa`](https://github.com/apache/spark/commit/b658aaa8f5221ba55c1acbba21e496cc40ff6f45).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/9415

[SPARK-11458][SQL] add word count example for Dataset



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark wordcount

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9415


commit 788c1a1675b1470d09175cdf31f2131e8d4767ac
Author: Wenchen Fan 
Date:   2015-11-02T14:40:52Z

word count example for Dataset







[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...

2015-11-02 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9413#discussion_r43634982
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -471,6 +484,59 @@ class LinearRegressionSummary private[regression] (
 predictions.select(t(col(predictionCol), col(labelCol)).as("residuals"))
   }
 
+  lazy val numInstances: Long = predictions.count()
+
+  lazy val dfe = if (model.getFitIntercept) {
+    numInstances - model.weights.size - 1
+  } else {
+    numInstances - model.weights.size
+  }
+
+  lazy val devianceResiduals: Array[Double] = {
+    val weighted = if (model.getWeightCol.isEmpty) lit(1.0) else sqrt(col(model.getWeightCol))
+    val dr = predictions.select(
+        col(model.getLabelCol).minus(col(model.getPredictionCol))
+          .multiply(weighted).as("weightedResiduals"))
+      .select(min(col("weightedResiduals")).as("min"), max(col("weightedResiduals")).as("max"))
+      .take(1)(0)
+    Array(dr.getDouble(0), dr.getDouble(1))
--- End diff --

DataFrame currently does not provide an interface for calculating percentiles 
(only the Hive UDAF), so for now we only provide the min and max of the 
deviance residuals. 
[SPARK-9299](https://issues.apache.org/jira/browse/SPARK-9299) is adding 
```percentile``` and ```percentile_approx``` aggregate functions; once it is 
resolved we can report deviance residuals at the 0.25, 0.5, and 0.75 quantiles.
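
Once the residual values can be collected, the quantiles mentioned above could be computed locally. A rough sketch — a hypothetical helper using the nearest-rank method, not part of this PR or of SPARK-9299:

```scala
// Hypothetical helper: nearest-rank quantiles over collected residuals.
object ResidualQuantiles {
  // q in [0, 1]; returns the nearest-rank element of the sorted sample.
  def quantile(sorted: Array[Double], q: Double): Double = {
    require(sorted.nonEmpty && q >= 0.0 && q <= 1.0)
    val idx =
      math.min(sorted.length - 1, math.ceil(q * sorted.length).toInt - 1).max(0)
    sorted(idx)
  }

  // min, Q1, median, Q3, max -- the five-number summary R prints.
  def summary(residuals: Array[Double]): Array[Double] = {
    val s = residuals.sorted
    Array(s.head, quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s.last)
  }
}
```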





[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9411#issuecomment-153028631
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44809/
Test PASSed.





[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9411#issuecomment-153028628
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153028669
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/
Test PASSed.





[GitHub] spark pull request: [SPARK-11456] [TESTS] Remove deprecated junit....

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9411#issuecomment-153028484
  
**[Test build #44809 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44809/consoleFull)**
 for PR 9411 at commit 
[`058d824`](https://github.com/apache/spark/commit/058d8245467aa36225789b9d0cef8a95e3d970d5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153028668
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9415#issuecomment-153040919
  
Merged build started.





[GitHub] spark pull request: Spark-6373 Add SSL/TLS for the Netty based Blo...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9416#issuecomment-153058457
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-11449][Core] PortableDataStream should ...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9417#issuecomment-153060623
  
**[Test build #1967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1967/consoleFull)**
 for PR 9417 at commit 
[`91b8c6c`](https://github.com/apache/spark/commit/91b8c6cc86c83df161e176c7a4efbc3dd439d037).





[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9408#issuecomment-153013228
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11453][SQL] append data to partitioned ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9408#issuecomment-153013229
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44804/
Test PASSed.





[GitHub] spark pull request: [SPARK-9836] [ML] Provide R-like summary stati...

2015-11-02 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/9413

[SPARK-9836] [ML] Provide R-like summary statistics for OLS via normal 
equation solver

https://issues.apache.org/jira/browse/SPARK-9836

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-9836

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9413


commit 655fb436950e44e1783a2bc3767e40a0295ce83f
Author: Yanbo Liang 
Date:   2015-11-02T14:07:56Z

Provide R-like summary statistics for OLS via normal equation solver







[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153028495
  
**[Test build #44810 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44810/consoleFull)**
 for PR 9412 at commit 
[`7e07efe`](https://github.com/apache/spark/commit/7e07efe701cf9dffaaf8411b108bdd2b3ca99f91).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11455][SQL] fix case sensitivity of par...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9410#issuecomment-153029895
  
**[Test build #44806 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44806/consoleFull)**
 for PR 9410 at commit 
[`0f552c4`](https://github.com/apache/spark/commit/0f552c42926bfbd368eb9e19cc93ed625cd67d7f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11311] [SQL] spark cannot describe temp...

2015-11-02 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9277#issuecomment-153044589
  
LGTM, merging to master. Thanks!





[GitHub] spark pull request: [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-910...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7753#issuecomment-153048362
  
Merged build started.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153048329
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153048363
  
Merged build started.





[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9398#issuecomment-153051220
  
Merged build started.





[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9398#issuecomment-153051172
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153011716
  
Merged build started.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153011717
  
Merged build started.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/9412

[SPARK-11457][Streaming][YARN] Fix incorrect AM proxy filter conf recovery 
from checkpoint

Currently the YARN AM proxy filter configuration is recovered from the 
checkpoint file when a Spark Streaming application is restarted, which leads 
to some unwanted behaviors:

1. A wrong RM address if the RM is redeployed after a failure.
2. A wrong proxyBase, since the app id changes on restart and the proxyBase 
derived from the old app id is stale.

So instead of recovering these configurations from the checkpoint file, they 
should be reloaded each time the application starts.

This problem only exists in YARN cluster mode; in YARN client mode these 
configurations are updated via the RPC message `AddWebUIFilter`.

Please help to review @tdas @harishreedharan @vanzin, thanks a lot.
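
The idea can be sketched as a pure function over the restored properties. This is an illustrative sketch only — the AmIpFilter key names below are assumptions for this sketch, and the real fix reloads the filter configuration on each application start rather than filtering a map:

```scala
// Illustrative only: drop stale proxy-filter entries restored from a
// checkpoint so the new application attempt repopulates them. The key
// names here are assumed for the sketch, not taken from the PR.
def stripAmFilterConf(restored: Map[String, String]): Map[String, String] = {
  val amFilterKeys = Set(
    "spark.ui.filters",
    "spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS",
    "spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES")
  restored.filter { case (k, _) => !amFilterKeys.contains(k) }
}
```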

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-11457

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9412.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9412


commit 7e07efe701cf9dffaaf8411b108bdd2b3ca99f91
Author: jerryshao 
Date:   2015-11-02T12:30:35Z

Fix Spark Streaming checkpoint with Yarn-cluster configuration recovery 
issue







[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153011694
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153011679
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153012036
  
**[Test build #44811 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44811/consoleFull)**
 for PR 9399 at commit 
[`b658aaa`](https://github.com/apache/spark/commit/b658aaa8f5221ba55c1acbba21e496cc40ff6f45).





[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...

2015-11-02 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/8643#discussion_r43657655
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.mllib.api.python
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.mllib.clustering.LDAModel
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+/**
+ * Wrapper around LDAModel to provide helper methods in Python
+ */
+private[python] class LDAModelWrapper(model: LDAModel) {
+
+  def topicsMatrix(): Matrix = model.topicsMatrix
+
+  def vocabSize(): Int = model.vocabSize
+
+  def describeTopics(jsc: JavaSparkContext): DataFrame = 
describeTopics(this.model.vocabSize, jsc)
+
+  def describeTopics(maxTermsPerTopic: Int, jsc: JavaSparkContext): 
DataFrame = {
+// Since the return value of `describeTopics` is a little complicated,
+// it is converted into `Row` to take advantage of DataFrame 
serialization.
+val sqlContext = new SQLContext(jsc.sc)
+val topics = model.describeTopics(maxTermsPerTopic)
+sqlContext.createDataFrame(topics).toDF("terms", "termWeights")
--- End diff --

Serializing a DataFrame will trigger a Spark job; we could still use Pickle 
to serialize the topics without a DataFrame, via `PythonMLLibAPI.dumps()`.
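A minimal sketch of the suggestion above (illustrative only, not Spark's actual API): instead of building a DataFrame just to ship the topic descriptions to Python, pickle the `(terms, termWeights)` pairs directly, which avoids triggering a Spark job.

```python
# Illustrative: pickling the describeTopics result directly instead of
# routing it through a DataFrame. The `topics` value below is made up.
import pickle

topics = [([1, 2, 3], [0.5, 0.3, 0.2]),
          ([4, 5], [0.7, 0.3])]

payload = pickle.dumps(topics)          # what a dumps()-style helper would send
assert pickle.loads(payload) == topics  # Python side recovers the same structure
```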





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9404#issuecomment-153099394
  
ok to test





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9404#issuecomment-153099409
  
test this please





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9404#issuecomment-153099382
  
add to whitelist





[GitHub] spark pull request: [SPARK-9162][SQL] Implement code generation fo...

2015-11-02 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/9270#discussion_r43658099
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
 ---
@@ -959,6 +963,122 @@ case class ScalaUDF(
   }
   }
 
+  // Generates code used to convert the arguments to Scala types for 
user-defined functions
+  private[this] def genCodeForConverter(ctx: CodeGenContext, index: Int): 
String  = {
+val converterClassName = classOf[Any => Any].getName
+val typeConvertersClassName = CatalystTypeConverters.getClass.getName 
+ ".MODULE$"
+val expressionClassName = classOf[Expression].getName
+val scalaUDFClassName = classOf[ScalaUDF].getName
+
+val converterTerm = ctx.freshName("converter" + index.toString)
--- End diff --

Why `index` here? Every `freshName` call already returns a unique name. 
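A hedged sketch of the point being made (a hypothetical fresh-name generator, not Spark's actual `CodeGenContext`): a shared counter makes every returned name unique on its own, so appending an extra `index` suffix is redundant.

```python
# Hypothetical freshName-style generator: a single counter guarantees
# uniqueness even when the same prefix is requested repeatedly.
import itertools

class CodeGenContext:
    def __init__(self):
        self._counter = itertools.count()

    def fresh_name(self, prefix):
        # Each call consumes the next counter value.
        return f"{prefix}{next(self._counter)}"

ctx = CodeGenContext()
names = [ctx.fresh_name("converter") for _ in range(3)]
print(names)  # ['converter0', 'converter1', 'converter2']
```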





[GitHub] spark pull request: SPARK-11371 Make "mean" an alias for "avg" ope...

2015-11-02 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9332#issuecomment-153103900
  
@ted-yu Can you modify a test to use this alias?





[GitHub] spark pull request: [SPARK-11457][Streaming][YARN] Fix incorrect A...

2015-11-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/9412#issuecomment-153115388
  
LGTM as far as I understand this code.





[GitHub] spark pull request: [SPARK-10997] [core] Add "client mode" to nett...

2015-11-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/9210#issuecomment-153116416
  
Merging this to master.





[GitHub] spark pull request: [SPARK-10622] [core] [yarn] Differentiate dead...

2015-11-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/8887#issuecomment-153117330
  
Hi @kayousterhout, are you OK with my explanation above? I'd like to get 
this in soon.





[GitHub] spark pull request: SPARK-11420 Updating Stddev support via Impera...

2015-11-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9380#discussion_r43662570
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -1135,7 +992,76 @@ abstract class CentralMomentAgg(child: Expression) 
extends ImperativeAggregate w
   moments(4) = buffer.getDouble(fourthMomentOffset)
 }
 
-getStatistic(n, mean, moments)
+if (n == 0.0) null
+else if (n == 1.0) 0.0
--- End diff --

I don't believe we want this behavior, since these edge cases should be 
handled in the `getStatistic` implementation. As you can see in the [previous 
PR](https://github.com/apache/spark/pull/9003), we established that `Skewness` 
and `Kurtosis` should yield `Double.NaN` when `n == 1.0`, but other functions 
like `VariancePop` should yield 0.0. 
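The edge-case contract described above can be sketched as follows (the function names are illustrative, not Spark's actual API): skewness yields `NaN` for constant data, including `n == 1`, while population variance yields `0.0`; both yield null/None when `n == 0`.

```python
# Hedged sketch of the per-statistic edge cases, computed from the central
# moments n, m2 (second moment) and m3 (third moment).
import math

def variance_pop(n, m2):
    if n == 0.0:
        return None        # no rows: null
    return m2 / n          # a single observation gives m2 == 0.0, hence 0.0

def skewness(n, m2, m3):
    if n == 0.0:
        return None        # no rows: null
    if m2 == 0.0:
        return float("nan")  # undefined for constant data, including n == 1
    return math.sqrt(n) * m3 / m2 ** 1.5

assert variance_pop(1.0, 0.0) == 0.0
assert math.isnan(skewness(1.0, 0.0, 0.0))
```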





[GitHub] spark pull request: [SPARK-10622] [core] [yarn] Differentiate dead...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8887#issuecomment-153117844
  
Merged build started.





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9404#discussion_r43662693
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -353,4 +354,44 @@ class CachedTableSuite extends QueryTest with 
SharedSQLContext {
 assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e 
}.size === 3)
 assert(sparkPlan.collect { case e: PhysicalRDD => e }.size === 0)
   }
+
+  /**
+   * Verifies that the plan for `df` contains `expected` number of 
Exchange operators.
+   */
+  private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+assert(df.queryExecution.executedPlan.collect { case e: Exchange => e 
}.size == expected)
+  }
+
+  test("A cached table preserves the partitioning and ordering of its 
cached SparkPlan") {
+val table3x = testData.unionAll(testData).unionAll(testData)
+table3x.registerTempTable("testData3x")
+
+sql("SELECT key, value FROM testData3x ORDER BY 
key").registerTempTable("orderedTable")
+sqlContext.cacheTable("orderedTable")
+assertCached(sqlContext.table("orderedTable"))
+// Should not have an exchange as the query is already sorted on the 
group by key.
+verifyNumExchanges(sql("SELECT key, count(*) FROM orderedTable GROUP 
BY key"), 0)
+checkAnswer(
+  sql("SELECT key, count(*) FROM orderedTable GROUP BY key ORDER BY 
key"),
+  sql("SELECT key, count(*) FROM testData3x GROUP BY key ORDER BY 
key").collect())
+sqlContext.uncacheTable("orderedTable")
+
+// Set up two tables distributed in the same way.
+testData.distributeBy(Column("key") :: Nil, 5).registerTempTable("t1")
+testData2.distributeBy(Column("a") :: Nil, 5).registerTempTable("t2")
+sqlContext.cacheTable("t1")
+sqlContext.cacheTable("t2")
+
+// Joining them should result in no exchanges.
+verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t2 t2 ON t1.key = 
t2.a"), 0)
--- End diff --

ah, it seems partitioning the data into `5` partitions does the trick here 
(the default parallelism is set to 5 in our tests). If you change it to 
something like `10`, this test will fail... Unfortunately, we do not have the 
concept of equivalence classes right now. So, at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L229-L240,
 the `allCompatible` method does not really do what we want here (btw, 
`allCompatible` is trying to make sure that the partitioning schemes of all 
children are compatible with each other, i.e. that they partition the data 
with the same partitioner and the same number of partitions).
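The `allCompatible` idea can be sketched like this (an illustrative model, not Spark's code): all children are compatible only if they hash-partition on the same keys with the same number of partitions. Because `t1` is partitioned on `key` and `t2` on `a`, a simple equality check cannot see that the join condition `t1.key = t2.a` makes them equivalent, which is exactly where an equivalence-class analysis would be needed.

```python
# Illustrative model of partitioning compatibility.
from dataclasses import dataclass

@dataclass(frozen=True)
class HashPartitioning:
    keys: tuple          # the partitioning columns
    num_partitions: int

def all_compatible(children):
    # Every child must use the same partitioner and partition count;
    # comparing against the first child is enough.
    first = children[0]
    return all(c == first for c in children[1:])

t1 = HashPartitioning(("key",), 5)
t2 = HashPartitioning(("a",), 5)
print(all_compatible([t1, t2]))  # False: keys differ, despite t1.key == t2.a
```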





[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153119257
  
Merged build started.





[GitHub] spark pull request: [SPARK-10997] [core] Add "client mode" to nett...

2015-11-02 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/9210#issuecomment-153119171
  
Just took another look. LGTM





[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153121113
  
**[Test build #44820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)**
 for PR 9383 at commit 
[`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84).





[GitHub] spark pull request: [SPARK-10827] [CORE] AppClient should not use ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9317#issuecomment-153121675
  
Merged build started.





[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153121656
  
**[Test build #44820 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/consoleFull)**
 for PR 9383 at commit 
[`53dbdf2`](https://github.com/apache/spark/commit/53dbdf2d4c8c547e6bd50a589bf0223e7ce95e84).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153121664
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44820/
Test FAILed.





[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9415#discussion_r43648991
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.sql
+
+import org.apache.spark.sql.{Dataset, SQLContext}
+import org.apache.spark.{SparkContext, SparkConf}
+
+object DatasetWordCount {
+  def main(args: Array[String]): Unit = {
+val sparkConf = new SparkConf().setAppName("DatasetWordCount")
+val sc = new SparkContext(sparkConf)
+val sqlContext = new SQLContext(sc)
+
+// Importing the SQL context gives access to all the SQL functions and 
implicit conversions.
+import sqlContext.implicits._
+
+val lines: Dataset[String] = Seq("hello world", "say hello to the 
world").toDS()
+val words: Dataset[(String, Int)] = lines.flatMap(_.split(" 
")).map(word => word -> 1)
+val counts: Dataset[(String, Int)] = words.groupBy(_._1).mapGroups {
+  case (word, iter) => Iterator(word -> iter.length)
+}
+
+counts.foreach { case (word, count) => println(s"$word: $count") }
--- End diff --

We should `collect()` here so this doesn't run on the executors.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153095366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44814/
Test PASSed.





[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9399#issuecomment-153095363
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...

2015-11-02 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/8643#discussion_r43658405
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.mllib.api.python
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.mllib.clustering.LDAModel
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+/**
+ * Wrapper around LDAModel to provide helper methods in Python
+ */
+private[python] class LDAModelWrapper(model: LDAModel) {
+
+  def topicsMatrix(): Matrix = model.topicsMatrix
+
+  def vocabSize(): Int = model.vocabSize
+
+  def describeTopics(jsc: JavaSparkContext): DataFrame = 
describeTopics(this.model.vocabSize, jsc)
+
+  def describeTopics(maxTermsPerTopic: Int, jsc: JavaSparkContext): 
DataFrame = {
+// Since the return value of `describeTopics` is a little complicated,
+// it is converted into `Row` to take advantage of DataFrame 
serialization.
+val sqlContext = new SQLContext(jsc.sc)
+val topics = model.describeTopics(maxTermsPerTopic)
+sqlContext.createDataFrame(topics).toDF("terms", "termWeights")
--- End diff --

@davies thanks for the comment. Should we rather use `PythonMLLibAPI.dumps()` 
than Java Any types like below?

https://github.com/yu-iskw/spark/commit/e1c66d050f7c4edbe1bf4e3b57b145cc62c23630#diff-71f42172be0b5fc14827b7bb31f4e80bR34





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9404#issuecomment-153100427
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9404#issuecomment-153100484
  
Merged build started.





[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...

2015-11-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9392#issuecomment-153104431
  
**[Test build #1969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1969/consoleFull)**
 for PR 9392 at commit 
[`a7c395f`](https://github.com/apache/spark/commit/a7c395f9fe7bf43b1e63af060b425fa6047b25f9).





[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...

2015-11-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/9392#issuecomment-153104500
  
LGTM





[GitHub] spark pull request: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9276#issuecomment-153106533
  
Merged build started.





[GitHub] spark pull request: [SPARK-9817][YARN] Improve the locality calcul...

2015-11-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/8100#issuecomment-153115647
  
Merging to master.





[GitHub] spark pull request: [SPARK-11425] Improve Hybrid aggregation

2015-11-02 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/9383#issuecomment-153118810
  
After some benchmarking, I realized that using the hashcode as the sort prefix in 
timsort causes a regression in both timsort and Snappy compression (especially for 
aggregation after a join, where the order of records becomes random). I will revert 
that part.

benchmark code:
```
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
N = 1<<25
M = 1<<20
df = sqlContext.range(N).selectExpr("id", "repeat(id, 2) as s")
df.show()
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.s).max("id", "id2")
n = j.count()
```

Another interesting finding is that Snappy slows down spilling by 50% of 
end-to-end time. LZ4 is faster than Snappy, but still 10% slower than no 
compression. Should we use `false` as the default value for 
`spark.shuffle.spill.compress`? (PS: tested on a Mac with an SSD; it may not hold 
for spinning disks.)







[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...

2015-11-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9214#discussion_r43664109
  
--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
@@ -121,13 +125,22 @@ public BypassMergeSortShuffleWriter(
   }
 
   @Override
-  public void write(Iterator<Product2<K, V>> records) throws IOException {
+  public Seq> write(Iterator<Product2<K, V>> records) throws IOException {
 assert (partitionWriters == null);
+final File indexFile = shuffleBlockResolver.getIndexFile(shuffleId, mapId);
+final File dataFile = shuffleBlockResolver.getDataFile(shuffleId, mapId);
 if (!records.hasNext()) {
   partitionLengths = new long[numPartitions];
-  shuffleBlockResolver.writeIndexFile(shuffleId, mapId, partitionLengths);
+  final File tmpIndexFile = shuffleBlockResolver.writeIndexFile(shuffleId, mapId, partitionLengths);
   mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
-  return;
+  // create empty data file so we always commit same set of shuffle output files, even if
+  // data is non-deterministic
+  final File tmpDataFile = blockManager.diskBlockManager().createTempShuffleBlock()._2();
+  tmpDataFile.createNewFile();
--- End diff --

Check the return value? This method doesn't throw on error.
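For context, a minimal standalone sketch of the pitfall being pointed out (plain `java.io.File`, independent of the Spark code above): `File.createNewFile()` signals failure through its boolean return value rather than by throwing, so ignoring the result silently swallows errors.

```java
import java.io.File;
import java.io.IOException;

public class CreateNewFileDemo {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("shuffle-demo", ".data");
        // createNewFile() returns false (it does not throw) when the file
        // already exists or could not be created atomically.
        boolean created = tmp.createNewFile();
        if (!created) {
            System.out.println("createNewFile returned false; handle the error");
        }
        tmp.delete();
    }
}
```

Here the file was just created by `createTempFile`, so the second attempt returns `false` and the warning branch runs; without the check, nothing would indicate the failure.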





[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...

2015-11-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9214#discussion_r43664703
  
--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java ---
@@ -121,13 +125,22 @@ public BypassMergeSortShuffleWriter(
   }
 
   @Override
-  public void write(Iterator<Product2<K, V>> records) throws IOException {
+  public Seq> write(Iterator<Product2<K, V>> records) throws IOException {
--- End diff --

Could you add a javadoc explaining what the return value is? It's 
particularly cryptic because it uses tuples; maybe it would be better to 
create a helper type where the fields have proper names.
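The helper-type suggestion can be sketched as follows; `ShuffleOutput` and its fields are hypothetical names for illustration, not what the PR actually uses:

```java
import java.io.File;

// Illustrative only: a tiny value class with named fields reads better at
// call sites than a Tuple2, since out.blockId says more than tuple._1().
public class ShuffleOutput {
    public final String blockId;  // stand-in for Spark's real BlockId type
    public final File file;

    public ShuffleOutput(String blockId, File file) {
        this.blockId = blockId;
        this.file = file;
    }

    public static void main(String[] args) {
        ShuffleOutput out = new ShuffleOutput("shuffle_0_1_2", new File("shuffle_0_1_2.data"));
        System.out.println(out.blockId + " -> " + out.file.getName());
        // -> shuffle_0_1_2 -> shuffle_0_1_2.data
    }
}
```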





[GitHub] spark pull request: [SPARK-11437] [PySpark] Don't .take when conve...

2015-11-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9392





[GitHub] spark pull request: [SPARK-8029][core] first successful shuffle ta...

2015-11-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9214#discussion_r43664936
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/FileShuffleBlockResolver.scala ---
@@ -132,6 +134,15 @@ private[spark] class FileShuffleBlockResolver(conf: SparkConf)
 logWarning(s"Error deleting ${file.getPath()}")
   }
 }
+for (mapId <- state.completedMapTasks.asScala) {
+  val mapStatusFile =
+blockManager.diskBlockManager.getFile(ShuffleMapStatusBlockId(shuffleId, mapId))
+  if (mapStatusFile.exists()) {
+if (!mapStatusFile.delete()) {
--- End diff --

nit: you could merge the two `if`s.
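The nit is just collapsing the nested conditionals with `&&`; a standalone sketch (in Java rather than Scala, for brevity) of why the behavior is identical: `&&` short-circuits, so `delete()` is only attempted when the file exists.

```java
import java.io.File;

public class MergedIfDemo {
    public static void main(String[] args) {
        File mapStatusFile = new File("definitely-missing-demo-file.tmp");

        // Nested form, as in the diff:
        if (mapStatusFile.exists()) {
            if (!mapStatusFile.delete()) {
                System.out.println("Error deleting " + mapStatusFile.getPath());
            }
        }

        // Merged form: short-circuit && gives the same behavior in one condition.
        if (mapStatusFile.exists() && !mapStatusFile.delete()) {
            System.out.println("Error deleting " + mapStatusFile.getPath());
        }
    }
}
```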





[GitHub] spark pull request: [SPARK-11438] [SQL] Allow users to define nond...

2015-11-02 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9393#discussion_r43657874
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -191,4 +193,86 @@ class UDFSuite extends QueryTest with SharedSQLContext {
 // pass a decimal to intExpected.
 assert(sql("SELECT intExpected(1.0)").head().getInt(0) === 1)
   }
+
+  private def checkNumUDFs(df: DataFrame, expectedNumUDFs: Int): Unit = {
+val udfs = df.queryExecution.optimizedPlan.collect {
+  case p: logical.Project => p.projectList.flatMap {
+case e => e.collect {
+  case udf: ScalaUDF => udf
+}
+  }
+}.flatMap(functions => functions)
+assert(udfs.length === expectedNumUDFs)
+  }
+
+  test("nondeterministic udf: using UDFRegistration") {
+import org.apache.spark.sql.functions._
+
+val deterministicUDF = sqlContext.udf.register("plusOne1", (x: Int) => x + 1)
+val nondeterministicUDF = deterministicUDF.nonDeterministic
+sqlContext.udf.register("plusOne2", nondeterministicUDF)
+
+{
+  val df = sql("SELECT 1 as a")
+.select(col("a"), deterministicUDF(col("a")).as("b"))
+.select(col("a"), col("b"), deterministicUDF(col("b")).as("c"))
+  checkNumUDFs(df, 3)
--- End diff --

The default value of `foldable` is false, which is why we see three 
expressions here.





[GitHub] spark pull request: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add ...

2015-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9276#issuecomment-153106415
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5354] [SQL] Cached tables should preser...

2015-11-02 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9404#discussion_r43660128
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -353,4 +354,44 @@ class CachedTableSuite extends QueryTest with SharedSQLContext {
 assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3)
 assert(sparkPlan.collect { case e: PhysicalRDD => e }.size === 0)
   }
+
+  /**
+   * Verifies that the plan for `df` contains `expected` number of Exchange operators.
+   */
+  private def verifyNumExchanges(df: DataFrame, expected: Int): Unit = {
+assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.size == expected)
+  }
+
+  test("A cached table preserves the partitioning and ordering of its cached SparkPlan") {
+val table3x = testData.unionAll(testData).unionAll(testData)
+table3x.registerTempTable("testData3x")
+
+sql("SELECT key, value FROM testData3x ORDER BY key").registerTempTable("orderedTable")
+sqlContext.cacheTable("orderedTable")
+assertCached(sqlContext.table("orderedTable"))
+// Should not have an exchange as the query is already sorted on the group by key.
+verifyNumExchanges(sql("SELECT key, count(*) FROM orderedTable GROUP BY key"), 0)
+checkAnswer(
+  sql("SELECT key, count(*) FROM orderedTable GROUP BY key ORDER BY key"),
+  sql("SELECT key, count(*) FROM testData3x GROUP BY key ORDER BY key").collect())
+sqlContext.uncacheTable("orderedTable")
+
+// Set up two tables distributed in the same way.
+testData.distributeBy(Column("key") :: Nil, 5).registerTempTable("t1")
+testData2.distributeBy(Column("a") :: Nil, 5).registerTempTable("t2")
+sqlContext.cacheTable("t1")
+sqlContext.cacheTable("t2")
+
+// Joining them should result in no exchanges.
+verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t2 t2 ON t1.key = t2.a"), 0)
+
+// Grouping on the partition key should result in no exchanges.
+verifyNumExchanges(sql("SELECT count(*) FROM t1 GROUP BY key"), 0)
+
+// TODO: this is an issue with self joins. The number of exchanges should be 0.
+verifyNumExchanges(sql("SELECT * FROM t1 t1 JOIN t1 t2 on t1.key = t2.key"), 1)
+
+sqlContext.uncacheTable("t1")
+sqlContext.uncacheTable("t2")
+  }
--- End diff --

How about we also check the results?




