[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-47620833
  
Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47621874
  
I added some `synchronized` blocks; please check whether it is thread safe. 
Jenkins should test this once more.




[GitHub] spark pull request: SPARK-2332 [build] add exclusion for old servl...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1271#issuecomment-47621924
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16278/




[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...

2014-07-01 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-47621967
  
Hey @YanjieGao, you need to resolve all those conflicts by hand first and 
then `git add` the changed files. These two interactive online Git tutorials 
might be very helpful :-)

1. http://pcottle.github.io/learnGitBranching/
2. https://try.github.io/levels/1/challenges/1

And thanks for working on this!




[GitHub] spark pull request: SPARK-2332 [build] add exclusion for old servl...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1271#issuecomment-47621923
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: Workaround in Spark for ConcurrentModification...

2014-07-01 Thread colorant
Github user colorant commented on the pull request:

https://github.com/apache/spark/pull/1000#issuecomment-47622801
  
It seems that this workaround does not work for me on Hadoop 2.2.0; I still 
hit this problem from within the synchronized block with the latest trunk code:

java.util.ConcurrentModificationException
java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
java.util.HashMap$KeyIterator.next(HashMap.java:828)
java.util.AbstractCollection.addAll(AbstractCollection.java:305)
java.util.HashSet.init(HashSet.java:100)
org.apache.hadoop.conf.Configuration.init(Configuration.java:554)
org.apache.hadoop.mapred.JobConf.init(JobConf.java:439)
org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:189)
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:184)
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:59)
org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)




[GitHub] spark pull request: Workaround in Spark for ConcurrentModification...

2014-07-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1000#issuecomment-47622999
  
@colorant if you can look into it and submit a fix, that'd be great!
 Thanks for reporting this.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47624045
  
This is indeed threadsafe, but perhaps overeager. I think we should aim for 
get()s to be relatively fast, and I think we can avoid extra synchronization 
there.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14391299
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -50,11 +50,12 @@ trait SQLConf {
   /** ** SQLConf functionality methods  */
 
   @transient
-  private val settings = java.util.Collections.synchronizedMap(
-    new java.util.HashMap[String, String]())
+  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
 
   def set(props: Properties): Unit = {
-    props.asScala.foreach { case (k, v) => this.settings.put(k, v) }
+    settings.synchronized {
--- End diff --

Perhaps we can remove this synchronized, I don't think we care about the 
consistency guarantees of inserting multiple properties at once :)
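
A sketch of what dropping the block would leave, relying only on ConcurrentHashMap's per-key atomicity (an illustration based on the diff above, not the PR's final code):

```scala
def set(props: Properties): Unit = {
  // Each put is individually atomic on a ConcurrentHashMap; we only give up
  // the guarantee that readers observe all of the properties appearing at once.
  props.asScala.foreach { case (k, v) => settings.put(k, v) }
}
```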




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14391324
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -64,20 +65,21 @@ trait SQLConf {
   }
 
   def get(key: String): String = {
-    if (!settings.containsKey(key)) {
-      throw new NoSuchElementException(key)
-    }
     settings.get(key)
   }
 
   def get(key: String, defaultValue: String): String = {
-    if (!settings.containsKey(key)) defaultValue else settings.get(key)
+    settings.synchronized {
+      if (!settings.containsKey(key)) defaultValue else settings.get(key)
--- End diff --

Let's use the ConcurrentHashMap-safe paradigm of

```scala
Option(settings.get(key)).getOrElse(defaultValue)
```

Note that ConcurrentHashMap does not allow null values, so this is safe.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14391342
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -64,20 +65,21 @@ trait SQLConf {
   }
 
   def get(key: String): String = {
-    if (!settings.containsKey(key)) {
-      throw new NoSuchElementException(key)
-    }
     settings.get(key)
   }
 
   def get(key: String, defaultValue: String): String = {
-    if (!settings.containsKey(key)) defaultValue else settings.get(key)
+    settings.synchronized {
+      if (!settings.containsKey(key)) defaultValue else settings.get(key)
+    }
   }
 
   def getAll: Array[(String, String)] = settings.asScala.toArray
 
   def getOption(key: String): Option[String] = {
-    if (!settings.containsKey(key)) None else Some(settings.get(key))
+    settings.synchronized {
+      if (!settings.containsKey(key)) None else Some(settings.get(key))
--- End diff --

Similarly, here, just 

```scala
Option(settings.get(key))
```




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47624200
  
With the new `synchronized` blocks you added around the accesses, I don't 
think we need `ConcurrentHashMap` any more. Maybe a simple `HashMap` is enough.
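
A minimal sketch of that suggestion, assuming every access goes through a single lock (an illustration only, not code from this PR):

```scala
import scala.collection.JavaConverters._

@transient
private val settings = new java.util.HashMap[String, String]()

// With a plain HashMap, all reads, writes, and iteration must share the
// map's monitor.
def set(key: String, value: String): Unit = settings.synchronized {
  settings.put(key, value)
}

def get(key: String, defaultValue: String): String = settings.synchronized {
  if (!settings.containsKey(key)) defaultValue else settings.get(key)
}

def getAll: Array[(String, String)] = settings.synchronized {
  settings.asScala.toArray
}
```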




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14391366
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -64,20 +65,21 @@ trait SQLConf {
   }
 
   def get(key: String): String = {
-    if (!settings.containsKey(key)) {
-      throw new NoSuchElementException(key)
-    }
     settings.get(key)
   }
 
   def get(key: String, defaultValue: String): String = {
-    if (!settings.containsKey(key)) defaultValue else settings.get(key)
+    settings.synchronized {
+      if (!settings.containsKey(key)) defaultValue else settings.get(key)
--- End diff --

(Note that switching to this allows us to avoid adding the synchronized {} 
blocks.)




[GitHub] spark pull request: SPARK-2332 [build] add exclusion for old servl...

2014-07-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1271#issuecomment-47625138
  
Thanks. I'm merging this in master.




[GitHub] spark pull request: SPARK-2332 [build] add exclusion for old servl...

2014-07-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1271




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47625269
  
Thanks @aarondav, I have modified the code according to your comments. 
Please help me check whether it is proper.




[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-47625603
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16279/




[GitHub] spark pull request: [SPARK-2234][SQL]Spark SQL basicOperators add ...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1151#issuecomment-47625602
  
Build finished. 




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47625708
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47625713
  
Merged build started. 




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47626045
  
Can you undo the indent spacing change?




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14391912
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -50,11 +50,10 @@ trait SQLConf {
   /** ** SQLConf functionality methods  */
 
   @transient
-  private val settings = java.util.Collections.synchronizedMap(
-    new java.util.HashMap[String, String]())
+  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
 
   def set(props: Properties): Unit = {
-    props.asScala.foreach { case (k, v) => this.settings.put(k, v) }
+      props.asScala.foreach { case (k, v) => this.settings.put(k, v) }
--- End diff --

Can you revert this change?




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14392109
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -64,20 +63,17 @@ trait SQLConf {
   }
 
   def get(key: String): String = {
-    if (!settings.containsKey(key)) {
--- End diff --

Probably keeping the checking logic here would be better.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47626937
  
And I also saw this code:
```
def toDebugString: String = {
  settings.synchronized {
    settings.asScala.toArray.sorted.map { case (k, v) => s"$k=$v" }.mkString("\n")
  }
}
```
Should we also remove this synchronized block, since we use 
ConcurrentHashMap instead?
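
If `settings` is a `ConcurrentHashMap`, the lock can indeed go away: its iterators are weakly consistent rather than fail-fast, so concurrent writers cannot make the traversal throw. A sketch under that assumption:

```scala
def toDebugString: String =
  settings.asScala.toArray.sorted.map { case (k, v) => s"$k=$v" }.mkString("\n")
```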




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14392261
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -50,8 +50,7 @@ trait SQLConf {
   /** ** SQLConf functionality methods  */
 
   @transient
-  private val settings = java.util.Collections.synchronizedMap(
-    new java.util.HashMap[String, String]())
+  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
--- End diff --

We should not be using ConcurrentHashMap because this is a very low 
contention code path. For a low contention code path, ConcurrentHashMap is a 
very poor choice; as a matter of fact it will likely be much slower than 
synchronized, and use a lot more memory.
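
For reference, the synchronizedMap alternative looks roughly like this. Per the java.util.Collections javadoc, single-key operations are locked by the wrapper itself, but iteration still needs an explicit lock (a sketch, not the PR's final code):

```scala
import scala.collection.JavaConverters._

@transient
private val settings = java.util.Collections.synchronizedMap(
  new java.util.HashMap[String, String]())

// Individual put/get calls are synchronized by the wrapper, but iteration
// over the wrapped map must hold the map's own monitor:
def getAll: Array[(String, String)] = settings.synchronized {
  settings.asScala.toArray
}
```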




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47627343
  
Hi @rxin, I have removed the indent spacing on
def set(props: Properties): Unit = {
  props.asScala.foreach { case (k, v) => this.settings.put(k, v) }
}
Please help me check whether it is proper.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread baishuo
Github user baishuo commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14392768
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -50,8 +50,7 @@ trait SQLConf {
   /** ** SQLConf functionality methods  */
 
   @transient
-  private val settings = java.util.Collections.synchronizedMap(
-    new java.util.HashMap[String, String]())
+  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
--- End diff --

I will undo this and go back to Collections.synchronizedMap.




[GitHub] spark pull request: Workaround in Spark for ConcurrentModification...

2014-07-01 Thread colorant
Github user colorant commented on the pull request:

https://github.com/apache/spark/pull/1000#issuecomment-47628775
  
@rxin correct me if I am wrong. 

The problem here is that the broadcastedConf is per task in HadoopRDD, so 
synchronizing on the method or on the broadcastedConf itself only helps within 
that task. But when you call broadcastedConf.value.value, you actually get back 
the value saved in the memory store (when memory is sufficient and the 
deserialized storage level is used), so this conf object should be the same one 
per node. In other words, when tasks fetch the conf, nothing prevents different 
tasks from getting the same conf object, and passing this shared conf object to 
JobConf(conf) leads to this problem.

If I am right, then broadcastedConf.value.value.synchronized might solve 
this problem?

I am not 100% sure these cross-task references behave as I described above. 
What do you think? I will try to modify the code and see if it works; if it 
does, I can do a quick pull request.
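
A minimal sketch of that idea, assuming the shared Configuration really is the object concurrent tasks race on (the lock target is the hypothesis here; broadcastedConf and getJobConf are the existing HadoopRDD members, and the real method also caches the result):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

protected def getJobConf(): JobConf = {
  val conf: Configuration = broadcastedConf.value.value
  // Lock on the per-node shared Configuration so that one task cannot clone
  // it (JobConf's constructor iterates over its entries) while another
  // thread is mutating it.
  conf.synchronized {
    new JobConf(conf)
  }
}
```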




[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread colorant
GitHub user colorant opened a pull request:

https://github.com/apache/spark/pull/1273

[SPARK-1097] Workaround Hadoop conf ConcurrentModification issue

Workaround Hadoop conf ConcurrentModification issue



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/colorant/spark hadoopRDD

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1273


commit 37c13b30a80793e05dd2300f9accbc29db17a336
Author: Raymond Liu raymond@intel.com
Date:   2014-07-01T08:33:33Z

Workaround Hadoop conf ConcurrentModification issue






[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1273#issuecomment-47631084
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1273#issuecomment-47631092
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread colorant
Github user colorant commented on the pull request:

https://github.com/apache/spark/pull/1273#issuecomment-47631094
  
as described in #1000 




[GitHub] spark pull request: Workaround in Spark for ConcurrentModification...

2014-07-01 Thread colorant
Github user colorant commented on the pull request:

https://github.com/apache/spark/pull/1000#issuecomment-47631409
  
@rxin, the PR is at #1273. I tried around 10 batches of jobs with that patch 
and did not see this problem happen again. Without the patch, on my nodes, it 
does happen from time to time; roughly every 1-3 jobs hit this problem.




[GitHub] spark pull request: [SPARK-2185] Emit warning when task size excee...

2014-07-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1149#issuecomment-47632471
  
Thanks. I've merged this in master.




[GitHub] spark pull request: [SPARK-2185] Emit warning when task size excee...

2014-07-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1149




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47634108
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16280/




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47634107
  
Merged build finished. 




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47634643
  
Hi @rxin, what is the proper way to modify this? Should I use
settings.synchronized {
   ...
}
to ensure thread safety?




[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1273#issuecomment-47635045
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16281/




[GitHub] spark pull request: [SPARK-1097] Workaround Hadoop conf Concurrent...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1273#issuecomment-47635044
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47635811
  
Oops, the #1268-related errors passed, but other tests failed...




[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1274

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is 
a list of multiple paths and one of them has error

spark.local.dir can be configured as a list of multiple paths, such as 
/data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver 
node has an error, the application will exit, since DiskBlockManager exits 
directly in createLocalDirs. If the disk data2 of a worker node has an error, 
the executor will exit as well.
DiskBlockManager should not exit directly in createLocalDirs if one of the 
spark.local.dir paths has an error. Since spark.local.dir has multiple paths, a 
problem with one of them should not affect the whole application.
I think DiskBlockManager could ignore the bad directory in createLocalDirs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2324

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1274


commit df086731952c669e12673fd673d829b9fdd790a2
Author: yantangzhai tyz0...@163.com
Date:   2014-07-01T10:39:46Z

[SPARK-2324] SparkContext should not exit directly when spark.local.dir is 
a list of multiple paths and one of them has error






[GitHub] spark pull request: Use the Executor's ClassLoader in sc.objectFil...

2014-07-01 Thread darabos
Github user darabos commented on the pull request:

https://github.com/apache/spark/pull/181#issuecomment-47647756
  
I was so slow, Bogdan has already fixed this in #821. Anyway, here's the 
belated test. It's probably still useful to avoid regressions. I tested the 
test by reverting Bogdan's change, and the test then fails with 
`ClassNotFoundException: FileSuiteObjectFileTest`. Either both the fix and the 
test are correct, or they are both bugged :).




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1272#discussion_r14407762
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -50,8 +50,7 @@ trait SQLConf {
   /** ** SQLConf functionality methods  */
 
   @transient
-  private val settings = java.util.Collections.synchronizedMap(
-    new java.util.HashMap[String, String]())
+  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()
--- End diff --

@rxin I think the performance distinction is extremely minor in this case, 
as there is only one ConcurrentHashMap. ConcurrentHashMap's API tends to be 
nicer to use, though, as people may not realize that iteration over a 
SynchronizedMap is not threadsafe, like in the current implementation of 
SQLConf.

As @baishuo mentioned, if we use synchronizedMap we'll have to add 
settings.synchronized {} in a few places now.




[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...

2014-07-01 Thread avulanov
Github user avulanov commented on a diff in the pull request:

https://github.com/apache/spark/pull/1270#discussion_r14409556
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
+
+/**
+ * Evaluator for multilabel classification.
+ * NB: type Double both for prediction and label is retained
+ * for compatibility with model.predict that returns Double
+ * and MLUtils.loadLibSVMFile that loads class labels as Double
+ *
+ * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
+ */
+class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) extends Logging {
+
+  private lazy val numDocs = predictionAndLabels.count
+
+  private lazy val numLabels = predictionAndLabels.flatMap { case (_, labels) => labels }.distinct.count
+
+  /**
+   * Returns strict Accuracy
+   * (for equal sets of labels)
+   * @return strictAccuracy.
+   */
+  lazy val strictAccuracy = predictionAndLabels.filter { case (predictions, labels) =>
+    predictions == labels }.count.toDouble / numDocs
+
+  /**
+   * Returns Accuracy
+   * @return Accuracy.
+   */
+  lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
+    labels.intersect(predictions).size.toDouble / labels.union(predictions).size }.
--- End diff --

Do you suggest extracting labels.intersect(predictions).size as a lazy val? 
Will it then be calculated only once? The operation is performed on a Scala Set 
(not on an RDD). Another option might be to store all the intermediate 
calculations (including the intersection) in an RDD, since they are used in six 
different measures. In that case, I would need to fold over a six-element 
tuple, which would look kind of scary. 
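
A rough sketch of that second option, with illustrative names (`counts` is hypothetical, not part of this PR):

```scala
// Precompute the per-document set statistics once and cache them; each of
// the six measures then becomes a cheap aggregation over this RDD.
private lazy val counts = predictionAndLabels.map { case (predictions, labels) =>
  (labels.intersect(predictions).size,
   labels.union(predictions).size,
   predictions.size,
   labels.size)
}.cache()

lazy val accuracy = counts.map { case (intersect, union, _, _) =>
  intersect.toDouble / union
}.fold(0.0)(_ + _) / numDocs
```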




[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...

2014-07-01 Thread avulanov
Github user avulanov commented on a diff in the pull request:

https://github.com/apache/spark/pull/1270#discussion_r14409643
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
+
+/**
+ * Evaluator for multilabel classification.
+ * NB: type Double both for prediction and label is retained
+ * for compatibility with model.predict that returns Double
+ * and MLUtils.loadLibSVMFile that loads class labels as Double
+ *
+ * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
+ */
+class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) extends Logging {
+
+  private lazy val numDocs = predictionAndLabels.count
+
+  private lazy val numLabels = predictionAndLabels.flatMap { case (_, labels) => labels }.distinct.count
+
+  /**
+   * Returns strict Accuracy
+   * (for equal sets of labels)
+   * @return strictAccuracy.
+   */
+  lazy val strictAccuracy = predictionAndLabels.filter { case (predictions, labels) =>
+    predictions == labels }.count.toDouble / numDocs
+
+  /**
+   * Returns Accuracy
+   * @return Accuracy.
+   */
+  lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
+    labels.intersect(predictions).size.toDouble / labels.union(predictions).size }.
+    fold(0.0)(_ + _) / numDocs
+
--- End diff --

The fold operation is performed on the RDD. I didn't find sum in the RDD 
interface, which is why I used fold; I would be happy to use sum instead. 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
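
For what it's worth, sum is not on RDD itself; it comes from the implicit conversion to DoubleRDDFunctions for RDD[Double], which is already in scope here via `import org.apache.spark.SparkContext._`. A sketch of the accuracy computation using it:

```scala
lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
  labels.intersect(predictions).size.toDouble / labels.union(predictions).size
}.sum() / numDocs
```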




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47672298
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47672282
  
 Merged build triggered. 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread CodingCat
GitHub user CodingCat opened a pull request:

https://github.com/apache/spark/pull/1275

update the comments in SqlParser

SqlParser has been case-insensitive since 
https://github.com/apache/spark/commit/dab5439a083b5f771d5d5b462d0d517fa8e9aaf2 
was merged.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CodingCat/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1275


commit 17931cd5cc79b406104b2e99f3131aa833e360ce
Author: CodingCat zhunans...@gmail.com
Date:   2014-07-01T16:27:39Z

update the comments in SqlParser






[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47677356
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2003] Fix python SparkContext example

2014-07-01 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/1246#issuecomment-47679545
  
fyi, this pull request does not change the doc re setMaster




[GitHub] spark pull request: SPARK-1782: svd for sparse matrix using ARPACK

2014-07-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/964#issuecomment-47681084
  
@vrilleup Just checked Matlab's svd and svds. I don't remember ever using 
options.{tol, maxit} before, so I wonder whether it is useful to expose them to 
users. I did use RCOND before because I needed to compute a very accurate 
solution, but that work was purely academic. In MLlib's implementation, we take 
the A^T A approach, which couldn't give us very accurate small singular values 
if the matrix is ill-conditioned. So this is not useful either. My suggestion 
for the type signature is simply:

~~~
def computeSVD(k: Int, computeU: Boolean)
~~~

Let’s estimate the complexity of the dense approach and the iterative 
approach and decide which to use internally. We can open advanced options 
later, e.g. rcond, iter, method: {dense, arpack}, etc. What do you think?
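
For illustration, a hypothetical call with the proposed minimal signature (RowMatrix as the receiver and the U/s/V result fields are my assumptions, not settled in this thread):

```scala
// k = 20 leading singular values; computeU = true also materializes U.
val svd = mat.computeSVD(k = 20, computeU = true)
val s = svd.s  // singular values
val V = svd.V  // right singular vectors
val U = svd.U  // left singular vectors, only computed because computeU = true
```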




[GitHub] spark pull request: [SPARK-2334] fix rdd.id() Attribute Error

2014-07-01 Thread dianacarroll
GitHub user dianacarroll opened a pull request:

https://github.com/apache/spark/pull/1276

[SPARK-2334] fix rdd.id() Attribute Error

rdd.id() was raising an AttributeError in some cases because self._id is not 
getting set. So instead of returning the _id attribute, return the value of 
id() from the underlying jrdd. Fixes SPARK-2334.
Test with: sc.parallelize([1,2,3]).map(lambda x: x+1).id()


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dianacarroll/spark SPARK-2334

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1276


commit a69e6a9dde47b2eb8bb99e4a67013c107df80eb9
Author: Diana Carroll dcarr...@cloudera.com
Date:   2014-07-01T17:08:45Z

rdd.id(): return id of underlying jrdd

In some cases self._id is not getting set and calls to id() are therefore 
resulting in an AttributeError. This change fixes that by returning the id of 
the underlying jrdd instead.
Test case: sc.parallelize([1,2,3]).map(lambda x: x+1).id()






[GitHub] spark pull request: [SPARK-2334] fix rdd.id() Attribute Error

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1276#issuecomment-47683485
  
 Merged build triggered. 




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47683286
  
We benchmarked treeReduce in our random forest implementation, and since 
the trees generated from each partition are fairly large (more than 100MB), we 
found that treeReduce can significantly reduce the shuffle time from 6mins to 
2mins. Nice work! 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47684501
  
Merged build finished. 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47684503
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16283/




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47684505
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16282/




[GitHub] spark pull request: [SPARK-2327] [SQL] Fix nullabilities of Join/G...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1266#issuecomment-47684504
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47686100
  
@dbtsai Thanks for testing it! I'm going to move `treeReduce` and 
`treeAggregate` to `mllib.rdd.RDDFunctions`. For normal data processing, people 
generally use more partitions than number of cores. In those cases, the driver 
can collect task result while other tasks are running. This is not the optimal 
case for machine learning algorithms. So I think we can keep `treeReduce` and 
`treeAggregate` in mllib for now.
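
As a sketch of what that placement implies for callers (the exact signature and the depth parameter are my assumptions based on the PR title, not confirmed in this thread):

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._

// Hypothetical: combine partial results in a tree pattern instead of sending
// every partition's result straight to the driver. gradients: RDD[Double].
val total = gradients.treeAggregate(0.0)(
  seqOp = (sum, g) => sum + g,   // fold a record into the per-partition value
  combOp = (a, b) => a + b,      // merge two partial values at each tree level
  depth = 2)                     // number of aggregation levels
```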




[GitHub] spark pull request: SPARK-1982 regression: saveToParquetFile doesn...

2014-07-01 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/934#issuecomment-47688117
  
True. Thanks for the reminder. Closing this now.




[GitHub] spark pull request: SPARK-1982 regression: saveToParquetFile doesn...

2014-07-01 Thread AndreSchumacher
Github user AndreSchumacher closed the pull request at:

https://github.com/apache/spark/pull/934




[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1274#discussion_r14418731
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -115,8 +121,9 @@ private[spark] class DiskBlockManager(shuffleManager: 
ShuffleBlockManager, rootD
 
   private def createLocalDirs(): Array[File] = {
     logDebug(s"Creating local directories at root dirs '$rootDirs'")
+    val localDirsResult = ArrayBuffer[File]()
     val dateFormat = new SimpleDateFormat("yyyyMMddHHmmss")
-    rootDirs.split(",").map { rootDir =>
+    rootDirs.split(",").foreach { rootDir =>
--- End diff --

Scala style thing, you can use flatMap instead of foreach here and return 
None in the case where directory creation failed and Some(localDir) in the case 
where it worked.
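
A minimal sketch of that suggestion (`tryCreateDir` is a hypothetical stand-in for the existing retry loop; `rootDirs` and `MAX_DIR_CREATION_ATTEMPTS` are the existing members):

```scala
private def createLocalDirs(): Array[File] = {
  rootDirs.split(",").flatMap { rootDir =>
    // tryCreateDir wraps the existing attempt loop: Some(dir) on success,
    // None after MAX_DIR_CREATION_ATTEMPTS failures.
    val created = tryCreateDir(rootDir)
    if (created.isEmpty) {
      logError(s"Failed $MAX_DIR_CREATION_ATTEMPTS attempts to create local dir" +
        s" in $rootDir. Ignoring this directory.")
    }
    created
  }
}
```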




[GitHub] spark pull request: [SPARK-2334] fix rdd.id() Attribute Error

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1276#issuecomment-47688685
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16284/




[GitHub] spark pull request: [SPARK-2334] fix rdd.id() Attribute Error

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1276#issuecomment-47688683
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1274#discussion_r14418861
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -137,11 +144,12 @@ private[spark] class DiskBlockManager(shuffleManager: ShuffleBlockManager, rootD
       }
       if (!foundLocalDir) {
         logError(s"Failed $MAX_DIR_CREATION_ATTEMPTS attempts to create local dir in $rootDir")
--- End diff --

Maybe add to this log message that you are ignoring this directory going forward, e.g. something simple like:

```scala
logError(s"Failed $MAX_DIR_CREATION_ATTEMPTS attempts to create local dir in $rootDir. " +
  "Ignoring this directory.")
```




[GitHub] spark pull request: [SPARK-2174][MLLIB] treeReduce and treeAggrega...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47688785
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2174][MLLIB] treeReduce and treeAggrega...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47688798
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1274#discussion_r14418939
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -26,6 +26,8 @@ import org.apache.spark.executor.ExecutorExitCode
 import org.apache.spark.network.netty.{PathResolver, ShuffleSender}
 import org.apache.spark.util.Utils
 
+import scala.collection.mutable.ArrayBuffer
--- End diff --

nit: this scala.* import should go into its own block between the java.* and org.* imports. See our [style guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports).
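Per that guide, the imports in this file would end up grouped roughly like this (abridged):

```scala
import java.io.File
import java.text.SimpleDateFormat

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.executor.ExecutorExitCode
import org.apache.spark.network.netty.{PathResolver, ShuffleSender}
import org.apache.spark.util.Utils
```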





[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1274#issuecomment-47689007
  
Jenkins, ok to test.




[GitHub] spark pull request: [SPARK-2324] SparkContext should not exit dire...

2014-07-01 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1274#issuecomment-47690237
  
This change seems reasonable: on large clusters we occasionally see that a single disk on a single machine has failed, and that can cause the entire application to crash because the executor keeps getting restarted until the Master kills the application.

It also allows a more uniform configuration for heterogeneous clusters with different numbers of disks.

The downside of this behavioral change is that a misconfiguration, like mistyping one of your local dirs, may go unnoticed for a while, though it should hopefully become apparent after a `df` or a look at any of the executor logs. Failing fast is generally better, but current Spark does not do a good job of communicating the reason for executors that crash immediately on startup.
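To make the uniform-configuration point concrete, a minimal sketch of the single setting every node could share once failed disks are skipped rather than fatal (paths are hypothetical):

```scala
import org.apache.spark.SparkConf

// hypothetical paths: with this change, a node that lacks /mnt/disk2 would log
// an error and keep the remaining directories instead of killing the executor
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
```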




[GitHub] spark pull request: [SPARK-2174][MLLIB] treeReduce and treeAggrega...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47693931
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-2174][MLLIB] treeReduce and treeAggrega...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47693932
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16285/




[GitHub] spark pull request: [HOTFIX] Synchronize on SQLContext.settings in...

2014-07-01 Thread concretevitamin
GitHub user concretevitamin opened a pull request:

https://github.com/apache/spark/pull/1277

[HOTFIX] Synchronize on SQLContext.settings in tests.

Let's see if this fixes the ongoing series of test failures on a master build machine (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT-pre-YARN/SPARK_HADOOP_VERSION=1.0.4,label=centos/81/).

@pwendell @marmbrus
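For context, the hotfix pattern is presumably along these lines (a sketch only; the map and names are assumptions, not quotes from the patch): every test that mutates the shared settings map first takes the map's monitor.

```scala
// sketch under assumed names: serialize test mutations of the shared map
TestSQLContext.settings.synchronized {
  TestSQLContext.settings.clear()
  TestSQLContext.settings.put("spark.sql.shuffle.partitions", "10")
}
```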

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/concretevitamin/spark test-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1277


commit 28c88bd3aca1336025dbb808e3fdf6ad7ed688ab
Author: Zongheng Yang zonghen...@gmail.com
Date:   2014-07-01T18:42:26Z

Synchronize on SQLContext.settings in tests.






[GitHub] spark pull request: [HOTFIX] Synchronize on SQLContext.settings in...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1277#issuecomment-47694608
  
 Merged build triggered. 




[GitHub] spark pull request: [HOTFIX] Synchronize on SQLContext.settings in...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1277#issuecomment-47694618
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2337] SQL String Interpolation

2014-07-01 Thread ahirreddy
GitHub user ahirreddy opened a pull request:

https://github.com/apache/spark/pull/1278

[SPARK-2337] SQL String Interpolation

```scala
val sqlContext = new SQLContext(...)
import sqlContext._

case class Person(name: String, age: Int)
val people: RDD[Person] = ...
val srdd = sql"SELECT * FROM $people"
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ahirreddy/spark sql-interpolation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1278


commit 99d158ec3a47d8a48e38f522607a4c4539027ea2
Author: Ahir Reddy ahirre...@gmail.com
Date:   2014-07-01T18:32:47Z

Comment

commit b8fea95ffe190bd353e3a75be7af56184680d8aa
Author: Ahir Reddy ahirre...@gmail.com
Date:   2014-07-01T18:44:51Z

Added to comment






[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47695864
  
Jenkins, test this please.




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47696331
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2337] SQL String Interpolation

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1278#issuecomment-47696328
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2337] SQL String Interpolation

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1278#issuecomment-47696347
  
Merged build started. 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47696348
  
Merged build started. 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47703609
  
I just took a look at the running Jenkins... still a lot of errors... weird




[GitHub] spark pull request: [HOTFIX] Synchronize on SQLContext.settings in...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1277#issuecomment-47705529
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16286/




[GitHub] spark pull request: [HOTFIX] Synchronize on SQLContext.settings in...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1277#issuecomment-47705528
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-2337] SQL String Interpolation

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1278#issuecomment-47707177
  
Merged build finished. 




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47707179
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16288/




[GitHub] spark pull request: [SPARK-2337] SQL String Interpolation

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1278#issuecomment-47707181
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16287/




[GitHub] spark pull request: update the comments in SqlParser

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1275#issuecomment-47707178
  
Merged build finished. 




[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...

2014-07-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1270#discussion_r14428136
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
+
+/**
+ * Evaluator for multilabel classification.
+ * NB: type Double both for prediction and label is retained
+ * for compatibility with model.predict that returns Double
+ * and MLUtils.loadLibSVMFile that loads class labels as Double
+ *
+ * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
+ */
+class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) extends Logging {
+
+  private lazy val numDocs = predictionAndLabels.count
+
+  private lazy val numLabels = predictionAndLabels.flatMap { case (_, labels) => labels }.distinct.count
+
+  /**
+   * Returns strict Accuracy
+   * (for equal sets of labels)
+   * @return strictAccuracy.
+   */
+  lazy val strictAccuracy = predictionAndLabels.filter { case (predictions, labels) =>
+    predictions == labels }.count.toDouble / numDocs
+
+  /**
+   * Returns Accuracy
+   * @return Accuracy.
+   */
+  lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
+    labels.intersect(predictions).size.toDouble / labels.union(predictions).size }.
+    fold(0.0)(_ + _) / numDocs
+
--- End diff --

Ah, `sum` is defined in `DoubleRDDFunctions`. But looking at the `map` call, it seems like it would produce an `RDD[Double]`? I would think you can call `sum` if you import `org.apache.spark.rdd.DoubleRDDFunctions`, maybe? Up to you which you like better.




[GitHub] spark pull request: SPARK-2159: Add support for stopping SparkCont...

2014-07-01 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1230#issuecomment-47708185
  
Yeah, I should have said the `exit` command, not the functionality.




[GitHub] spark pull request: [MLLIB] SPARK-2329 Add multi-label evaluation ...

2014-07-01 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/1270#discussion_r14428640
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext._
+
+/**
+ * Evaluator for multilabel classification.
+ * NB: type Double both for prediction and label is retained
+ * for compatibility with model.predict that returns Double
+ * and MLUtils.loadLibSVMFile that loads class labels as Double
+ *
+ * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
+ */
+class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) extends Logging {
+
+  private lazy val numDocs = predictionAndLabels.count
+
+  private lazy val numLabels = predictionAndLabels.flatMap { case (_, labels) => labels }.distinct.count
+
+  /**
+   * Returns strict Accuracy
+   * (for equal sets of labels)
+   * @return strictAccuracy.
+   */
+  lazy val strictAccuracy = predictionAndLabels.filter { case (predictions, labels) =>
+    predictions == labels }.count.toDouble / numDocs
+
+  /**
+   * Returns Accuracy
+   * @return Accuracy.
+   */
+  lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
+    labels.intersect(predictions).size.toDouble / labels.union(predictions).size }.
+    fold(0.0)(_ + _) / numDocs
+
--- End diff --

After `import org.apache.spark.SparkContext._`, it should already be there as an implicit.
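Concretely, with that import in scope the explicit `fold` can become `sum` (a sketch of the variant under discussion, reusing the names from the quoted diff; not the committed code):

```scala
import org.apache.spark.SparkContext._  // implicit conversion to DoubleRDDFunctions for RDD[Double]

lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
  labels.intersect(predictions).size.toDouble / labels.union(predictions).size
}.sum / numDocs
```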




[GitHub] spark pull request: [SPARK-2165] spark on yarn: add support for se...

2014-07-01 Thread knusbaum
GitHub user knusbaum opened a pull request:

https://github.com/apache/spark/pull/1279

[SPARK-2165] spark on yarn: add support for setting maxAppAttempts in the 
ApplicationSubmissionContext
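For reference, a hedged sketch of what wiring this up involves (the config key name is an assumption, not quoted from the patch):

```scala
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
import org.apache.spark.SparkConf

// forward a user-set attempt limit into the YARN submission context,
// leaving YARN's own default in place when the key is absent
def applyMaxAppAttempts(conf: SparkConf, appContext: ApplicationSubmissionContext): Unit = {
  conf.getOption("spark.yarn.maxAppAttempts").map(_.toInt).foreach { attempts =>
    appContext.setMaxAppAttempts(attempts)
  }
}
```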



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/knusbaum/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1279


commit 41e8a394cd74e42f2228eb880442cb0d6902f275
Author: Kyle Nusbaum knusb...@yahoo-inc.com
Date:   2014-06-24T20:19:16Z

Testing

commit c2a2b69b623a792bc3e7e1e278a2be2668573632
Author: Kyle Nusbaum knusb...@yahoo-inc.com
Date:   2014-07-01T20:46:35Z

Preparing for pull

commit b69955080537bebccc1f2e4bf05ee584a1e429f9
Author: Kyle Nusbaum knusb...@yahoo-inc.com
Date:   2014-07-01T20:48:44Z

Merge remote-tracking branch 'community/master'

commit 2532b6755ff2876516679b0c90e97fd031a111df
Author: Kyle Nusbaum knusb...@yahoo-inc.com
Date:   2014-07-01T21:05:15Z

Cleanup






[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47717333
  
Yeah, what is motivating this change? When this class got introduced, @rxin commented that `java.util.ConcurrentHashMap` had a bad memory footprint and suggested the current approach instead.




[GitHub] spark pull request: Update SQLConf.scala

2014-07-01 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1272#issuecomment-47717946
  
Sorry, I didn't realize Reynold had already commented on this thread. The 
current changes with Option look good.
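For readers following along, a minimal sketch of that kind of Option-based accessor over a plain synchronized map (names assumed, not quoted from the PR):

```scala
object SettingsSketch {
  private val settings = new java.util.HashMap[String, String]()

  // reads return Option instead of null; all access takes the map's monitor
  def getOption(key: String): Option[String] = settings.synchronized {
    Option(settings.get(key))
  }

  def set(key: String, value: String): Unit = settings.synchronized {
    settings.put(key, value)
  }
}
```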




[GitHub] spark pull request: SPARK-1782: svd for sparse matrix using ARPACK

2014-07-01 Thread vrilleup
Github user vrilleup commented on the pull request:

https://github.com/apache/spark/pull/964#issuecomment-47722138
  
@mengxr Yes, you are right. Keeping the API simple might be more important than flexibility. I think `tol` can be set to a default value (e.g. 1e-10, which is the default in matlab), but `maxit` is related to `k`: 300 would be enough for most cases, or max(300, k*2) if `k` is large. For the naming, how about `computeTruncatedSVD`? (I saw this term used in many papers, and in my thesis.) Or `computeSVDs` (like `svds` in matlab)?

I think separating `svd` and `svds` into two functions is better than deciding between the dense and sparse implementations internally for the user. It's hard to enumerate all use cases and settle on one implementation choice.
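To pin down those defaults, a small self-contained sketch (the helper name and signature are hypothetical, not from the PR):

```scala
// hypothetical helper: resolve the ARPACK tolerance and iteration cap
def resolveArpackParams(k: Int, tol: Double = 1e-10, maxit: Int = 0): (Double, Int) = {
  // 1e-10 mirrors matlab's default tolerance; the cap scales with k when unset
  val resolvedMaxit = if (maxit > 0) maxit else math.max(300, k * 2)
  (tol, resolvedMaxit)
}
```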




[GitHub] spark pull request: [WIP][SPARK-2340] Resolve History Server file ...

2014-07-01 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/1280

[WIP][SPARK-2340] Resolve History Server file paths properly

We resolve relative paths to the local `file:/` system for `--jars` and `--files` in spark-submit. We should do the same for the history server; a rough sketch of that resolution rule follows the TODOs below.

TODO: make sure event logs are also resolved properly.
TODO: test this on standalone and YARN clusters.
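A rough sketch of the rule (helper name hypothetical): qualified URIs pass through untouched, while scheme-less relative paths are anchored in the local `file:/` system.

```scala
import java.io.File
import java.net.URI

def resolveToFileURI(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri              // already qualified, e.g. hdfs:// or file:/
  else new File(path).getAbsoluteFile.toURI   // relative path -> absolute file: URI
}
```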

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark hist-serv-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1280








[GitHub] spark pull request: [WIP][SPARK-2340] Resolve History Server file ...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1280#issuecomment-47723879
  
 Merged build triggered. 




[GitHub] spark pull request: [WIP][SPARK-2340] Resolve History Server file ...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1280#issuecomment-47723890
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2340] Resolve History Server file paths...

2014-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1280#issuecomment-47725651
  
Merged build started. 



