[GitHub] spark pull request: Fixed streaming examples docs to use run-examp...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/722#discussion_r12514182
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala ---
@@ -35,8 +35,8 @@ import org.apache.spark.SparkConf
  *   <numThreads> is the number of threads the kafka consumer should use
  *
  * Example:
- *    `./bin/spark-submit examples.jar \
- *    --class org.apache.spark.examples.streaming.KafkaWordCount local[2] zoo01,zoo02,zoo03 \
+ *    `bin/run-example \
+ *    org.apache.spark.examples.streaming.KafkaWordCount local[2] zoo01,zoo02,zoo03 \
--- End diff --

this is outdated. KafkaWordCount no longer takes in `<master>`




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/715#discussion_r12515148
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -259,19 +238,30 @@ private[spark] class Executor(
 }
 
 case t: Throwable => {
-  val serviceTime = System.currentTimeMillis() - taskStart
-  val metrics = attemptedTask.flatMap(t => t.metrics)
-  for (m <- metrics) {
-    m.executorRunTime = serviceTime
-    m.jvmGCTime = gcTime - startGCTime
-  }
-  val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
-  execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
+  // Attempt to exit cleanly by informing the driver of our failure.
+  // If anything goes wrong (or this was a fatal exception), we will delegate to
+  // the default uncaught exception handler, which will terminate the Executor.
+  try {
+    logError("Exception in task ID " + taskId, t)
+
+    val serviceTime = System.currentTimeMillis() - taskStart
+    val metrics = attemptedTask.flatMap(t => t.metrics)
+    for (m <- metrics) {
+      m.executorRunTime = serviceTime
+      m.jvmGCTime = gcTime - startGCTime
+    }
+    val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
+    execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))

-  // TODO: Should we exit the whole executor here? On the one hand, the failed task may
-  // have left some weird state around depending on when the exception was thrown, but on
-  // the other hand, maybe we could detect that when future tasks fail and exit then.
-  logError("Exception in task ID " + taskId, t)
+    // Don't forcibly exit unless the exception was inherently fatal, to avoid
+    // stopping other tasks unnecessarily.
+    if (Utils.isFatalError(t)) {
+      ExecutorUncaughtExceptionHandler.uncaughtException(t)
+    }
+  } catch {
+    case t2: Throwable =>
+      ExecutorUncaughtExceptionHandler.uncaughtException(t2)
--- End diff --

Hmm, good point. I kind of like being explicit over relying on the globally 
set uncaught exception handler. I could be happy with getting rid of this and 
replacing it with a comment, though.




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/715#issuecomment-42799372
  
Looks good to me too!




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/715#discussion_r12514626
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -259,19 +238,30 @@ private[spark] class Executor(
 }
 
 case t: Throwable => {
-  val serviceTime = System.currentTimeMillis() - taskStart
-  val metrics = attemptedTask.flatMap(t => t.metrics)
-  for (m <- metrics) {
-    m.executorRunTime = serviceTime
-    m.jvmGCTime = gcTime - startGCTime
-  }
-  val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
-  execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
+  // Attempt to exit cleanly by informing the driver of our failure.
+  // If anything goes wrong (or this was a fatal exception), we will delegate to
+  // the default uncaught exception handler, which will terminate the Executor.
+  try {
+    logError("Exception in task ID " + taskId, t)
+
+    val serviceTime = System.currentTimeMillis() - taskStart
+    val metrics = attemptedTask.flatMap(t => t.metrics)
+    for (m <- metrics) {
+      m.executorRunTime = serviceTime
+      m.jvmGCTime = gcTime - startGCTime
+    }
+    val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
+    execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))

-  // TODO: Should we exit the whole executor here? On the one hand, the failed task may
-  // have left some weird state around depending on when the exception was thrown, but on
-  // the other hand, maybe we could detect that when future tasks fail and exit then.
-  logError("Exception in task ID " + taskId, t)
+    // Don't forcibly exit unless the exception was inherently fatal, to avoid
+    // stopping other tasks unnecessarily.
+    if (Utils.isFatalError(t)) {
+      ExecutorUncaughtExceptionHandler.uncaughtException(t)
+    }
+  } catch {
+    case t2: Throwable =>
+      ExecutorUncaughtExceptionHandler.uncaughtException(t2)
--- End diff --

Can't the uncaught exception handler for this thread be set to deal with 
this, instead of another catch?
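For readers following the thread, a minimal sketch of what that alternative could look like, assuming the executor's task threads are created somewhere we control; the thread name and the stand-in Runnable are illustrative, not the actual patch:

```scala
// Hypothetical sketch: install the handler on each task thread up front, so an
// uncaught Throwable reaches ExecutorUncaughtExceptionHandler (the object used
// in the patch) without a second catch block. The Runnable is a stand-in for
// the real task runner.
val taskRunner = new Runnable { def run(): Unit = () }
val worker = new Thread(taskRunner, "Executor task launch worker")
worker.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, t: Throwable): Unit =
    ExecutorUncaughtExceptionHandler.uncaughtException(t)
})
worker.start()
```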




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/715#issuecomment-42797519
  
> It also turns out that it is unlikely that the IndestructibleActorSystem
> actually works, given testing (here).

It works, but in the case of OOMs the behavior can be very sporadic. The only reason it was needed was that in the Akka 2.0.x days Netty was tolerating OOMs, so Akka almost never got the *chance* to deal with them: Netty almost always hit them first and mostly managed to swallow them. Very weird behavior. Just saying for posterity.




[GitHub] spark pull request: [SPARK-1752][MLLIB] Standardize text format fo...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/685#issuecomment-42623397
  
Merged build started. 




[GitHub] spark pull request: Include the sbin/spark-config.sh in spark-exec...

2014-05-11 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/651#issuecomment-42615028
  
LGTM - @bouk did you test this new fix and make sure it works?




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/715#issuecomment-42797560
  
Looks pretty good to me, just made one small comment. I think it's good to 
eliminate these now. I haven't seen many cases where they're super useful.




[GitHub] spark pull request: [SPARK-1745] Move interrupted flag from TaskCo...

2014-05-11 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/675#issuecomment-42592936
  
 LGTM, merging into master and 1.0. Thanks!




[GitHub] spark pull request: Update RoutingTable.scala

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/647#issuecomment-42465781
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1778] [SQL] Add 'limit' transformation ...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/711#issuecomment-42642514
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1754] [SQL] Add missing arithmetic DSL ...

2014-05-11 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/689#issuecomment-42516176
  
Oops, I'll add it.




[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/717#issuecomment-42715653
  
Merged build started. 




[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

2014-05-11 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/715#discussion_r12514419
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -71,7 +71,7 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
           stopDaemon()
           startDaemon()
           new Socket(daemonHost, daemonPort)
-        case e: Throwable => throw e
+        case e: Exception => throw e
--- End diff --

Do we really need this line?




[GitHub] spark pull request: [SPARK-1745] Move interrupted flag from TaskCo...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/675#discussion_r12406092
  
--- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
@@ -58,6 +60,6 @@ class TaskContext(
   def executeOnCompleteCallbacks() {
 completed = true
 // Process complete callbacks in the reverse order of registration
-onCompleteCallbacks.reverse.foreach{_()}
+onCompleteCallbacks.reverse.foreach{ _() }
--- End diff --

oh yeah, idk how I missed that




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/636#issuecomment-42504003
  
@mridulm @lianhuiwang thanks for the comments, I addressed all of them and 
now it should be correct 




[GitHub] spark pull request: support leftsemijoin for sparkSQL

2014-05-11 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/395#issuecomment-42795884
  
I'll switch to a newer branch with #418 to split leftsemi from other joins.




[GitHub] spark pull request: Fixed streaming examples docs to use run-examp...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/722#discussion_r12514206
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala ---
@@ -30,32 +30,27 @@ import org.apache.spark.streaming.receiver.Receiver
  * Custom Receiver that receives data over a socket. Received bytes is interpreted as
  * text and \n delimited lines are considered as records. They are then counted and printed.
  *
- * Usage: CustomReceiver <master> <hostname> <port>
- *   <master> is the Spark master URL. In local mode, <master> should be 'local[n]' with n > 1.
- *   <hostname> and <port> of the TCP server that Spark Streaming would connect to receive data.
- *
  * To run this on your local machine, you need to first run a Netcat server
  *    `$ nc -lk <port>`
  * and then run the example
- *    `$ ./run org.apache.spark.examples.streaming.CustomReceiver local[2] localhost <port>`
+ *    `$ bin/run-example org.apache.spark.examples.streaming.CustomReceiver localhost <port>`
--- End diff --

you could actually just do `bin/run-example streaming.CustomReceiver`. Up 
to you




[GitHub] spark pull request: Fixed streaming examples docs to use run-examp...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/722#discussion_r12514192
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala ---
@@ -31,8 +31,7 @@ import org.apache.spark.storage.StorageLevel
  * To run this on your local machine, you need to first run a Netcat server
  *    `$ nc -lk <port>`
  * and then run the example
- *    `$ ./bin/spark-submit examples.jar \
- *    --class org.apache.spark.examples.streaming.NetworkWordCount localhost <port>`
+ *    `$ ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost <port>`
--- End diff --

Same here, no localhost




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42793347
  
My only concern is that I would prefer that things work slowly rather than fail. With reference tracking disabled it is not possible to serialize user-defined types from the spark-shell.

A second concern is that it will be difficult for the user to enable reference tracking if we disable it in the GraphX Kryo registrar.
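For context, a minimal sketch of the application-level switch involved here, assuming the standard `spark.kryo.referenceTracking` conf key; whether this can win over a registrar that disables tracking is exactly the concern above:

```scala
import org.apache.spark.SparkConf

// Sketch: a user opting back into reference tracking at the application level.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "true")
```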




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/727#issuecomment-42787858
  
 Merged build triggered. 




[GitHub] spark pull request: Improve build configuration ...

2014-05-11 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/590#issuecomment-42792564
  
@pwendell
The big changes have been removed.
This PR can be merged into master and branch-1.0.




[GitHub] spark pull request: SPARK-1577: Enabling reference tracking by def...

2014-05-11 Thread jegonzal
Github user jegonzal closed the pull request at:

https://github.com/apache/spark/pull/499




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42791693
  
Alternatively, find a way to work around that in the repl so it can safely be turned on.

On Sunday, May 11, 2014, Matei Zaharia  wrote:

> Alright, then I'll merge this as is. You guys should add some docs in both
> the GraphX programming guide and GraphXKryoSerializer to mention that it's
> recommended to turn off reference tracking. Just send a separate PR for
> that. (Doc changes can also go in after 1.0 is officially cut, we can
> update the website).
>
> —
> Reply to this email directly or view it on GitHub.
>




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42791799
  
I think we can warn if it's on or something. I wouldn't add code to disable 
it since we might be able to fix it to work there too.




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42791672
  
btw as far as I can tell Kryo reference tracking should always be disabled in the spark repl. Should we just do that in the future?

On Sunday, May 11, 2014, Matei Zaharia  wrote:

> Alright, then I'll merge this as is. You guys should add some docs in both
> the GraphX programming guide and GraphXKryoSerializer to mention that it's
> recommended to turn off reference tracking. Just send a separate PR for
> that. (Doc changes can also go in after 1.0 is officially cut, we can
> update the website).
>
> —
> Reply to this email directly or view it on GitHub.
>
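For reference, a minimal sketch of the registrator-level switch being discussed, using Kryo's own API; the class name here is illustrative:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch: a registrator that disables Kryo reference tracking, as the GraphX
// registrator had been doing.
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.setReferences(false) // Kryo's reference-tracking switch
  }
}
```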




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/724




[GitHub] spark pull request: Fix error in 2d Graph Partitioner

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/709




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread ankurdave
Github user ankurdave commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42791250
  
This looks good to me. Re-enabling Kryo reference tracking will have a 
performance penalty, but we can easily fix that after the release.




[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42791443
  
Alright, then I'll merge this as is. You guys should add some docs in both 
the GraphX programming guide and GraphXKryoSerializer to mention that it's 
recommended to turn off reference tracking. Just send a separate PR for that. 
(Doc changes can also go in after 1.0 is officially cut, we can update the 
website).




[GitHub] spark pull request: SPARK-1577: Enabling reference tracking by def...

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/499#issuecomment-42791416
  
Actually it looks like this will be subsumed by 
https://github.com/apache/spark/pull/724. You should close this pull request, 
as GitHub won't automatically close it.




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42791203
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14895/




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42791202
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-1631] Correctly set the Yarn app name w...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/539#issuecomment-42513184
  
@vanzin Your proposal about having SparkSubmit call `System.setProperty(spark.app.name)` can be made clean if we just always convert `--name` to `spark.app.name`, which is what the SparkSubmit usage currently suggests but does not fulfill. I will change this in a separate PR to address SPARK-1755.

As for this PR, the changes LGTM.
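A minimal sketch of that convention, with hypothetical names (`parsedName` stands for the parsed `--name` value; SparkSubmit's real plumbing differs):

```scala
import scala.collection.mutable

// Sketch: unconditionally translate --name into spark.app.name so that every
// deploy mode reads one canonical property.
def applyAppName(sysProps: mutable.Map[String, String], parsedName: Option[String]): Unit =
  parsedName.foreach(name => sysProps("spark.app.name") = name)
```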




[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...

2014-05-11 Thread ankurdave
Github user ankurdave commented on a diff in the pull request:

https://github.com/apache/spark/pull/497#discussion_r12456744
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/RoutingTablePartition.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.impl
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.Partitioner
+import org.apache.spark.rdd.RDD
+import org.apache.spark.rdd.ShuffledRDD
+import org.apache.spark.util.collection.{BitSet, PrimitiveVector}
+
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap
+
+/**
+ * A message from the edge partition `pid` to the vertex partition containing `vid` specifying that
+ * the edge partition references `vid` in the specified `position` (src, dst, or both).
+ */
+private[graphx]
+class RoutingTableMessage(
+    var vid: VertexId,
+    var pid: PartitionID,
+    var position: Byte)
+  extends Product2[VertexId, (PartitionID, Byte)] with Serializable {
+  override def _1 = vid
+  override def _2 = (pid, position)
+  override def canEqual(that: Any): Boolean = that.isInstanceOf[RoutingTableMessage]
+}
+
+private[graphx]
+class RoutingTableMessageRDDFunctions(self: RDD[RoutingTableMessage]) {
+  /** Copartition an `RDD[RoutingTableMessage]` with the vertex RDD with the given `partitioner`. */
+  def copartitionWithVertices(partitioner: Partitioner): RDD[RoutingTableMessage] = {
+    new ShuffledRDD[VertexId, (PartitionID, Byte), RoutingTableMessage](self, partitioner)
+      .setSerializer(new RoutingTableMessageSerializer)
+  }
+}
+
+private[graphx]
+object RoutingTableMessageRDDFunctions {
+  import scala.language.implicitConversions
+
+  implicit def rdd2RoutingTableMessageRDDFunctions(rdd: RDD[RoutingTableMessage]) = {
+    new RoutingTableMessageRDDFunctions(rdd)
+  }
+}
+
+private[graphx]
+object RoutingTablePartition {
+  val empty: RoutingTablePartition = new RoutingTablePartition(Array.empty)
+
+  /** Generate a `RoutingTableMessage` for each vertex referenced in `edgePartition`. */
+  def edgePartitionToMsgs(pid: PartitionID, edgePartition: EdgePartition[_, _])
+    : Iterator[RoutingTableMessage] = {
+    // Determine which positions each vertex id appears in using a map where the low 2 bits
+    // represent src and dst
+    val map = new PrimitiveKeyOpenHashMap[VertexId, Byte]
+    edgePartition.srcIds.iterator.foreach { srcId =>
+      map.changeValue(srcId, 0x1, (b: Byte) => (b | 0x1).toByte)
+    }
+    edgePartition.dstIds.iterator.foreach { dstId =>
+      map.changeValue(dstId, 0x2, (b: Byte) => (b | 0x2).toByte)
+    }
+    map.iterator.map { vidAndPosition =>
+      new RoutingTableMessage(vidAndPosition._1, pid, vidAndPosition._2)
+    }
+  }
+
+  /** Build a `RoutingTablePartition` from `RoutingTableMessage`s. */
+  def fromMsgs(numEdgePartitions: Int, iter: Iterator[RoutingTableMessage])
+    : RoutingTablePartition = {
+    val pid2vid = Array.fill(numEdgePartitions)(new PrimitiveVector[VertexId])
+    val srcFlags = Array.fill(numEdgePartitions)(new PrimitiveVector[Boolean])
+    val dstFlags = Array.fill(numEdgePartitions)(new PrimitiveVector[Boolean])
+    for (msg <- iter) {
+      pid2vid(msg.pid) += msg.vid
+      srcFlags(msg.pid) += (msg.position & 0x1) != 0
+      dstFlags(msg.pid) += (msg.position & 0x2) != 0
+    }
+
+    new RoutingTablePartition(pid2vid.zipWithIndex.map {
+      case (vids, pid) => (vids.trim().array, toBitSet(srcFlags(pid)), toBitSet(dstFlags(pid)))
+    })
+  }
+
+  /** Compact the given vector of Booleans into a BitSet. */
+  private def toBitSet(flags: PrimitiveVector[Boolean]): BitSet = {
+    val bitset = new BitSet(flags.size)
+    var i = 0
+    while (i < flags.size) {
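For reference, a tiny illustration (not part of the patch) of the 2-bit `position` encoding that `edgePartitionToMsgs` builds and `fromMsgs` decodes:

```scala
// A vertex referenced as both src and dst carries position 0x3.
val position: Byte = (0x1 | 0x2).toByte
val referencedAsSrc = (position & 0x1) != 0 // low bit marks src
val referencedAsDst = (position & 0x2) != 0 // second bit marks dst
```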

[GitHub] spark pull request: [Docs] Warn about PySpark on YARN on Red Hat

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/682#issuecomment-42478442
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14782/




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12512000
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -466,30 +466,14 @@ private[spark] class Master(
    * launched an executor for the app on it (right now the standalone backend doesn't like having
    * two executors on the same worker).
    */
-  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
-    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
+  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
+    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
+    worker.coresFree > 0
--- End diff --

Earlier, with a single executor, it meant something else.
Now there is a difference... I don't think this is what we would want.
Though I would defer to others on this... @mateiz any thoughts?
 On 12-May-2014 5:32 am, "Nan Zhu"  wrote:

> In core/src/main/scala/org/apache/spark/deploy/master/Master.scala:
>
> > @@ -466,30 +466,14 @@ private[spark] class Master(
> > * launched an executor for the app on it (right now the standalone backend doesn't like having
> > * two executors on the same worker).
> > */
> > -  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
> > -    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
> > +  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
> > +    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
> > +    worker.coresFree > 0
>
> I think so, and we can only assign the executor to the same worker in
> the subsequent schedule() calls...
>
> and this logic has been here for a long while (at least since 0.8.x), the
> scheduling mode proposed in this PR is just to relax this constraint
>
> —
> Reply to this email directly or view it on GitHub.
>




[GitHub] spark pull request: Fix for SPARK-1758: failing test org.apache.sp...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/692#issuecomment-42519649
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/497#issuecomment-42483842
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1470] Spark logger moving to use scala-...

2014-05-11 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/332#discussion_r12512463
  
--- Diff: project/SparkBuild.scala ---
@@ -317,6 +317,7 @@ object SparkBuild extends Build {
   val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
   val excludeJruby = ExclusionRule(organization = "org.jruby")
   val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
+  val excludeScalalogging= ExclusionRule(organization = "com.typesafe", artifact = "scalalogging-slf4j")
--- End diff --

I have not tested this case, and I think that this code is not readable.




[GitHub] spark pull request: [SPARK-1470] Spark logger moving to use scala-...

2014-05-11 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/332#discussion_r12512415
  
--- Diff: core/src/main/scala/org/apache/spark/Logging.scala ---
@@ -116,7 +121,8 @@ trait Logging {
     val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
     if (!log4jInitialized && usingLog4j) {
       val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
-      Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
+      val classLoader = this.getClass.getClassLoader
--- End diff --

Uh, I'm sorry, I think this is a problem caused when merging the code.




[GitHub] spark pull request: Nicer logging for SecurityManager startup

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/678




[GitHub] spark pull request: Fixing typo in als.py

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/696#issuecomment-42580069
  
Merged build started. 




[GitHub] spark pull request: Fix error in 2d Graph Partitioner

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/709#issuecomment-42703154
  
@rxin and @ankurdave take a look at this minor change when you get a 
chance.  I would like to get it into the next release if possible.





[GitHub] spark pull request: SPARK-1668: Add implicit preference as an opti...

2014-05-11 Thread techaddict
Github user techaddict commented on the pull request:

https://github.com/apache/spark/pull/597#issuecomment-42404618
  
@mengxr Here are a few results
```
implicitPref rank numIterations lambda -> rmse
true  10   20 1.0   -> 0.5985187619423589
true  20   20 1.0   -> 0.5822212152847526
true  30   20 1.0   -> 0.5780589497218527
true  30   40 1.0   -> 0.5776665087027969
true  30   40 0.1   -> 0.5768531690541231
true  30   40 0.001 -> 0.5756156814748565
```
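For anyone reproducing these numbers, a hedged sketch of how one row of the table is trained in MLlib (`ratings` is an assumed `RDD[Rating]`; the RMSE computation is omitted):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Sketch: implicit-preference ALS for the last row above,
// rank = 30, numIterations = 40, lambda = 0.001.
def trainRow(ratings: RDD[Rating]) =
  ALS.trainImplicit(ratings, rank = 30, iterations = 40, lambda = 0.001, alpha = 1.0)
```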




[GitHub] spark pull request: [Docs] Update YARN docs

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/701#issuecomment-42617518
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14826/




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/582




[GitHub] spark pull request: SPARK-1686: keep schedule() calling in the mai...

2014-05-11 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/639#issuecomment-42715116
  
@markhamstra Absolutely agree.

@CodingCat The test failure is unrelated, I submitted #716 to fix it. Had 
one last minor comment, other than that LGTM.




[GitHub] spark pull request: Fix for SPARK-1758: failing test org.apache.sp...

2014-05-11 Thread techaddict
Github user techaddict commented on the pull request:

https://github.com/apache/spark/pull/691#issuecomment-42520460
  
Why make pull requests for both branch-1.0 and master? I think #692 should be the only one.




[GitHub] spark pull request: [Spark-1461] Deferred Expression Evaluation (s...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/446#issuecomment-42713579
  
 Build triggered. 




[GitHub] spark pull request: [Docs] Warn about PySpark on YARN on Red Hat

2014-05-11 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/682#issuecomment-42469878
  
I took a quick pass at the latest docs, and it looks like for 1.0+ we only mention maven when we talk about building. I wonder, however, if we should still document that PySpark on YARN requires building with maven, since we can still build with sbt even though it's not documented.




[GitHub] spark pull request: SPARK-1565, update examples to be used with sp...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/552#issuecomment-42517525
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/636#issuecomment-42536702
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1395] Fix "local:" URI support in Yarn ...

2014-05-11 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/560#issuecomment-42463814
  
Just rebased on top of master. No changes.




[GitHub] spark pull request: SPARK-1416: PySpark support for SequenceFile a...

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/455#issuecomment-42790245
  
Hey Nick, sorry I still haven't looked much at this, been delayed with 
other 1.0 stuff. I'll get to it when I can though (or get someone else to try 
it).




[GitHub] spark pull request: SPARK-1577: Enabling reference tracking by def...

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/499#issuecomment-42790193
  
@jegonzal @rxin is this still needed for GraphX to work in the shell in 1.0 
or do you guys have a workaround?




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42789960
  
 Merged build triggered. 




[GitHub] spark pull request: Fix error in 2d Graph Partitioner

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/709#issuecomment-42789961
  
Going to merge this, thanks.




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42789966
  
Merged build started. 




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/730




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42789924
  
Jenkins, this is ok to test




[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-42789917
  
Hey, so is this a new feature that was recently added to Kryo? Seems super 
useful, but in this case, I'd actually make the max buffer size higher by 
default. Or we can use the old setting as a max, and create a new setting for 
the initial buffer size.
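A minimal sketch of that second option, with assumed conf key names (the keys the PR actually uses may differ):

```scala
import com.esotericsoftware.kryo.io.Output
import org.apache.spark.SparkConf

// Sketch: start with a small buffer and let Kryo grow it on demand up to a cap.
val conf = new SparkConf()
val initialMb = conf.getInt("spark.kryoserializer.buffer.mb", 2)    // assumed key
val maxMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)   // assumed key
val output = new Output(initialMb * 1024 * 1024, maxMb * 1024 * 1024)
```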




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42789506
  
LGTM, merged into master and branch-1.0. Thanks!!




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12512027
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -532,6 +516,99 @@ private[spark] class Master(
     }
   }
 
+  private def startMultiExecutorsPerWorker() {
+    // allow user to run multiple executors in the same worker
+    // (within the same worker JVM process)
+    if (spreadOutApps) {
+      for (app <- waitingApps if app.coresLeft > 0) {
+        val memoryPerExecutor = app.desc.memoryPerExecutor
+        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
+          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
+          sortBy(_.memoryFree / memoryPerExecutor).reverse
+        val maxCoreNumPerExecutor = app.desc.maxCorePerExecutor.get
+        // get the maximum total number of executors we can assign
+        var maxLeftExecutorsToAssign = usableWorkers.map(_.memoryFree / memoryPerExecutor).sum
+        var maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor
--- End diff --

I am not very sure of this piece of code... But it is too late in the night right now, so I don't want to make obviously stupid comments due to exhaustion :-)
I am not sure if I can get to this PR in the coming week; please do get it checked out by someone else too!
On 12-May-2014 5:26 am, "Nan Zhu"  wrote:

> In core/src/main/scala/org/apache/spark/deploy/master/Master.scala:
>
> > @@ -532,6 +516,99 @@ private[spark] class Master(
> >  }
> >}
> >
> > +  private def startMultiExecutorsPerWorker() {
> > +    // allow user to run multiple executors in the same worker
> > +    // (within the same worker JVM process)
> > +    if (spreadOutApps) {
> > +      for (app <- waitingApps if app.coresLeft > 0) {
> > +        val memoryPerExecutor = app.desc.memoryPerExecutor
> > +        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
> > +          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
> > +          sortBy(_.memoryFree / memoryPerExecutor).reverse
> > +        val maxCoreNumPerExecutor = app.desc.maxCorePerExecutor.get
> > +        // get the maximum total number of executors we can assign
> > +        var maxLeftExecutorsToAssign = usableWorkers.map(_.memoryFree / memoryPerExecutor).sum
> > +        var maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor
>
> the idea here is, user has an expectation on the maximum cores to assign
> to the application, but this expectation is usually not achievable due to
> the limited cores in each worker;
>
> so the allocation here is to decide executorNum per Worker according to
> the memory space on each worker (because this is a hard limitation), and
> meet the user's expectation on the cores with the best efforts
>
> —
> Reply to this email directly or view it on GitHub.
>
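To make the quantities under discussion concrete, a toy worked example with hypothetical numbers:

```scala
// Two usable workers with 8 GB and 6 GB free, 2 GB per executor,
// and up to 4 cores per executor.
val memoryFreeMb = Seq(8192, 6144)
val memoryPerExecutorMb = 2048
val maxCoreNumPerExecutor = 4
val maxLeftExecutorsToAssign = memoryFreeMb.map(_ / memoryPerExecutorMb).sum // 4 + 3 = 7
val maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor          // 28
```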




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/727#issuecomment-42788831
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14894/




[GitHub] spark pull request: SPARK-1487 [SQL] Support record filtering via ...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/511#issuecomment-42737719
  
Merged build started. 




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/727#issuecomment-42788830
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42788446
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14892/




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42788445
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/727




[GitHub] spark pull request: [SPARK-1779] add warning when memoryFraction i...

2014-05-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/714#discussion_r12511779
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
@@ -76,6 +76,12 @@ class ExternalAppendOnlyMap[K, V, C](
   private val maxMemoryThreshold = {
     val memoryFraction = sparkConf.getDouble("spark.shuffle.memoryFraction", 0.3)
     val safetyFraction = sparkConf.getDouble("spark.shuffle.safetyFraction", 0.8)
+    if (memoryFraction > 1 && memoryFraction < 0) {
--- End diff --

oops, good call
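
For the record, the flagged condition `memoryFraction > 1 && memoryFraction < 0` can never hold, so the warning would never fire. A minimal sketch of a check that can (hypothetical helper, not the actual patch):

    // Hypothetical sketch: a bounds check that can actually fire.
    // The diff above uses &&, which no value satisfies; it must be ||.
    def checkFraction(name: String, value: Double): Double = {
      if (value > 1 || value < 0) {
        throw new IllegalArgumentException(
          s"$name should be between 0 and 1 (was $value)")
      }
      value
    }

    val memoryFraction = checkFraction("spark.shuffle.memoryFraction", 0.3)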




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42787495
  
@aarondav check again? Responded to your feedback.




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/727#issuecomment-42787822
  
Looks good to me.




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511553
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -532,6 +516,99 @@ private[spark] class Master(
     }
   }
 
+  private def startMultiExecutorsPerWorker() {
+    // allow user to run multiple executors in the same worker
+    // (within the same worker JVM process)
+    if (spreadOutApps) {
+      for (app <- waitingApps if app.coresLeft > 0) {
+        val memoryPerExecutor = app.desc.memoryPerExecutor
+        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
+          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
+          sortBy(_.memoryFree / memoryPerExecutor).reverse
+        val maxCoreNumPerExecutor = app.desc.maxCorePerExecutor.get
+        // get the maximum total number of executors we can assign
+        var maxLeftExecutorsToAssign = usableWorkers.map(_.memoryFree / memoryPerExecutor).sum
+        var maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor
+        val numUsable = usableWorkers.length
+        // Number of cores of each executor assigned to each worker
+        val assigned = Array.fill[ListBuffer[Int]](numUsable)(new ListBuffer[Int])
+        val assignedSum = Array.fill[Int](numUsable)(0)
+        var pos = 0
+        val noEnoughMemoryWorkers = new HashSet[Int]
+        while (maxLeftExecutorsToAssign > 0 && noEnoughMemoryWorkers.size < numUsable) {
+          if (usableWorkers(pos).coresFree - assignedSum(pos) >= 0) {
--- End diff --

> 0 ?




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/727#issuecomment-42787862
  
Merged build started. 




[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/720#issuecomment-42787533
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14893/




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511614
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -532,6 +516,99 @@ private[spark] class Master(
     }
   }
 
+  private def startMultiExecutorsPerWorker() {
+    // allow user to run multiple executors in the same worker
+    // (within the same worker JVM process)
+    if (spreadOutApps) {
+      for (app <- waitingApps if app.coresLeft > 0) {
+        val memoryPerExecutor = app.desc.memoryPerExecutor
+        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
+          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
+          sortBy(_.memoryFree / memoryPerExecutor).reverse
+        val maxCoreNumPerExecutor = app.desc.maxCorePerExecutor.get
+        // get the maximum total number of executors we can assign
+        var maxLeftExecutorsToAssign = usableWorkers.map(_.memoryFree / memoryPerExecutor).sum
+        var maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor
--- End diff --

the idea here is that the user has an expectation of the maximum number of
cores to assign to the application, but this expectation is usually not
achievable due to the limited cores on each worker;

so the allocation here decides the number of executors per worker according
to the memory available on each worker (because memory is a hard limit), and
meets the user's expectation on cores on a best-effort basis
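
A toy sketch of that idea with assumed numbers (illustration only, not the PR's code):

    // Memory is the hard limit: executor slots per worker are bounded by
    // memoryFree / memoryPerExecutor; cores are then spread across those
    // slots on a best-effort basis. All names and figures are assumed.
    case class Worker(memoryFree: Int, coresFree: Int)

    val memoryPerExecutor = 2048                          // MB, assumed setting
    val workers = Seq(Worker(8192, 8), Worker(4096, 4))
    val slotsPerWorker = workers.map(_.memoryFree / memoryPerExecutor)  // Seq(4, 2)
    val maxExecutorsToAssign = slotsPerWorker.sum                       // 6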




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511602
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -466,30 +466,14 @@ private[spark] class Master(
    * launched an executor for the app on it (right now the standalone backend doesn't like having
    * two executors on the same worker).
    */
-  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
-    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
+  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
+    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
+    worker.coresFree > 0
--- End diff --

So what happens if the worker is already running one executor for the app -
we can't schedule another executor on that worker until the previous one is
done? (in this or subsequent schedule attempts)




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/727#discussion_r12511646
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -202,6 +202,39 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     assert(repartitioned2.collect().toSet === (1 to 1000).toSet)
   }
 
+  test("repartitioned RDDs perform load balancing") {
+    // Coalesce partitions
+    val input = Array.fill(1000)(1)
+    val initialPartitions = 10
+    val data = sc.parallelize(input, initialPartitions)
+
+    val repartitioned1 = data.repartition(2)
+    assert(repartitioned1.partitions.size == 2)
+    val partitions1 = repartitioned1.glom().collect()
+    // some noise in balancing is allowed due to randomization
+    assert(math.abs(partitions1(0).length - 500) < initialPartitions)
+    assert(math.abs(partitions1(1).length - 500) < initialPartitions)
+    assert(repartitioned1.collect() === input)
+
+    def testSplitPartitions(input: Seq[Int], initialPartitions: Int, finalPartitions: Int) {
+      val data = sc.parallelize(input, initialPartitions)
+      val repartitioned = data.repartition(finalPartitions)
+      assert(repartitioned.partitions.size == finalPartitions)
--- End diff --

Maybe use `===` here for a nicer failure message.
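
For context, ScalaTest's `===` includes both operands in the failure message (e.g. "3 did not equal 4"), whereas a bare `==` assertion only reports that it failed. A small illustrative sketch, with a made-up suite name:

    import org.scalatest.FunSuite

    class ExampleSuite extends FunSuite {
      test("=== reports both sides when it fails") {
        val finalPartitions = 4
        val actual = 3  // stand-in for repartitioned.partitions.size
        assert(actual === finalPartitions)  // fails with "3 did not equal 4"
      }
    }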




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511647
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -466,30 +466,14 @@ private[spark] class Master(
    * launched an executor for the app on it (right now the standalone backend doesn't like having
    * two executors on the same worker).
    */
-  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
-    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
+  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
+    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
+    worker.coresFree > 0
--- End diff --

I think so, and we can only assign another executor to that worker in
subsequent schedule() calls...

Also, this logic has been here for a long while (at least since 0.8.x); the
scheduling mode proposed in this PR just relaxes this constraint.
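
To make the relaxation concrete, a hypothetical sketch of the predicate the PR is heading toward, with stub types standing in for Spark's ApplicationDescription and WorkerInfo (all names here are assumed, not from the patch):

    // Stub types, for illustration only.
    case class AppDesc(memoryPerExecutor: Int)
    case class Worker(memoryFree: Int, coresFree: Int)

    // Relaxed check: keep only the hard resource limits and drop the
    // !worker.hasExecutor(app) restriction, so several executors of the
    // same app may land on one worker.
    def canLaunchExecutor(desc: AppDesc, worker: Worker): Boolean =
      worker.memoryFree >= desc.memoryPerExecutor && worker.coresFree > 0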




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/727#discussion_r12511628
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -328,11 +328,20 @@ abstract class RDD[T: ClassTag](
   def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
       : RDD[T] = {
     if (shuffle) {
+      /** Distributes elements evenly across output partitions, starting from a random partition. */
+      def distributePartition(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
+        var position = (new Random(index)).nextInt(numPartitions)
+        items.map{ t =>
+          position = position + 1 % numPartitions
--- End diff --

This is going to mod the 1 with numPartitions and keep increasing 
`position`, probably not exactly what we want. In reality passing just 
`position` as the key would be fine, because the hashCode for it will be 
position as well, and the Partitioner will mod that with `numPartitions`. No 
need to mod twice.
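
To spell out the precedence issue: `position + 1 % numPartitions` parses as `position + (1 % numPartitions)`, so the counter grows forever without wrapping. A sketch of the suggested shape of the fix (my reading of the comment above, not the merged code):

    import scala.util.Random

    // Emit the raw, ever-increasing position as the key; the hash
    // partitioner takes it modulo numPartitions, so no second modulo
    // is needed here.
    def distributePartition[T](index: Int, items: Iterator[T],
        numPartitions: Int): Iterator[(Int, T)] = {
      var position = new Random(index).nextInt(numPartitions)
      items.map { t =>
        position += 1
        (position, t)
      }
    }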




[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/720#issuecomment-42787532
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

2014-05-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/727#discussion_r12511624
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -328,11 +328,20 @@ abstract class RDD[T: ClassTag](
   def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
       : RDD[T] = {
     if (shuffle) {
+      /** Distributes elements evenly across output partitions, starting from a random partition. */
+      def distributePartition(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
+        var position = (new Random(index)).nextInt(numPartitions)
+        items.map{ t =>
--- End diff --

Put a space before `{`




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42787516
  
Merged build started. 




[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/720#issuecomment-42787518
  
Merged build started. 




[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/720#issuecomment-42787512
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42787511
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511580
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -532,6 +516,99 @@ private[spark] class Master(
     }
   }
 
+  private def startMultiExecutorsPerWorker() {
+    // allow user to run multiple executors in the same worker
+    // (within the same worker JVM process)
+    if (spreadOutApps) {
+      for (app <- waitingApps if app.coresLeft > 0) {
+        val memoryPerExecutor = app.desc.memoryPerExecutor
+        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
+          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
+          sortBy(_.memoryFree / memoryPerExecutor).reverse
+        val maxCoreNumPerExecutor = app.desc.maxCorePerExecutor.get
+        // get the maximum total number of executors we can assign
+        var maxLeftExecutorsToAssign = usableWorkers.map(_.memoryFree / memoryPerExecutor).sum
+        var maxCoresLeft = maxLeftExecutorsToAssign * maxCoreNumPerExecutor
--- End diff --

I am not sure what maxCoresLeft is supposed to signify.
maxLeftExecutorsToAssign seems to be set to the number of memory slots
available, which seems orthogonal to maxCoreNumPerExecutor.



[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42787343
  
I would like to get it into 1.0 if possible.  Otherwise, we could run into 
issues if the user persists graphs to disk or straggler mitigation is used. 
@ankurdave do you see any issues with trying to get this into 1.0?




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511545
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -466,30 +466,14 @@ private[spark] class Master(
    * launched an executor for the app on it (right now the standalone backend doesn't like having
    * two executors on the same worker).
    */
-  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
-    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
+  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
+    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
+    worker.coresFree > 0
--- End diff --

Yes,

but this function is only called when we want to schedule a single executor
on a certain worker.




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511541
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -532,6 +516,99 @@ private[spark] class Master(
     }
   }
 
+  private def startMultiExecutorsPerWorker() {
+    // allow user to run multiple executors in the same worker
+    // (within the same worker JVM process)
+    if (spreadOutApps) {
+      for (app <- waitingApps if app.coresLeft > 0) {
+        val memoryPerExecutor = app.desc.memoryPerExecutor
+        var usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE).
+          filter(worker => worker.coresFree > 0 && worker.memoryFree >= memoryPerExecutor).
--- End diff --

Push `toArray` after the filters?




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/731#discussion_r12511528
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -466,30 +466,14 @@ private[spark] class Master(
    * launched an executor for the app on it (right now the standalone backend doesn't like having
    * two executors on the same worker).
    */
-  def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
-    worker.memoryFree >= app.desc.memoryPerSlave && !worker.hasExecutor(app)
+  private def canUse(app: ApplicationInfo, worker: WorkerInfo): Boolean = {
+    worker.memoryFree >= app.desc.memoryPerExecutor && !worker.hasExecutor(app) &&
+    worker.coresFree > 0
--- End diff --

I am not sure about this, but does the above mean that an application can
be scheduled only once to a worker at a given point in time?
So even if there are multiple cores, different partitions can't be executed
in parallel for an app on that worker?




[GitHub] spark pull request: [WIP] Simplify the build with sbt 0.13.2 featu...

2014-05-11 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/706#issuecomment-42786776
  
Here is a link to our mailing list discussion on the topic:

http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-tp2315.html

Note also that the biggest time sink right now is preparing the 1.0
release; many people are scrambling to finalize APIs and bug fixes before
1.0 ships, and don't have much time at the moment to closely examine
features for 1.1.


On Sun, May 11, 2014 at 3:24 PM, Jacek Laskowski wrote:

> Thanks! I've said it before (when @pwendell asked to hold off) and now
> I'll say it again, as my changes don't seem to find a home soon before the
> *"We are still experimenting"* is over. When is the experimentation
> happening? Is there a branch for it? Is there a discussion on the mailing
> list(s) about how it's going to be done? I'd appreciate more openness in
> this regard (to avoid Kafka's case, where they moved to gradle for no
> apparent reason other than that they didn't seem to have cared to learn
> sbt enough).
>
> —
> Reply to this email directly or view it on GitHub.
>
>




[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/731#issuecomment-42774786
  
Merged build started. 




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/730#discussion_r12511378
  
--- Diff: bin/spark-submit ---
@@ -35,8 +35,10 @@ while (($#)); do
   shift
 done
 
-if [ ! -z $DRIVER_MEMORY ] && [ ! -z $DEPLOY_MODE ] && [ $DEPLOY_MODE = "client" ]; then
-  export SPARK_MEM=$DRIVER_MEMORY
+DEPLOY_MODE=${DEPLOY_MODE:-"client"}
+
+if [ ! -z $DRIVER_MEMORY ] && [ ! $DEPLOY_MODE == "cluster" ]; then
--- End diff --

Sorry not to notice this earlier, but given the default value, I think an
equality check against "client" now makes more sense.

Also, along similar lines, perhaps we could use `-n` instead of `! -z` for
DRIVER_MEMORY, and wrap the variables in quotes. Right now, if
$DRIVER_MEMORY is not set, this actually evaluates `if [ ! -z ]`, which
happens to evaluate to false but is confusing, because `if [ ! -n ]` also
evaluates to false. With quotes we would actually evaluate `if [ -n "" ]`
when the variable is empty, which has the semantics we're looking for.
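
To see the difference concretely, a tiny illustrative shell sketch (not part of the patch):

    #!/usr/bin/env bash
    # With DRIVER_MEMORY unset, unquoted expansion changes the test's arity.
    unset DRIVER_MEMORY
    [ ! -z $DRIVER_MEMORY ] && echo "A"    # expands to [ ! -z ]: false
    [ -n $DRIVER_MEMORY ] && echo "B"      # expands to [ -n ]: TRUE (surprising)
    [ -n "$DRIVER_MEMORY" ] && echo "C"    # [ -n "" ]: false, as intended

Only "B" prints, which is exactly the confusion described above; quoting restores the intended semantics.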




[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/725#issuecomment-42757678
  
Merged build started. 




[GitHub] spark pull request: SPARK-1652: Set driver memory correctly in spa...

2014-05-11 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/730#issuecomment-42783193
  
@aarondav - would never want to make you nervous. I made the suggested 
change. Mind taking a look?




[GitHub] spark pull request: [SPARK-1690] Allow empty lines in PythonRDD

2014-05-11 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/644#issuecomment-42507072
  
@mateiz just realized I could test it from the Python side. Added a
doctest. This makes the Python API behave identically to the Scala API.




[GitHub] spark pull request: [SPARK-1755] Respect SparkSubmit --name on YAR...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/699#issuecomment-42596162
  
Merged build finished. 




[GitHub] spark pull request: [SQL] Fix Performance Issue in data type casti...

2014-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/679#issuecomment-42409179
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14766/



