[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread mridul
Github user mridul commented on the pull request:

https://github.com/apache/spark/pull/900#issuecomment-46661369
  
Hi,
Sorry to intrude in your thread, but perhaps you have the wrong @mridul
referenced in your comments :)

Mridul


On 20 June 2014 14:01, Zhihui Li  wrote:

> Thanks @tgravescs 
> I add a new commit.
>
>- code style
>- default minRegisteredRatio = 0 in yarn mode
>- driver get --num-executors in yarn/alpha
>
> —
> Reply to this email directly or view it on GitHub
> .
>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1293 [SQL] Parquet support for nested ty...

2014-06-20 Thread AndreSchumacher
Github user AndreSchumacher commented on the pull request:

https://github.com/apache/spark/pull/360#issuecomment-46664113
  
Thanks @rxin for merging! Great to see it finally in master. I understand 
that it's quite a bit larger than the average PR. Hopefully there won't be too 
many issues though. Thanks to @marmbrus and @aarondav for reviewing!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2219][SQL] Fix add jar to execute with ...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1154#issuecomment-46664179
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15957/




[GitHub] spark pull request: [SPARK-2219][SQL] Fix add jar to execute with ...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1154#issuecomment-46664178
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [WIP][SPARK-1720]use LD_LIBRARY_PATH instead o...

2014-06-20 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/1031




[GitHub] spark pull request: [WIP][SPARK-1720]use LD_LIBRARY_PATH instead o...

2014-06-20 Thread witgo
GitHub user witgo reopened a pull request:

https://github.com/apache/spark/pull/1031

[WIP][SPARK-1720]use LD_LIBRARY_PATH instead of -Djava.library.path



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark SPARK-1720

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1031.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1031


commit e40de5e1abb26826171ab84f0befde121e62d8d8
Author: witgo 
Date:   2014-06-10T07:46:43Z

use LD_LIBRARY_PATH instead of -Djava.library.path

commit 60153624dc9302258bc453e8b2aceaa5e1b42659
Author: witgo 
Date:   2014-06-20T08:13:41Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-1720






[GitHub] spark pull request: [WIP][SPARK-1720]use LD_LIBRARY_PATH instead o...

2014-06-20 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/1031




[GitHub] spark pull request: [WIP][SPARK-1720]use LD_LIBRARY_PATH instead o...

2014-06-20 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1031#issuecomment-4248
  
This solution won't work




[GitHub] spark pull request: [WIP]SPARK-1719: spark.executor.extraLibraryPa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46667705
  
 Merged build triggered. 




[GitHub] spark pull request: [WIP]SPARK-1719: spark.executor.extraLibraryPa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46667710
  
Merged build started. 




[GitHub] spark pull request: [WIP]SPARK-1719: spark.executor.extraLibraryPa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46670237
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15958/




[GitHub] spark pull request: [WIP]SPARK-1719: spark.executor.extraLibraryPa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46670236
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46676449
  
Merged build started. 




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46676431
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46677366
  
Thanks for the quick review and patch, @rxin!




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1155#issuecomment-46683617
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-06-20 Thread avulanov
GitHub user avulanov opened a pull request:

https://github.com/apache/spark/pull/1155

[MLLIB] [SPARK-2222] Add multiclass evaluation metrics

Adding two classes: 
1) MulticlassMetrics implements various multiclass evaluation metrics
2) MulticlassMetricsSuite implements unit tests for MulticlassMetrics

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/avulanov/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1155


commit d535d62e4518e5054a2784ee29ea478f471a67b9
Author: unknown 
Date:   2014-06-19T15:39:03Z

Multiclass evaluation

commit fcee82d0b99efc5f3416ac43c98d7914a31f40e0
Author: unknown 
Date:   2014-06-20T13:42:28Z

Unit tests. Class rename

commit a5c8ba46689ed13080eb0681814a1d7c1a0cf497
Author: unknown 
Date:   2014-06-20T14:02:47Z

Unit tests. Class rename

commit d5ce98103ddb97c8c83f07a2a71b9a22053c2cde
Author: unknown 
Date:   2014-06-20T14:18:51Z

Comments about Double






[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46684490
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15959/




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46684488
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46684950
  
ping: This should go into 1.0.1
@pwendell




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1104#issuecomment-46694507
  
Merged. Thanks!




[GitHub] spark pull request: [SPARK-2163] class LBFGS optimize with Double ...

2014-06-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1104




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-06-20 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-46702871
  
Thanks, that makes sense.

The block manager machinery seems to be pretty self-contained. The only data 
included in the block manager heartbeat is the block manager ID, and the rest 
of the block manager RPCs concern block-related events. So my inclination 
is not to muck this up with task data, and instead to add a general heartbeat actor.
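To make the proposal concrete, here is a minimal sketch (an assumption of this note, not the PR's actual design) of such a dedicated heartbeat receiver kept separate from the BlockManager RPCs: executors periodically report in with a task-metrics payload, and the driver tracks per-executor liveness. All names here are illustrative.

```scala
// Hypothetical heartbeat message carrying task data alongside the executor ID.
case class Heartbeat(executorId: String, taskMetricsSummary: String, timestampMs: Long)

// A general heartbeat receiver: records the last time each executor
// checked in, and can report which executors have gone silent.
class HeartbeatReceiver(timeoutMs: Long) {
  private var lastSeen = Map.empty[String, Long]

  def receive(hb: Heartbeat): Unit =
    lastSeen += hb.executorId -> hb.timestampMs

  // Executors whose most recent heartbeat is older than the timeout.
  def expired(nowMs: Long): Set[String] =
    lastSeen.collect { case (id, t) if nowMs - t > timeoutMs => id }.toSet
}
```

The point of the design choice above is that block-transfer RPCs stay untouched while task progress flows through this one channel.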




[GitHub] spark pull request: [SPARK-2219][SQL] Fix add jar to execute with ...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1154#issuecomment-46706270
  
This needs to call Spark's addJar, doesn't it?




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14031997
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -313,6 +314,47 @@ class DAGSchedulerSuite extends 
TestKit(ActorSystem("DAGSchedulerSuite")) with F
 assertDataStructuresEmpty
   }
 
+  test("job cancellation no-kill backend") {
+// make sure that the DAGScheduler doesn't crash when the TaskScheduler
+// doesn't implement killTask()
+val noKillTaskScheduler = new TaskScheduler() {
+  override def rootPool: Pool = null
+  override def schedulingMode: SchedulingMode = SchedulingMode.NONE
+  override def start() = {}
+  override def stop() = {}
+  override def submitTasks(taskSet: TaskSet) = {
+// normally done by TaskSetManager
+taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)
--- End diff --

are these lines necessary (can you just do nothing here?)




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46706691
  
Just a small comment on the tests, but other than that this looks good.




[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

2014-06-20 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/813#issuecomment-46708057
  
Looks good - thanks for this. I'm going to merge it.




[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14032678
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -46,9 +46,14 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, actorSystem: A
 {
   // Use an atomic variable to track total number of cores in the cluster 
for simplicity and speed
   var totalCoreCount = new AtomicInteger(0)
+  var totalExecutors = new AtomicInteger(0)
   val conf = scheduler.sc.conf
   private val timeout = AkkaUtils.askTimeout(conf)
   private val akkaFrameSize = AkkaUtils.maxFrameSizeBytes(conf)
+  var minRegisteredRatio = 
conf.getDouble("spark.scheduler.minRegisteredRatio", 0)
--- End diff --

Can you change this to minRegisteredExecutorsRatio?  It's more verbose but 
I think better to be more descriptive.
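For readers following the thread, the behavior under review can be sketched roughly as follows. This is a hedged illustration of the ratio-gating idea (wait until a configured fraction of expected executors has registered before submitting stages), not the PR's code; `RegistrationGate` and its members are invented names.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Gate that becomes ready once registered/expected >= the configured ratio.
class RegistrationGate(expectedExecutors: Int, minRegisteredExecutorsRatio: Double) {
  private val registered = new AtomicInteger(0)

  def executorRegistered(): Unit = {
    registered.incrementAndGet()
  }

  // True once enough executors have checked in to start scheduling.
  // With expectedExecutors == 0 (unknown cluster size) the gate is open.
  def isReady: Boolean =
    expectedExecutors == 0 ||
      registered.get().toDouble / expectedExecutors >= minRegisteredExecutorsRatio
}
```

With a ratio of 0 (the default discussed above for standalone mode), `isReady` is true immediately, preserving the old submit-right-away behavior.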




[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14032722
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.deploy.yarn.ApplicationMasterArguments
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+import scala.collection.mutable.ArrayBuffer
--- End diff --

fix import ordering




[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

2014-06-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/813




[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14032829
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.deploy.yarn.ApplicationMasterArguments
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class YarnClusterSchedulerBackend(
+scheduler: TaskSchedulerImpl,
+sc: SparkContext)
+  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
+  with Logging {
+
+  private[spark] def addArg(optionName: String, envVar: String, sysProp: 
String,
--- End diff --

Why is this private[spark] as opposed to just private?




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14032954
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -313,6 +314,47 @@ class DAGSchedulerSuite extends 
TestKit(ActorSystem("DAGSchedulerSuite")) with F
 assertDataStructuresEmpty
   }
 
+  test("job cancellation no-kill backend") {
+// make sure that the DAGScheduler doesn't crash when the TaskScheduler
+// doesn't implement killTask()
+val noKillTaskScheduler = new TaskScheduler() {
+  override def rootPool: Pool = null
+  override def schedulingMode: SchedulingMode = SchedulingMode.NONE
+  override def start() = {}
+  override def stop() = {}
+  override def submitTasks(taskSet: TaskSet) = {
+// normally done by TaskSetManager
+taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)
--- End diff --

Sure, doing nothing is easy.


On Fri, Jun 20, 2014 at 10:47 AM, Kay Ousterhout 
wrote:

> In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
>
> > @@ -313,6 +314,47 @@ class DAGSchedulerSuite extends 
TestKit(ActorSystem("DAGSchedulerSuite")) with F
> >  assertDataStructuresEmpty
> >}
> >
> > +  test("job cancellation no-kill backend") {
> > +// make sure that the DAGScheduler doesn't crash when the 
TaskScheduler
> > +// doesn't implement killTask()
> > +val noKillTaskScheduler = new TaskScheduler() {
> > +  override def rootPool: Pool = null
> > +  override def schedulingMode: SchedulingMode = SchedulingMode.NONE
> > +  override def start() = {}
> > +  override def stop() = {}
> > +  override def submitTasks(taskSet: TaskSet) = {
> > +// normally done by TaskSetManager
> > +taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)
>
> are these lines necessary (can you just do nothing here?)
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46709007
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46709020
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1412][SQL] Disable partial aggregation ...

2014-06-20 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1152#issuecomment-46709082
  
@rxin If we are simply trying to read the default values for the params, 
but not user-set ones (i.e. in the absence of a `SQLContext` in `execute()`), I 
think we could move the default param values to a companion object of 
`SQLConf`, and in the accessors of this class, either get the user-set values 
or else fall back to the default values from the static object.
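A minimal sketch of the companion-object pattern being proposed, assuming a simplified string-keyed conf; `SQLConfSketch` and its members are illustrative names, not the actual `SQLConf` API:

```scala
import scala.collection.mutable

// Defaults live on the companion object, so they are reachable even
// when no SQLConf/SQLContext instance is available (e.g. in execute()).
object SQLConfSketch {
  val DefaultShufflePartitions = 200
}

class SQLConfSketch {
  private val settings = mutable.Map.empty[String, String]

  def set(key: String, value: String): Unit = settings(key) = value

  // Accessor prefers the user-set value and falls back to the static default.
  def numShufflePartitions: Int =
    settings.get("spark.sql.shuffle.partitions")
      .map(_.toInt)
      .getOrElse(SQLConfSketch.DefaultShufflePartitions)
}
```

The benefit is that code with no `SQLConf` in scope can still read `SQLConfSketch.DefaultShufflePartitions`, while instance accessors keep honoring user overrides.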




[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14033313
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala
 ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.deploy.yarn.ApplicationMasterArguments
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class YarnClusterSchedulerBackend(
+scheduler: TaskSchedulerImpl,
+sc: SparkContext)
+  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
+  with Logging {
+
+  private[spark] def addArg(optionName: String, envVar: String, sysProp: String,
+      arrayBuf: ArrayBuffer[String]) {
+    if (System.getenv(envVar) != null) {
+      arrayBuf += (optionName, System.getenv(envVar))
+    } else if (sc.getConf.contains(sysProp)) {
+      arrayBuf += (optionName, sc.getConf.get(sysProp))
+    }
+  }
+
+  override def start() {
+    super.start()
+    val argsArrayBuf = new ArrayBuffer[String]()
+    List(("--num-executors", "SPARK_EXECUTOR_INSTANCES", "spark.executor.instances"),
+      ("--num-executors", "SPARK_WORKER_INSTANCES", "spark.worker.instances"))
+      .foreach { case (optName, envVar, sysProp) => addArg(optName, envVar, sysProp, argsArrayBuf) }
+    val args = new ApplicationMasterArguments(argsArrayBuf.toArray)
+    totalExecutors.set(args.numExecutors)
--- End diff --

I'm a little confused here -- is the point of this code just to set 
CoarseGrainedSchedulerBackend.totalExecutors?  Why do you check both 
SPARK_WORKER_INSTANCES and SPARK_EXECUTOR_INSTANCES to set the number of 
executors?  Don't these mean different things?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1112, 2156] Bootstrap to fetch the driv...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1132#discussion_r14033961
  
--- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
@@ -19,8 +19,11 @@ package org.apache.spark.executor
 
 import java.nio.ByteBuffer
 
+import scala.concurrent.Await
+
 import akka.actor._
 import akka.remote._
+import akka.pattern.Patterns
--- End diff --

nit: alphabetize imports




[GitHub] spark pull request: [SPARK-1112, 2156] Bootstrap to fetch the driv...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1132#discussion_r14034021
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -22,19 +22,21 @@ import java.nio.ByteBuffer
 import org.apache.spark.TaskState.TaskState
 import org.apache.spark.scheduler.TaskDescription
 import org.apache.spark.util.{SerializableBuffer, Utils}
+import org.apache.spark.SparkConf
--- End diff --

nit: import ordering (this should come first)




[GitHub] spark pull request: [SPARK-1112, 2156] Bootstrap to fetch the driv...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1132#issuecomment-46710983
  
This looks great -- I think this is definitely better than the alternative.




[GitHub] spark pull request: Fixing AWS instance type information based upo...

2014-06-20 Thread jerry86
GitHub user jerry86 opened a pull request:

https://github.com/apache/spark/pull/1156

Fixing AWS instance type information based upon current EC2 data

Fixed a problem in the previous file in which some information regarding AWS instance types was wrong. That information was updated based upon current AWS EC2 data.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerry86/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1156


commit ff36e95a9874f475314eba3f9985307eea5ca280
Author: Zichuan Ye 
Date:   2014-06-20T18:24:01Z

Fixing AWS instance type information based upon current EC2 data






[GitHub] spark pull request: Cleanup on Connection, ConnectionManagerId, Co...

2014-06-20 Thread hsaputra
GitHub user hsaputra opened a pull request:

https://github.com/apache/spark/pull/1157

Cleanup on Connection, ConnectionManagerId, ConnectionManager classes part 2

Cleanup on the Connection, ConnectionManagerId, and ConnectionManager classes, 
part 2, done while I was working on the code there to help the IDE:
1. Remove unused imports.
2. Remove parentheses from method calls that have no side effects.
3. Add parentheses to method calls that have side effects or are not simple 
accesses of object properties.
4. Replace the if-else type check (via isInstanceOf) on the Connection class with 
a Scala expression for consistency and cleanliness.
5. Remove semicolons.
6. Remove extra spaces.
7. Remove redundant returns for consistency.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hsaputra/spark 
cleanup_connection_classes_part2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1157


commit 85b24f70eae2a0ff06874a9b1f656a75550f04a9
Author: Henry Saputra 
Date:   2014-06-20T18:30:57Z

Cleanup on the Connection and ConnectionManager classes, part 2, done while I was 
working on the code there to help the IDE:
1. Remove unused imports.
2. Remove parentheses from method calls that have no side effects.
3. Add parentheses to method calls that have side effects.
4. Replace the if-else type check (via isInstanceOf) on the Connection class with 
a Scala expression for consistency and cleanliness.
5. Remove semicolons.
6. Remove extra spaces.






[GitHub] spark pull request: Cleanup on Connection, ConnectionManagerId, Co...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1157#issuecomment-46712896
  
Merged build started. 




[GitHub] spark pull request: Fixing AWS instance type information based upo...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1156#issuecomment-46712885
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Cleanup on Connection, ConnectionManagerId, Co...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1157#issuecomment-46712883
  
 Merged build triggered. 




[GitHub] spark pull request: [SQL] Use hive.SessionState, not the thread lo...

2014-06-20 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1148#issuecomment-46713449
  
For some reason I feel like this shouldn't be affecting spark streaming 
tests...

Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46713416
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/686#issuecomment-46713417
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15960/




[GitHub] spark pull request: [WIP]SPARK-1719: spark.*.extraLibraryPath isn'...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46713432
  
 Merged build triggered. 




[GitHub] spark pull request: [WIP]SPARK-1719: spark.*.extraLibraryPath isn'...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46713450
  
Merged build started. 




[GitHub] spark pull request: Cleanup on Connection, ConnectionManagerId, Co...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1157#issuecomment-46713995
  
Merged build finished. 




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread frol
GitHub user frol opened a pull request:

https://github.com/apache/spark/pull/1158

Fixed small running on YARN docs typo

The backslash is needed for multiline command

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/frol/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1158


commit e258044a76b01997c37faa2b48880768fb90cfe3
Author: Vlad 
Date:   2014-06-20T18:54:14Z

Fixed small running on YARN docs typo

The backslash is needed for multiline command






[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46713970
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread tgravescs
GitHub user tgravescs opened a pull request:

https://github.com/apache/spark/pull/1159

SPARK-1528 - spark on yarn, add support for accessing remote HDFS

Add a config (spark.yarn.access.namenodes) to allow applications running on 
YARN to access other secure HDFS clusters. The user just specifies the namenodes 
of the other clusters, and we get tokens for those and ship them with the Spark 
application.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgravescs/spark spark-1528

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1159.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1159


commit 2beabdae1cefe9ce6c923c6e3e0e780c5cd4e26a
Author: Thomas Graves 
Date:   2014-06-20T18:32:50Z

SPARK-1528 - add support for accessing remote HDFS






[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14035880
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.deploy.yarn.ApplicationMasterArguments
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class YarnClusterSchedulerBackend(
+scheduler: TaskSchedulerImpl,
+sc: SparkContext)
+  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
+  with Logging {
+
+  private[spark] def addArg(optionName: String, envVar: String, sysProp: String,
+      arrayBuf: ArrayBuffer[String]) {
+    if (System.getenv(envVar) != null) {
+      arrayBuf += (optionName, System.getenv(envVar))
+    } else if (sc.getConf.contains(sysProp)) {
+      arrayBuf += (optionName, sc.getConf.get(sysProp))
+    }
+  }
+
+  override def start() {
+    super.start()
+    val argsArrayBuf = new ArrayBuffer[String]()
+    List(("--num-executors", "SPARK_EXECUTOR_INSTANCES", "spark.executor.instances"),
+      ("--num-executors", "SPARK_WORKER_INSTANCES", "spark.worker.instances"))
+      .foreach { case (optName, envVar, sysProp) => addArg(optName, envVar, sysProp, argsArrayBuf) }
+    val args = new ApplicationMasterArguments(argsArrayBuf.toArray)
+    totalExecutors.set(args.numExecutors)
--- End diff --

no, the config used to be called SPARK_WORKER_INSTANCES and now it's 
SPARK_EXECUTOR_INSTANCES. "Workers" really meant executors on YARN, so this is 
just for backwards compatibility.
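For illustration, the fallback this diff implements can be reduced to a standalone sketch (Java here purely for a runnable illustration; the maps stand in for `System.getenv` and `SparkConf`, and every name is a stand-in rather than real Spark API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FallbackSketch {
    // Mirrors the patch's addArg: prefer the env var, else fall back to the conf key.
    static void addArg(String optionName, String envVar, String sysProp,
                       Map<String, String> env, Map<String, String> conf,
                       List<String> buf) {
        if (env.containsKey(envVar)) {
            buf.add(optionName);
            buf.add(env.get(envVar));
        } else if (conf.containsKey(sysProp)) {
            buf.add(optionName);
            buf.add(conf.get(sysProp));
        }
    }

    // New name checked first, legacy name second, so an old deployment that only
    // sets SPARK_WORKER_INSTANCES still produces a --num-executors argument.
    static List<String> buildArgs(Map<String, String> env, Map<String, String> conf) {
        List<String> buf = new ArrayList<>();
        addArg("--num-executors", "SPARK_EXECUTOR_INSTANCES", "spark.executor.instances", env, conf, buf);
        addArg("--num-executors", "SPARK_WORKER_INSTANCES", "spark.worker.instances", env, conf, buf);
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(buildArgs(Map.of("SPARK_WORKER_INSTANCES", "4"), Map.of()));
    }
}
```

Note that if both names are set, the sketch emits both option pairs, just as the quoted diff would.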




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1159#issuecomment-46714533
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1159#issuecomment-46714550
  
Merged build started. 




[GitHub] spark pull request: Cleanup on Connection, ConnectionManagerId, Co...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1157#issuecomment-46713996
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15961/




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46714672
  
Jenkins, test this please




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46715082
  
Merged build started. 




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46715065
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46715422
  
I tried `having.q` in Hive, and I got an error when running `SELECT key FROM src 
GROUP BY key HAVING max(value) > "val_255"`. The reason is that the output of 
an `Aggregate` only has `selectExpressions`. 
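The failure mode can be pictured as a simple containment check: the HAVING predicate references `max(value)`, which is not among the attributes the `Aggregate` exposes. A toy sketch of the shape of that check (Java for a runnable illustration; this is not Catalyst's actual API):

```java
import java.util.Set;

public class HavingSketch {
    // A HAVING predicate resolves only if every attribute it references
    // appears in the aggregate's output.
    static boolean resolves(Set<String> havingRefs, Set<String> aggregateOutput) {
        return aggregateOutput.containsAll(havingRefs);
    }

    public static void main(String[] args) {
        // SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"
        Set<String> output = Set.of("key"); // the Aggregate only exposes selectExpressions
        System.out.println(resolves(Set.of("key"), output));        // true
        System.out.println(resolves(Set.of("max(value)"), output)); // false: unresolved reference
    }
}
```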




[GitHub] spark pull request: [SPARK-1946] Submit stage after (configured ra...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/900#discussion_r14036842
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+
+import org.apache.spark.{Logging, SparkContext}
+import org.apache.spark.deploy.yarn.ApplicationMasterArguments
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class YarnClusterSchedulerBackend(
+scheduler: TaskSchedulerImpl,
+sc: SparkContext)
+  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
+  with Logging {
+
+  private[spark] def addArg(optionName: String, envVar: String, sysProp: String,
+      arrayBuf: ArrayBuffer[String]) {
+    if (System.getenv(envVar) != null) {
+      arrayBuf += (optionName, System.getenv(envVar))
+    } else if (sc.getConf.contains(sysProp)) {
+      arrayBuf += (optionName, sc.getConf.get(sysProp))
+    }
+  }
+
+  override def start() {
+    super.start()
+    val argsArrayBuf = new ArrayBuffer[String]()
+    List(("--num-executors", "SPARK_EXECUTOR_INSTANCES", "spark.executor.instances"),
+      ("--num-executors", "SPARK_WORKER_INSTANCES", "spark.worker.instances"))
+      .foreach { case (optName, envVar, sysProp) => addArg(optName, envVar, sysProp, argsArrayBuf) }
+    val args = new ApplicationMasterArguments(argsArrayBuf.toArray)
+    totalExecutors.set(args.numExecutors)
--- End diff --

Ah I see -- I was confused by this 
http://spark.apache.org/docs/latest/spark-standalone.html -- since standalone 
mode interprets SPARK_WORKER_INSTANCES differently.  Sorry for the confusion!

My other question was about the ApplicationMasterArguments -- does 
that actually get used, or is it just constructed as a way to get numExecutors?




[GitHub] spark pull request: Feat kryo max buffersize

2014-06-20 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/735#issuecomment-46716428
  
hey, sorry, somehow missed this conversation thread. sure, will update the
defaults and docs


On Wed, Jun 4, 2014 at 1:48 AM, Patrick Wendell 
wrote:

> @koertkuipers  any interest in updating
> the docs and bumping the default? This would be a good change to have.
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14037007
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
   // This is the only job that uses this stage, so fail the stage if it is running.
   val stage = stageIdToStage(stageId)
   if (runningStages.contains(stage)) {
-taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-val stageInfo = stageToInfos(stage)
-stageInfo.stageFailed(failureReason)
-listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+  val stageInfo = stageToInfos(stage)
+  stageInfo.stageFailed(failureReason)
+  listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+} catch {
+  case e: UnsupportedOperationException =>
+    logInfo(s"Could not cancel tasks for stage $stageId", e)
+}
--- End diff --

Sorry Mark one more question here -- can we move the 
job.listener.jobFailed(error) call from line 1041 to here in the "try" clause?  
It seems weird to tell the user the job has been cancelled when, in fact, it 
hasn't.
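The suggestion amounts to notifying the job listener only after cancellation actually succeeds. A minimal standalone sketch (Java, with stand-in names; this is not the real DAGScheduler code):

```java
import java.util.function.Consumer;

public class CancelSketch {
    // Returns true if the stage was cancelled. The listener is notified only
    // inside the try, so an UnsupportedOperationException (thrown when the
    // backend does not implement killTask) never reports a bogus cancellation.
    static boolean cancelStage(Runnable cancelTasks,
                               Consumer<String> notifyJobFailed,
                               Consumer<String> logInfo) {
        try {
            cancelTasks.run();
            notifyJobFailed.accept("Job cancelled");
            return true;
        } catch (UnsupportedOperationException e) {
            logInfo.accept("Could not cancel tasks: " + e.getMessage());
            return false;
        }
    }
}
```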




[GitHub] spark pull request: [WIP]SPARK-1719: spark.*.extraLibraryPath isn'...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46717631
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15962/




[GitHub] spark pull request: [WIP]SPARK-1719: spark.*.extraLibraryPath isn'...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-46717630
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1159#issuecomment-46718362
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1159#issuecomment-46718364
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15963/




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46718870
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: Fixed small running on YARN docs typo

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1158#issuecomment-46718871
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15964/




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1159#discussion_r14038384
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -547,4 +535,36 @@ object ClientBase extends Logging {
 null
   }
 
+  // get the list of namenodes the user may access
--- End diff --

nit: proper scaladoc




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1159#discussion_r14038387
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -547,4 +535,36 @@ object ClientBase extends Logging {
 null
   }
 
+  // get the list of namenodes the user may access
+  private[yarn] def getNameNodesToAccess(sparkConf: SparkConf): Set[Path] = {
+    sparkConf.get("spark.yarn.access.namenodes", "").split(",").map(_.trim()).filter(!_.isEmpty)
+      .map(new Path(_)).toSet
+  }
+
+  private[yarn] def getTokenRenewer(conf: Configuration): String = {
+    val delegTokenRenewer = Master.getMasterPrincipal(conf)
+    logDebug("delegation token renewer is: " + delegTokenRenewer)
+    if (delegTokenRenewer == null || delegTokenRenewer.length() == 0) {
+      val errorMessage = "Can't get Master Kerberos principal for use as renewer"
+      logError(errorMessage)
+      throw new SparkException(errorMessage)
+    }
+    delegTokenRenewer
+  }
+
+  // obtains tokens for the namenodes passed in and adds them to the credentials
--- End diff --

nit: proper scaladoc




[GitHub] spark pull request: SPARK-1528 - spark on yarn, add support for ac...

2014-06-20 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1159#discussion_r14038548
  
--- Diff: docs/running-on-yarn.md ---
@@ -95,6 +95,13 @@ Most of the configs are the same for Spark on YARN as 
for other deployment modes
 The amount of off heap memory (in megabytes) to be allocated per 
driver. This is memory that accounts for things like VM overheads, interned 
strings, other native overheads, etc.
   
 
+
+  spark.yarn.access.namenodes
+  (none)
+  
+A list of secure HDFS namenodes your spark application is going to 
access. For example, 
spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. Spark 
acquires security Tokens for each of the namenodes so that the spark 
application can access those remote HDFS clusters.  
--- End diff --

Maybe it's sort of redundant, but we've seen enough people running 
different HDFS services under different Kerberos realms that I think it should 
be mentioned here that the user running the Spark job needs to be able to 
access all the listed NNs (either by them being on the same realm or in a 
trusted realm).

Also, nits: backquotes around the example, and capitalize "Spark" before 
application.
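
An illustrative sketch of the parsing this option implies (hypothetical helper in plain Scala; Strings stand in for Hadoop Paths, and the real code would then request a delegation token per namenode):

```scala
// Parse a spark.yarn.access.namenodes value: split on commas, trim
// whitespace, and drop empty entries. Plain Strings stand in for
// org.apache.hadoop.fs.Path so the sketch is self-contained.
def parseNameNodes(conf: Map[String, String]): Set[String] =
  conf.getOrElse("spark.yarn.access.namenodes", "")
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .toSet

val conf = Map(
  "spark.yarn.access.namenodes" -> "hdfs://nn1.com:8032, hdfs://nn2.com:8032")
assert(parseNameNodes(conf) == Set("hdfs://nn1.com:8032", "hdfs://nn2.com:8032"))
// An unset value yields an empty set, i.e. no extra tokens are fetched.
assert(parseNameNodes(Map.empty[String, String]).isEmpty)
```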




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46724002
  
Jenkins, test this again. 




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46724012
  
I'm going to merge this in master & branch-1.0. I will create a separate 
ticket to track progress on HAVING. Basically there are two things missing:

1. HAVING without GROUP BY should just become a normal WHERE
2. HAVING should be able to contain aggregate expressions that don't appear 
in the aggregation list. This test contains that: 
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
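
A minimal illustration of point 1 with hypothetical data in plain Scala collections (with no GROUP BY every row is its own trivial group, so a HAVING predicate filters rows exactly like WHERE):

```scala
case class Row(key: String, value: Int)
val rows = Seq(Row("a", 1), Row("b", 5), Row("c", 9))

// "WHERE value > 3" is a plain row filter.
val where = rows.filter(_.value > 3)
// "HAVING value > 3" with no GROUP BY: one trivial group per row, then
// filter the groups -- the same rows survive.
val having = rows.groupBy(_.key).values.flatten.filter(_.value > 3)

assert(where.map(_.key).toSet == Set("b", "c"))
assert(having.map(_.key).toSet == where.map(_.key).toSet)
```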





[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tmalaska
Github user tmalaska commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46724152
  
Let me know if there is anything I can do to help this go through.

Thanks tdas


On Fri, Jun 20, 2014 at 4:38 PM, Tathagata Das 
wrote:

> Jenkins, test this again.
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/566#discussion_r14040312
  
--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -134,22 +144,64 @@ private[streaming]
 class FlumeReceiver(
 host: String,
 port: Int,
-storageLevel: StorageLevel
+storageLevel: StorageLevel,
+enableDecompression: Boolean
   ) extends Receiver[SparkFlumeEvent](storageLevel) with Logging {
 
   lazy val responder = new SpecificResponder(
 classOf[AvroSourceProtocol], new FlumeEventServer(this))
-  lazy val server = new NettyServer(responder, new InetSocketAddress(host, port))
+  var server: NettyServer = null
+
+  private def initServer() = {
+if (enableDecompression) {
+  val channelFactory = new NioServerSocketChannelFactory
+(Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
+  val channelPipelieFactory = new CompressionChannelPipelineFactory()
+  
+  new NettyServer(
+responder, 
+new InetSocketAddress(host, port),
+channelFactory, 
+channelPipelieFactory, 
+null)
+} else {
+  new NettyServer(responder, new InetSocketAddress(host, port))
+}
+  }
 
   def onStart() {
-server.start()
+synchronized {
+  if (server == null) {
+server = initServer()
+server.start()
+  } else {
+logWarning("Flume receiver being asked to start more then once with out close")
+  }
+}
 logInfo("Flume receiver started")
   }
 
   def onStop() {
-server.close()
+synchronized {
+  if (server != null) {
+server.close()
+server = null
+  }
+}
 logInfo("Flume receiver stopped")
   }
 
   override def preferredLocation = Some(host)
 }
+
+private[streaming]
+class CompressionChannelPipelineFactory extends ChannelPipelineFactory {
+
+  def getPipeline() = {
+  val pipeline = Channels.pipeline()
--- End diff --

Formatting issue. 2 space indents required.
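
Aside from the formatting, the onStart/onStop logic quoted above follows an idempotent-lifecycle pattern worth spelling out; a minimal standalone sketch (stub class, not the actual NettyServer):

```scala
// Stub standing in for NettyServer; counts start() calls.
class StubServer {
  var starts = 0
  def start(): Unit = starts += 1
  def close(): Unit = ()
}

class Lifecycle {
  private var server: StubServer = null

  // Create and start the server only once per start/stop cycle; a second
  // onStart is a no-op, mirroring the logWarning branch in the diff.
  def onStart(): StubServer = synchronized {
    if (server == null) { server = new StubServer; server.start() }
    server
  }

  // Close and null out the server so a later onStart builds a fresh one.
  def onStop(): Unit = synchronized {
    if (server != null) { server.close(); server = null }
  }
}

val l = new Lifecycle
val s1 = l.onStart()
val s2 = l.onStart() // no-op: same server instance, started exactly once
assert((s1 eq s2) && s1.starts == 1)
l.onStop()
assert(l.onStart().starts == 1) // fresh server after a stop/start cycle
```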




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1136




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46724443
  
@rxin, re: the former, seems like most implementations signal this as an 
error.




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/566#discussion_r14040366
  
--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -134,22 +144,64 @@ private[streaming]
 class FlumeReceiver(
 host: String,
 port: Int,
-storageLevel: StorageLevel
+storageLevel: StorageLevel,
+enableDecompression: Boolean
   ) extends Receiver[SparkFlumeEvent](storageLevel) with Logging {
 
   lazy val responder = new SpecificResponder(
 classOf[AvroSourceProtocol], new FlumeEventServer(this))
-  lazy val server = new NettyServer(responder, new InetSocketAddress(host, port))
+  var server: NettyServer = null
+
+  private def initServer() = {
+if (enableDecompression) {
+  val channelFactory = new NioServerSocketChannelFactory
+(Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
+  val channelPipelieFactory = new CompressionChannelPipelineFactory()
+  
+  new NettyServer(
+responder, 
+new InetSocketAddress(host, port),
+channelFactory, 
+channelPipelieFactory, 
+null)
+} else {
+  new NettyServer(responder, new InetSocketAddress(host, port))
+}
+  }
 
   def onStart() {
-server.start()
+synchronized {
+  if (server == null) {
+server = initServer()
+server.start()
+  } else {
+logWarning("Flume receiver being asked to start more then once with out close")
+  }
+}
 logInfo("Flume receiver started")
   }
 
   def onStop() {
-server.close()
+synchronized {
+  if (server != null) {
+server.close()
+server = null
+  }
+}
 logInfo("Flume receiver stopped")
   }
 
   override def preferredLocation = Some(host)
 }
+
+private[streaming]
--- End diff --

Can you add comments to this class, explaining what this class does and why 
it is necessary?





[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/566#discussion_r14040386
  
--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -134,22 +144,64 @@ private[streaming]
 class FlumeReceiver(
 host: String,
 port: Int,
-storageLevel: StorageLevel
+storageLevel: StorageLevel,
+enableDecompression: Boolean
   ) extends Receiver[SparkFlumeEvent](storageLevel) with Logging {
 
   lazy val responder = new SpecificResponder(
 classOf[AvroSourceProtocol], new FlumeEventServer(this))
-  lazy val server = new NettyServer(responder, new InetSocketAddress(host, port))
+  var server: NettyServer = null
+
+  private def initServer() = {
+if (enableDecompression) {
+  val channelFactory = new NioServerSocketChannelFactory
+(Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
+  val channelPipelieFactory = new CompressionChannelPipelineFactory()
+  
+  new NettyServer(
+responder, 
+new InetSocketAddress(host, port),
+channelFactory, 
+channelPipelieFactory, 
+null)
+} else {
+  new NettyServer(responder, new InetSocketAddress(host, port))
+}
+  }
 
   def onStart() {
-server.start()
+synchronized {
+  if (server == null) {
+server = initServer()
+server.start()
+  } else {
+logWarning("Flume receiver being asked to start more then once with out close")
+  }
+}
 logInfo("Flume receiver started")
   }
 
   def onStop() {
-server.close()
+synchronized {
+  if (server != null) {
+server.close()
+server = null
+  }
+}
 logInfo("Flume receiver stopped")
   }
 
   override def preferredLocation = Some(host)
 }
+
+private[streaming]
+class CompressionChannelPipelineFactory extends ChannelPipelineFactory {
+
+  def getPipeline() = {
--- End diff --

Just a line of comment saying what pipeline this returns. For Flume noobs like me ;)
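
For readers in the same boat: assuming the pipeline simply prepends a ZlibDecoder so events from a compressing Flume sink are inflated before the Avro handler sees them, the decompression itself is ordinary zlib; a self-contained round-trip with java.util.zip (not the Netty code itself):

```scala
import java.util.zip.{Deflater, Inflater}

// zlib-compress a byte array (what a compressing Flume sink would send).
def deflate(data: Array[Byte]): Array[Byte] = {
  val d = new Deflater()
  d.setInput(data); d.finish()
  val buf = new Array[Byte](data.length + 64)
  val n = d.deflate(buf)
  d.end()
  buf.take(n)
}

// Inflate it back (the job of the ZlibDecoder in the receiver's pipeline).
def inflate(data: Array[Byte], originalLen: Int): Array[Byte] = {
  val i = new Inflater()
  i.setInput(data)
  val out = new Array[Byte](originalLen)
  i.inflate(out)
  i.end()
  out
}

val payload = "flume event body".getBytes("UTF-8")
assert(inflate(deflate(payload), payload.length).sameElements(payload))
```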




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tmalaska
Github user tmalaska commented on a diff in the pull request:

https://github.com/apache/spark/pull/566#discussion_r14040514
  
--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -134,22 +144,64 @@ private[streaming]
 class FlumeReceiver(
 host: String,
 port: Int,
-storageLevel: StorageLevel
+storageLevel: StorageLevel,
+enableDecompression: Boolean
   ) extends Receiver[SparkFlumeEvent](storageLevel) with Logging {
 
   lazy val responder = new SpecificResponder(
 classOf[AvroSourceProtocol], new FlumeEventServer(this))
-  lazy val server = new NettyServer(responder, new InetSocketAddress(host, port))
+  var server: NettyServer = null
+
+  private def initServer() = {
+if (enableDecompression) {
+  val channelFactory = new NioServerSocketChannelFactory
+(Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
+  val channelPipelieFactory = new CompressionChannelPipelineFactory()
+  
+  new NettyServer(
+responder, 
+new InetSocketAddress(host, port),
+channelFactory, 
+channelPipelieFactory, 
+null)
+} else {
+  new NettyServer(responder, new InetSocketAddress(host, port))
+}
+  }
 
   def onStart() {
-server.start()
+synchronized {
+  if (server == null) {
+server = initServer()
+server.start()
+  } else {
+logWarning("Flume receiver being asked to start more then once with out close")
+  }
+}
 logInfo("Flume receiver started")
   }
 
   def onStop() {
-server.close()
+synchronized {
+  if (server != null) {
+server.close()
+server = null
+  }
+}
 logInfo("Flume receiver stopped")
   }
 
   override def preferredLocation = Some(host)
 }
+
+private[streaming]
+class CompressionChannelPipelineFactory extends ChannelPipelineFactory {
+
+  def getPipeline() = {
--- End diff --

Cool will do before the weekend is done.  Thanks




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/566#discussion_r14040594
  
--- Diff: external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala ---
@@ -36,17 +36,27 @@ import org.apache.spark.streaming.StreamingContext
 import org.apache.spark.streaming.dstream._
 import org.apache.spark.Logging
 import org.apache.spark.streaming.receiver.Receiver
+import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
+import org.jboss.netty.channel.ChannelPipelineFactory
+import java.util.concurrent.Executors
+import org.jboss.netty.channel.Channels
+import org.jboss.netty.handler.codec.compression.ZlibDecoder
--- End diff --

Please dedup and sort. See the import style in
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14040631
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
   // This is the only job that uses this stage, so fail the stage if it is running.
   val stage = stageIdToStage(stageId)
   if (runningStages.contains(stage)) {
-taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-val stageInfo = stageToInfos(stage)
-stageInfo.stageFailed(failureReason)
-listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+  val stageInfo = stageToInfos(stage)
+  stageInfo.stageFailed(failureReason)
+  listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+} catch {
+  case e: UnsupportedOperationException =>
+logInfo(s"Could not cancel tasks for stage $stageId", e)
+}
--- End diff --

Hmmm... not sure that I agree.  A job being cancelled, stages being 
cancelled, and tasks being cancelled are all different things.  The expectation 
is that job cancellation will lead to cancellation of independent stages and 
their associated tasks; but if no stages and tasks get cancelled, it's probably 
still worthwhile for the information to be sent that the job itself was 
cancelled.  I expect that eventually all of the backends will support task 
killing, so this whole no-kill path should never be hit.  But moving the job 
cancellation notification within the try-to-cancelTasks block will result in 
multiple notifications that the parent job was cancelled -- one for each 
independent stage cancellation.  Or am I misunderstanding what you are 
suggesting?




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46725214
  
Sorry, Ted, that this has been sitting here for so long. Will get this in
ASAP.
Other than a few nits, it LGTM. :)




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46725494
  
BTW two follow up tickets created:

https://issues.apache.org/jira/browse/SPARK-2225

https://issues.apache.org/jira/browse/SPARK-2226

Let me know if you'd like to work on them.




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46725451
  
There are databases that support that, and it seems to me a very simple 
change (actually just removing the check code you added is probably enough).





[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46725581
  
OK, I wasn't sure if strict Hive compatibility was the goal.  I'm happy to 
take these tickets.  Thanks again!




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14040933
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
   // This is the only job that uses this stage, so fail the stage if it is running.
   val stage = stageIdToStage(stageId)
   if (runningStages.contains(stage)) {
-taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-val stageInfo = stageToInfos(stage)
-stageInfo.stageFailed(failureReason)
-listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+  val stageInfo = stageToInfos(stage)
+  stageInfo.stageFailed(failureReason)
+  listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+} catch {
+  case e: UnsupportedOperationException =>
+logInfo(s"Could not cancel tasks for stage $stageId", e)
+}
--- End diff --

Ah you're right that it doesn't make sense to add that here because it will 
be called for each stage.  My intention was that if the job has running stages 
that don't get cancelled (because the task scheduler doesn't implement 
cancelTasks()), then we should not call job.listener.jobFailed() -- do you 
think that makes sense?  Seems like the way to implement that would be to set a 
boolean flag here if the job can't be successfully cancelled, and then call 
jobFailed() 0 or 1 times at the end of this function depending on that flag.
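
A sketch of that boolean-flag approach (hypothetical helper; the names are illustrative, not the DAGScheduler API):

```scala
// Returns true only if every stage's tasks were actually cancelled; a
// backend without killTask support raises UnsupportedOperationException.
// The caller would then invoke job.listener.jobFailed() at most once,
// after the loop, based on the returned flag.
def cancelAllStages(stageIds: Seq[Int])(cancelTasks: Int => Unit): Boolean = {
  var allCancelled = true
  for (id <- stageIds) {
    try cancelTasks(id)
    catch { case _: UnsupportedOperationException => allCancelled = false }
  }
  allCancelled
}

assert(cancelAllStages(Seq(1, 2))(_ => ()))
assert(!cancelAllStages(Seq(1, 2))(_ => throw new UnsupportedOperationException))
```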




[GitHub] spark pull request: SPARK-1478: Upgrade FlumeInputDStream's FlumeR...

2014-06-20 Thread tmalaska
Github user tmalaska commented on the pull request:

https://github.com/apache/spark/pull/566#issuecomment-46726131
  
No worries.  I'm starting to free up, so I would love to do more work.  I
will finish this one up, then the Flume encryption one.  Then if you have
anything else, let me at it.

Thanks




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46726272
  
I actually did 2225 already. I will assign 2226 to you. Thanks!






[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14041296
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
   // This is the only job that uses this stage, so fail the stage if it is running.
   val stage = stageIdToStage(stageId)
   if (runningStages.contains(stage)) {
-taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-val stageInfo = stageToInfos(stage)
-stageInfo.stageFailed(failureReason)
-listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+  val stageInfo = stageToInfos(stage)
+  stageInfo.stageFailed(failureReason)
+  listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+} catch {
+  case e: UnsupportedOperationException =>
+logInfo(s"Could not cancel tasks for stage $stageId", e)
+}
--- End diff --

What you suggest could be done, but there's a question of whether or not
notification of cancellation of the job should be made regardless of whether
any stages and tasks are successfully cancelled as a consequence.  I don't
really know how to answer that because I don't know how all of the listeners
are using the notification or whether they are all expecting the same semantics.




[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1161

[SPARK-2225] Turn HAVING without GROUP BY into WHERE.

@willb



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark having-filter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1161


commit fa8359ae2b0b4f025582e033a3338d1492642c90
Author: Reynold Xin 
Date:   2014-06-20T21:10:55Z

[SPARK-2225] Turn HAVING without GROUP BY into WHERE.






[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14041515
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
   // This is the only job that uses this stage, so fail the stage if it is running.
   val stage = stageIdToStage(stageId)
   if (runningStages.contains(stage)) {
-taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-val stageInfo = stageToInfos(stage)
-stageInfo.stageFailed(failureReason)
-listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+  val stageInfo = stageToInfos(stage)
+  stageInfo.stageFailed(failureReason)
+  listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+} catch {
+  case e: UnsupportedOperationException =>
+logInfo(s"Could not cancel tasks for stage $stageId", e)
+}
--- End diff --

@pwendell what do you think here?




[GitHub] spark pull request: SPARK-2168 Spark core

2014-06-20 Thread elyast
GitHub user elyast opened a pull request:

https://github.com/apache/spark/pull/1160

SPARK-2168 Spark core

Removing the full URI, leaving only the relative path in the link to the
completed application, plus a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/elyast/spark branch-1.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1160.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1160


commit a362f359bcbd97d4c4cc4b3c21e56057e762be5d
Author: Lukasz Jastrzebski 
Date:   2014-06-19T19:22:38Z

fixed links on history page

commit dc1f70a58ea8babbd3a75ab57ff272415470d3eb
Author: Lukasz Jastrzebski 
Date:   2014-06-20T21:09:48Z

adding test for history server suite






[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1161#issuecomment-46727408
  
LGTM; this is basically exactly what I did 
(https://github.com/willb/spark/commit/b272f6be925ba50741e0a5093244926ea4a7a9a8)




[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1161#issuecomment-46727465
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1161#issuecomment-46727480
  
Merged build started. 




[GitHub] spark pull request: SPARK-2168 Spark core

2014-06-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1160#issuecomment-46727469
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-1749] Job cancellation when SchedulerBa...

2014-06-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/686#discussion_r14041700
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1062,10 +1062,15 @@ class DAGScheduler(
           // This is the only job that uses this stage, so fail the stage if it is running.
           val stage = stageIdToStage(stageId)
           if (runningStages.contains(stage)) {
-            taskScheduler.cancelTasks(stageId, shouldInterruptThread)
-            val stageInfo = stageToInfos(stage)
-            stageInfo.stageFailed(failureReason)
-            listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+            try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
+              taskScheduler.cancelTasks(stageId, shouldInterruptThread)
+              val stageInfo = stageToInfos(stage)
+              stageInfo.stageFailed(failureReason)
+              listenerBus.post(SparkListenerStageCompleted(stageToInfos(stage)))
+            } catch {
+              case e: UnsupportedOperationException =>
+                logInfo(s"Could not cancel tasks for stage $stageId", e)
+            }
--- End diff --

Do we have the meaning of all the listener events fully documented 
someplace?  Or perhaps that needs to be done in a separate PR and then 
DAGScheduler updated to match the documented expectation?
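[Editor's note] The control flow under discussion — an optional backend capability
signalled via `UnsupportedOperationException`, which the caller catches so the
scheduler can log and move on — can be sketched outside Spark as a minimal Java
example. The class and method names (`NoKillScheduler`, `failStage`) are
hypothetical; only the shape of the try/catch mirrors the diff above:

```java
import java.util.ArrayList;
import java.util.List;

interface TaskScheduler {
    // Backends that cannot kill running tasks may throw
    // UnsupportedOperationException from cancelTasks.
    void cancelTasks(int stageId);
}

class NoKillScheduler implements TaskScheduler {
    public void cancelTasks(int stageId) {
        throw new UnsupportedOperationException("killTask not implemented");
    }
}

public class CancelDemo {
    static final List<String> log = new ArrayList<>();

    // Mirrors the diff: the stage-failed bookkeeping and the listener event
    // are inside the try, so they are skipped when the backend cannot kill
    // tasks; the exception is only logged.
    static void failStage(TaskScheduler scheduler, int stageId) {
        try { // cancelTasks will fail if the backend does not implement killTask
            scheduler.cancelTasks(stageId);
            log.add("stage " + stageId + " marked failed");
            log.add("SparkListenerStageCompleted posted");
        } catch (UnsupportedOperationException e) {
            log.add("Could not cancel tasks for stage " + stageId);
        }
    }

    public static void main(String[] args) {
        failStage(new NoKillScheduler(), 7);
        System.out.println(String.join("; ", log));
    }
}
```

Note that with this placement a backend without `killTask` never posts the
`SparkListenerStageCompleted` event, which is exactly the listener-semantics
question raised in the comment above.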



