[GitHub] spark pull request: [SPARK-2678][Core] Prevents `spark-submit` fro...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1699#issuecomment-50854779
  
Hey @liancheng, I just spoke with @mateiz offline about this for a while. He 
had feedback on a few things, which I'll summarize here. There are a couple of 
orthogonal issues going on here. Here are his suggestions:

1. _Scripts like start-thriftserver.sh should expose one coherent set of 
options rather than distinguishing between spark-submit options and other 
options._ The idea is to make this more convenient and less confusing for 
users. This means the scripts would have to match on the more specific options 
and make sure those are delivered separately to `spark-submit`. This makes the 
internals more complicated, but it's simpler for users. Also, because we 
control the total set of options for internal tools, we can make sure there 
are no naming collisions.

2. _If we implement `--` as a divider, it should be fully backwards 
compatible._ For instance, we need to support users who were doing this in 
Spark 1.0:

```
./spark-submit --master local myJar.jar --userOpt a --userOpt b -- --userOpt c
```

i.e. user programs that used `--` in their own options. The way to fully 
support this is to make the use of `--` mutually exclusive with specifying a 
primary resource. So this means a user can _either_ do:

```
./spark-submit --master local --jars myJar.jar -- --userOpt a --userOpt b
```

or they can do:

```
./spark-submit --master local myJar.jar -- --userOpt a --userOpt b
```

So, basically, when the parser reaches an unrecognized argument (which we 
assume to be the primary resource), we always treat the rest of the list as 
user options, even if those options happen to include a `--`.
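The rule above can be sketched as follows. This is only an illustration of the proposed splitting behavior, not the actual `SparkSubmitArguments` implementation; the names `KNOWN_FLAGS` and `split_args` are hypothetical.

```python
# Hypothetical sketch of the proposed argument-splitting rule for
# spark-submit. KNOWN_FLAGS is a tiny illustrative subset of the flags
# spark-submit itself understands.
KNOWN_FLAGS = {"--master", "--jars", "--class"}

def split_args(args):
    """Split argv into (submit_args, primary_resource, user_args)."""
    submit_args, i = [], 0
    while i < len(args):
        tok = args[i]
        if tok == "--":
            # `--` divider: everything after it belongs to the user program.
            # Only reachable while no primary resource has been seen yet.
            return submit_args, None, args[i + 1:]
        if tok in KNOWN_FLAGS:
            submit_args += args[i:i + 2]  # flag plus its value
            i += 2
        else:
            # The first unrecognized token is assumed to be the primary
            # resource; everything after it (even a literal `--`) is passed
            # through untouched as user options.
            return submit_args, tok, args[i + 1:]
    return submit_args, None, []

# Spark 1.0 style: resource first, so `--` stays inside the user options.
print(split_args(["--master", "local", "myJar.jar",
                  "--userOpt", "a", "--", "--userOpt", "c"]))
```

With `--jars myJar.jar -- ...` instead, the `--` branch fires and there is no primary resource, matching the "mutually exclusive" rule.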


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1713#issuecomment-50854587
  
QA tests have started for PR 1713. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17657/consoleFull




[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50854552
  
QA results for PR 1598:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class DataType(object):
  class PrimitiveType(DataType):
  class StringType(PrimitiveType):
  class BinaryType(PrimitiveType):
  class BooleanType(PrimitiveType):
  class TimestampType(PrimitiveType):
  class DecimalType(PrimitiveType):
  class DoubleType(PrimitiveType):
  class FloatType(PrimitiveType):
  class ByteType(PrimitiveType):
  class IntegerType(PrimitiveType):
  class LongType(PrimitiveType):
  class ShortType(PrimitiveType):
  class ArrayType(DataType):
  class MapType(DataType):
  class StructField(DataType):
  class StructType(DataType):
  class List(list):
  class Dict(dict):
  class Row(tuple):
  class Row(tuple):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17648/consoleFull




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15684100
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
--- End diff --

Removing close() now requires a very minor refactor of ExternalSorter for 
the `objectsWritten == 0` case -- I'd rather not risk it in this PR.




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-50854272
  
QA tests have started for PR 855. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17656/consoleFull




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-07-31 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50854265
  
Hi, @javadba.
My proposal is as follows:

a) `CharLength` expression = code points; use a `char_length` function in the 
parser, with `length` as a synonym of `char_length` for Hive compatibility
b) `OctetLength` expression = bytes, showing the amount of byte storage for a 
string; use an `octet_length` function in the parser

Standard SQL defines the `char_length` and `octet_length` functions.
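For intuition on the distinction being proposed, character length counts code points while octet length counts the bytes a given encoding needs. This is plain Python used as an illustration, not Spark SQL code; the function names just mirror the SQL names.

```python
# Illustration of char_length (code points) vs. octet_length (bytes).
# Plain Python, not Spark SQL; names mirror the standard SQL functions.

def char_length(s: str) -> int:
    # Python 3 strings are sequences of Unicode code points.
    return len(s)

def octet_length(s: str, encoding: str = "utf-8") -> int:
    # Byte storage depends on the encoding used.
    return len(s.encode(encoding))

print(char_length("naïve"))   # 5 code points
print(octet_length("naïve"))  # 6 bytes in UTF-8, since 'ï' takes two
```

The two only diverge for non-ASCII data, which is exactly why Standard SQL defines both.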




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50854138
  
@witgo Could you update the pom to exclude `commons-math3` from the 
dependencies? I tried it locally and LBFGS works well. It should be safe to 
remove `commons-math3`. As for scalamacros, catalyst also depends on it. My 
diff is at:


https://github.com/mengxr/spark/compare/apache:master...mengxr:breeze-deps?expand=1
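A Maven exclusion of this shape is what the comment is asking for. This is a hedged sketch, not the actual diff linked above: the `org.scalanlp`/`breeze` and `org.apache.commons`/`commons-math3` coordinates are the real ones, but the surrounding pom structure and version placement are illustrative only.

```xml
<!-- Hedged sketch: excluding commons-math3 from the breeze dependency.
     Surrounding pom structure is illustrative, not the actual patch. -->
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_${scala.binary.version}</artifactId>
  <version>0.8.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```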




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15683958
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(

   override def isOpen: Boolean = objOut != null

-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
     if (initialized) {
       // NOTE: Because Kryo doesn't flush the underlying stream we explicitly flush both the
       //       serializer stream and the lower level stream.
       objOut.flush()
       bs.flush()
-      val prevPos = lastValidPosition
-      lastValidPosition = channel.position()
-      lastValidPosition - prevPos
-    } else {
-      // lastValidPosition is zero if stream is uninitialized
-      lastValidPosition
+      close()
     }
+    finalPosition = file.length()
   }

-  override def revertPartialWrites() {
-    if (initialized) {
-      // Discard current writes. We do this by flushing the outstanding writes and
-      // truncate the file to the last valid position.
-      objOut.flush()
-      bs.flush()
-      channel.truncate(lastValidPosition)
+  // Discard current writes. We do this by flushing the outstanding writes and then
+  // truncating the file to its initial position.
+  override def revertPartialWritesAndClose() {
+    try {
+      if (initialized) {
+        objOut.flush()
+        bs.flush()
+        close()
+      }
+
+      val truncateStream = new FileOutputStream(file, true)
+      try {
+        truncateStream.getChannel.truncate(initialPosition)
+      } finally {
+        truncateStream.close()
+      }
+    } catch {
+      case e: Exception =>
+        logError("Uncaught exception while reverting partial writes to file " + file, e)
--- End diff --

I'm not certain I understand. The situation I am imagining is that we 
commit to the first Writer, then the second one fails. In HashShuffleWriter, we 
will then call revertPartialWritesAndClose() on all Writers, causing us to 
revert all the changes back to "initialPosition", which should revert even the 
committed data.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50853874
  
Thanks @pwendell and @andrewor14 for your continued reviews.

10 seconds sounds fine to me.  Not that it's a shining beacon of 
performance, but MapReduce actually uses task->application master heartbeats in 
exactly the same way, i.e. it doesn't rely on them for starting or stopping 
tasks. MR AMs will actually receive heartbeats more frequently than Spark 
drivers, as there's one per task instead of one per executor.  I just checked, 
and the interval there is 3 seconds.

It might be best to base the interval on the number of executors, but 
that's probably work for a separate patch.




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50853746
  
QA results for PR 1290:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  abstract class GeneralizedSteepestDescendModel(val weights: Vector )
  trait ANN {
  class LeastSquaresGradientANN(
  class ANNUpdater extends Updater {
  class ParallelANN (

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-50853616
  
QA tests have started for PR 1464. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17655/consoleFull




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-50853502
  
QA results for PR 855:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17654/consoleFull




[GitHub] spark pull request: [SPARK-2756] [mllib] Decision tree bug fixes

2014-07-31 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1673#issuecomment-50853476
  
@manishamde  I'll work on some regression testing.




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-50853335
  
QA tests have started for PR 855. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17654/consoleFull




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1713#issuecomment-50853374
  
QA results for PR 1713:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class Statistics(object):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17653/consoleFull




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-07-31 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-50853289
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1713#issuecomment-50853328
  
QA tests have started for PR 1713. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17653/consoleFull




[GitHub] spark pull request: [Spark 2557] fix LOCAL_N_REGEX in createTaskSc...

2014-07-31 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1464#issuecomment-50853278
  
This seems totally reasonable, apologies for not seeing this. LGTM once 
Jenkins passes.




[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

2014-07-31 Thread dorx
GitHub user dorx opened a pull request:

https://github.com/apache/spark/pull/1713

[SPARK-2786][mllib] Python correlations



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dorx/spark pythonCorrelation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1713


commit cd163d6e8d3535d8dac1dd8321831092ef52c995
Author: Doris Xin 
Date:   2014-07-29T18:44:07Z

WIP

commit d199f1fc84113c0eb541f895ea04ae57fdd77de2
Author: Doris Xin 
Date:   2014-07-29T21:20:36Z

Moved correlation names into a public object

commit 9141a637770f62eae950d28013537a7b1812229f
Author: Doris Xin 
Date:   2014-07-31T19:24:15Z

WIP2

commit cc9f725ec54ae294bec9caf0986b9130cc407a7d
Author: Doris Xin 
Date:   2014-08-01T06:00:09Z

units passed.

commit eb5bf5692d63226bf71777b48c63c18fde4f38d4
Author: Doris Xin 
Date:   2014-08-01T06:11:57Z

merge master

commit e69d4462c394eb11d9b7f8b398666ce36c026dec
Author: Doris Xin 
Date:   2014-08-01T06:21:45Z

fixed missed conflicts.






[GitHub] spark pull request: [SPARK-1812] upgrade dependency to scala-loggi...

2014-07-31 Thread avati
Github user avati commented on a diff in the pull request:

https://github.com/apache/spark/pull/1701#discussion_r15683518
  
--- Diff: sql/core/pom.xml ---
@@ -83,6 +83,16 @@
       <artifactId>scalacheck_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.typesafe.scala-logging</groupId>
+      <artifactId>scala-logging-slf4j_${scala.binary.version}</artifactId>
+      <version>2.1.2</version>
+    </dependency>
+    <dependency>
+      <groupId>com.typesafe.scala-logging</groupId>
+      <artifactId>scala-logging-api_${scala.binary.version}</artifactId>
--- End diff --

scalalogging-slf4j 1.0.1 seems to have been split into scala-logging-{slf4j,api} 
in 2.1.2. Previously, scalalogging was missing from sql/core's pom.xml (usually 
not a problem unless you run mvn compile within the module).




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-07-31 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50852470
  
Jenkins, test this again.




[GitHub] spark pull request: [SPARK-2103][Streaming] Change to ClassTag for...

2014-07-31 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1508#issuecomment-50852462
  
OK, this sequence of two results is very confusing. Let me run the tests 
again. 






[GitHub] spark pull request: [SPARK-1812] upgrade dependency to scala-loggi...

2014-07-31 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/1701#discussion_r15683473
  
--- Diff: sql/core/pom.xml ---
@@ -83,6 +83,16 @@
       <artifactId>scalacheck_${scala.binary.version}</artifactId>
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>com.typesafe.scala-logging</groupId>
+      <artifactId>scala-logging-slf4j_${scala.binary.version}</artifactId>
+      <version>2.1.2</version>
+    </dependency>
+    <dependency>
+      <groupId>com.typesafe.scala-logging</groupId>
+      <artifactId>scala-logging-api_${scala.binary.version}</artifactId>
--- End diff --

Can you explain why we need the second artifact?




[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1513#issuecomment-50852319
  
QA results for PR 1513:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17650/consoleFull




[GitHub] spark pull request: [SPARK-1812] upgrade dependency to scala-loggi...

2014-07-31 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1701#issuecomment-50852258
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...

2014-07-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1651




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1635




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50852101
  
QA tests have started for PR 1369. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17651/consoleFull




[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1513#issuecomment-50852097
  
QA tests have started for PR 1513. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17650/consoleFull




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50851936
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2608] fix executor backend launch commo...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1513#issuecomment-50851919
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2179][SQL] A minor refactoring Java dat...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1712#issuecomment-50851677
  
QA results for PR 1712:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17643/consoleFull




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15683224
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
--- End diff --

When I merged the sort patch and modified EAOM, it was simply a matter of 
replacing close with commitAndClose.
commitAndClose should actually be semantically equivalent to close.
It is not equivalent to commit() - but we want to remove that :-)




[GitHub] spark pull request: SPARK-2201 Improve FlumeInputDStream's stabili...

2014-07-31 Thread joyyoj
Github user joyyoj commented on the pull request:

https://github.com/apache/spark/pull/1310#issuecomment-50851611
  
Sorry, I'll send a PR soon. 
The problem with the original implementation is that the config (host:port) is 
static and allows only one host:port. Once the host or port changes, the flume 
agent has to be restarted to reload the conf.
To solve this, one approach is to set a virtual address instead of a real 
address in the flume conf. Alongside it, an address router is introduced that 
can tell us all the real addresses bound to a virtual address, and notify us 
of events such as a real address being added to or removed from the virtual 
address.
I found the router can be easily implemented with ZooKeeper. In such a 
scenario:
1. A Spark receiver selects a free port and creates a temporary node with the 
path /path/to/logicalhost/host:port in ZooKeeper when it starts. 
If three receivers started, three nodes (host1:port1, host2:port2, 
host3:port3) will be created under /path/to/logicalhost;
2. On the flume agent side, the flume sink gets the children nodes 
(host1:port1, host2:port2, host3:port3) from /path/to/logicalhost and buffers 
them in a ClientPool.
When append is called, it selects a client from the ClientPool in a round-robin 
manner and calls client.append to send events.
3. If any receiver crashes/starts, the temporary zk node will be 
removed/added, and the ClientPool will then remove/add the corresponding 
client, since it watches those zk children events.
In my implementation:
LogicalHostRouter is the implementation of the address router. Note that Spark 
and flume should not need to know about the existence of zk. 
The ZkProxy is an encapsulation of the zk curator client.
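The round-robin selection in step 2 can be sketched as below. This is an illustration of the idea, not the actual patch: the `ClientPool` name mirrors the comment, `add`/`remove`/`pick` are hypothetical method names, and the ZooKeeper watch wiring that would drive `add`/`remove` is omitted.

```python
# Hedged sketch of the ClientPool round-robin selection described in step 2.
# The zk children watch would call add()/remove(); here they are invoked
# directly for illustration.
import threading

class ClientPool:
    def __init__(self, addresses):
        self._lock = threading.Lock()
        self._addresses = list(addresses)
        self._next = 0

    def add(self, addr):
        # Called when a temporary zk child node appears (receiver started).
        with self._lock:
            if addr not in self._addresses:
                self._addresses.append(addr)

    def remove(self, addr):
        # Called when a temporary zk child node disappears (receiver crashed).
        with self._lock:
            if addr in self._addresses:
                self._addresses.remove(addr)
                self._next = 0  # reset the cursor so it stays in range

    def pick(self):
        """Select the next receiver address in round-robin order."""
        with self._lock:
            addr = self._addresses[self._next % len(self._addresses)]
            self._next += 1
            return addr

pool = ClientPool(["host1:1111", "host2:2222", "host3:3333"])
print([pool.pick() for _ in range(4)])  # wraps back to host1 on the 4th pick
```

A real implementation would also need to handle the pool becoming empty when all receivers are down.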




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15683205
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
     if (initialized) {
       // NOTE: Because Kryo doesn't flush the underlying stream we explicitly flush both the
       //       serializer stream and the lower level stream.
       objOut.flush()
       bs.flush()
-      val prevPos = lastValidPosition
-      lastValidPosition = channel.position()
-      lastValidPosition - prevPos
-    } else {
-      // lastValidPosition is zero if stream is uninitialized
-      lastValidPosition
+      close()
     }
+    finalPosition = file.length()
   }
 
-  override def revertPartialWrites() {
-    if (initialized) {
-      // Discard current writes. We do this by flushing the outstanding writes and
-      // truncate the file to the last valid position.
-      objOut.flush()
-      bs.flush()
-      channel.truncate(lastValidPosition)
+  // Discard current writes. We do this by flushing the outstanding writes and then
+  // truncating the file to its initial position.
+  override def revertPartialWritesAndClose() {
+    try {
+      if (initialized) {
+        objOut.flush()
+        bs.flush()
+        close()
+      }
+
+      val truncateStream = new FileOutputStream(file, true)
+      try {
+        truncateStream.getChannel.truncate(initialPosition)
+      } finally {
+        truncateStream.close()
+      }
+    } catch {
+      case e: Exception =>
+        logError("Uncaught exception while reverting partial writes to file " + file, e)
--- End diff --

I meant the former case: close on a writer fails with an exception while 
earlier streams succeeded.
So now we have some writers which have committed data (which is not removed 
by the subsequent revert) while others are reverted.

On the face of it, I agree it should not cause issues; but since the 
expectation from this class is never enforced, it can silently fail.
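The revert contract being debated reduces to "flush whatever is buffered, close, then truncate the file back to where this writer started". A minimal stand-alone sketch of that contract with plain file I/O (not the Spark classes; `revert_partial_writes` is a hypothetical name):

```python
import os
import tempfile


def revert_partial_writes(path, initial_position):
    """Discard everything appended after initial_position by truncating.

    Mirrors the contract in the diff: safe to call even after the
    writer's streams are already closed, since it reopens the file.
    """
    with open(path, "r+b") as f:
        f.truncate(initial_position)


# Simulate a file shared by consecutive writers:
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"committed")          # data from earlier, committed writers
initial_position = os.path.getsize(path)
with open(path, "ab") as f:
    f.write(b"partial-garbage")    # this writer's uncommitted output

revert_partial_writes(path, initial_position)
```

The mailing-list discussion is about what happens when revert is invoked on a writer whose close already succeeded; in this sketch that is a no-op because the truncation point equals the file length.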




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-31 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50851433
  
LGTM




[GitHub] spark pull request: [WIP][SPARK-2316] Avoid O(blocks) operations i...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50851416
  
QA results for PR 1679:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17644/consoleFull




[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1651#issuecomment-50851382
  
Okay I've merged this - thanks HY




[GitHub] spark pull request: SPARK-2711. Create a ShuffleMemoryManager to t...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1707#issuecomment-50851286
  
QA results for PR 1707:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull




[GitHub] spark pull request: SPARK-983. Support external sorting in sortByK...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1677#issuecomment-50851225
  
Go for it




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50851243
  
That sounds good to me, but I'm not familiar with the tasks related to Scala 
2.11. Please continue the discussion on 
https://issues.apache.org/jira/browse/SPARK-1812




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50851118
  
QA tests have started for PR 1290. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17649/consoleFull




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-31 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50851021
  
Thanks a lot! I have added the extension now.




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50850842
  
> But is it needed for the v1.1 release? Spark v1.1 doesn't support Scala
> 2.11.
>

No, I guess not. I didn't realize Spark 1.1 had not yet been branched for
release. Should I move the Scala 2.11-related dependency changes into a
separate build profile, then?




[GitHub] spark pull request: [SPARK-2010] [PySpark] [SQL] support nested st...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1598#issuecomment-50850668
  
QA tests have started for PR 1598. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17648/consoleFull




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50850626
  
But is it needed for the v1.1 release? Spark v1.1 doesn't support Scala 
2.11.




[GitHub] spark pull request: [SPARK-2635] Fix race condition at SchedulerBa...

2014-07-31 Thread li-zhihui
Github user li-zhihui commented on the pull request:

https://github.com/apache/spark/pull/1525#issuecomment-50850584
  
@tgravescs @kayousterhout can you close this PR before the code freeze of the 
1.1 release? Otherwise it would result in an incompatible configuration 
property name, because the PR renames 
spark.scheduler.maxRegisteredExecutorsWaitingTime to 
spark.scheduler.maxRegisteredResourcesWaitingTime
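One common way to make such a rename backwards compatible (a hypothetical sketch, not SparkConf's actual API) is to read the new property and fall back to the deprecated one with a warning:

```python
import warnings


def get_wait_time(conf):
    """Read the renamed property, falling back to the old name.

    The key names are taken from the comment above; the helper itself
    and its default are illustrative, not part of Spark.
    """
    new_key = "spark.scheduler.maxRegisteredResourcesWaitingTime"
    old_key = "spark.scheduler.maxRegisteredExecutorsWaitingTime"
    if new_key in conf:
        return conf[new_key]
    if old_key in conf:
        # Honor old configs, but nudge users toward the new name.
        warnings.warn("%s is deprecated; use %s" % (old_key, new_key))
        return conf[old_key]
    return 30000  # default wait in milliseconds (illustrative)


value = get_wait_time(
    {"spark.scheduler.maxRegisteredExecutorsWaitingTime": 10000})
```

With a fallback like this, renaming the property does not have to block on a release boundary, since old configurations keep working.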




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50850441
  
> Yes, it is already a problem with breeze 0.7. But we didn't realize that
> hadoop 2.3 depends on commons-math3 in the Spark v1.0 release. If there is
> a way to avoid including commons-math3, we should do that.
>

I think that sounds like a problem orthogonal to 0.7 vs 0.8.1. Upgrading
to 0.8.1 will go a long way towards the Scala 2.11 port, without worsening
the commons-math3 issue. Thoughts?




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50850308
  
Yes, it is already a problem with breeze 0.7. But we didn't realize that 
Hadoop 2.3 depends on commons-math3 in the Spark v1.0 release. If there is a 
way to avoid including commons-math3, we should do that.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50850298
  
QA results for PR 1056:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class SparkListenerExecutorMetricsUpdate(
  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17646/consoleFull




[GitHub] spark pull request: [SPARK-1812] sql/catalyst - remove scala.NotNu...

2014-07-31 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/1709#issuecomment-50850247
  
> These tests shouldn't be using scala NotNull, but catalyst's .notNull.
>

Hmm, now that I look more carefully, what you say makes sense. I am trying to
figure out why notNull (catalyst's) was not found when compiling with Scala
2.11. The missing scala.NotNull in 2.11 was just a close coincidence, I
think.




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50850262
  
QA results for PR 1635:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17642/consoleFull




[GitHub] spark pull request: [SPARK-1997] update breeze to version 0.8.1

2014-07-31 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/940#issuecomment-50849972
  
@mengxr looking at the dependency graphs of breeze 0.7 and 0.8.1, it 
appears that both versions depend on commons-math3:3.2. If Hadoop 
2.3 and 2.4 depend on commons-math3:3.1.1, then it is already a problem with 
breeze 0.7 itself, no? Am I overlooking something?




[GitHub] spark pull request: [SPARK-2781][SQL] Check resolution of LogicalP...

2014-07-31 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-50849916
  
Sure, fixed it.




[GitHub] spark pull request: [SPARK-1812] sql/catalyst - remove scala.NotNu...

2014-07-31 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1709#issuecomment-50849795
  
These tests shouldn't be using scala NotNull, but catalyst's 
[.notNull](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L199).




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682607
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleWriter.scala ---
@@ -120,8 +121,7 @@ private[spark] class HashShuffleWriter[K, V](
   private def revertWrites(): Unit = {
     if (shuffle != null && shuffle.writers != null) {
       for (writer <- shuffle.writers) {
-        writer.revertPartialWrites()
-        writer.close()
+        writer.revertPartialWritesAndClose()
--- End diff --

Revert actually doesn't throw, per its (updated) comment.




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682605
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
--- End diff --

Absolutely -- I did not do that in this patch because ExternalAppendOnlyMap 
did a close without a commit, which is a fix outside of the scope of this PR, 
but definitely one that should be made.




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682590
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
     if (initialized) {
       // NOTE: Because Kryo doesn't flush the underlying stream we explicitly flush both the
       //       serializer stream and the lower level stream.
       objOut.flush()
       bs.flush()
-      val prevPos = lastValidPosition
-      lastValidPosition = channel.position()
-      lastValidPosition - prevPos
-    } else {
-      // lastValidPosition is zero if stream is uninitialized
-      lastValidPosition
+      close()
     }
+    finalPosition = file.length()
   }
 
-  override def revertPartialWrites() {
-    if (initialized) {
-      // Discard current writes. We do this by flushing the outstanding writes and
-      // truncate the file to the last valid position.
-      objOut.flush()
-      bs.flush()
-      channel.truncate(lastValidPosition)
+  // Discard current writes. We do this by flushing the outstanding writes and then
+  // truncating the file to its initial position.
+  override def revertPartialWritesAndClose() {
+    try {
+      if (initialized) {
+        objOut.flush()
+        bs.flush()
+        close()
+      }
+
+      val truncateStream = new FileOutputStream(file, true)
+      try {
+        truncateStream.getChannel.truncate(initialPosition)
+      } finally {
+        truncateStream.close()
+      }
+    } catch {
+      case e: Exception =>
+        logError("Uncaught exception while reverting partial writes to file " + file, e)
--- End diff --

Closed streams should not inherently throw (since we check `initialized` 
before flushing and closing). However, we may be left with leftover data, as 
you said. I don't see a way to prevent the possibility of that occurring, but 
it should be possible to recover if users only rely on the returned 
fileSegment().
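The recovery argument can be illustrated concretely: as long as readers consume only the committed (offset, length) segment, stale bytes left past the truncation point are invisible. A sketch with plain files (`FileSegment` here is a hypothetical stand-in for Spark's class, not its real definition):

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for Spark's FileSegment(file, offset, length).
FileSegment = namedtuple("FileSegment", ["path", "offset", "length"])


def read_segment(segment):
    """Readers trust only the committed segment, not the file size."""
    with open(segment.path, "rb") as f:
        f.seek(segment.offset)
        return f.read(segment.length)


fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"AAAA")               # committed by writer 1
segment = FileSegment(path, 0, 4)  # what a successful commit would report
with open(path, "ab") as f:
    f.write(b"????")               # leftover from a failed, reverted writer

data = read_segment(segment)       # leftover bytes are never observed
```

The file is longer than the committed segment, yet any consumer that goes through the segment sees only the committed bytes, which is the basis of the "recoverable" claim above.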




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-31 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15682567
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * :: DeveloperApi ::
+ * StreamingRegression implements methods for training
+ * a linear regression model on streaming data, and using it
+ * for prediction on streaming data.
+ *
+ * This class takes as type parameters a GeneralizedLinearModel,
+ * and a GeneralizedLinearAlgorithm, making it easy to extend to construct
+ * streaming versions of arbitrary regression analyses. For example usage,
+ * see StreamingLinearRegressionWithSGD.
+ *
+ */
+@DeveloperApi
+@Experimental
+abstract class StreamingRegression[
+    M <: GeneralizedLinearModel,
+    A <: GeneralizedLinearAlgorithm[M]] extends Logging {
+
+  /** The model to be updated and used for prediction. */
+  var model: M
+
+  /** The algorithm to use for updating. */
+  val algorithm: A
+
+  /** Return the latest model. */
+  def latest(): M = {
+    model
+  }
+
+  /**
+   * Update the model by training on batches of data from a DStream.
+   * This operation registers a DStream for training the model,
+   * and updates the model based on every subsequent non-empty
+   * batch of data from the stream.
+   *
+   * @param data DStream containing labeled data
+   */
+  def trainOn(data: DStream[LabeledPoint]) {
+    data.foreachRDD { rdd =>
+      if (rdd.count() > 0) {
+        model = algorithm.run(rdd, model.weights)
+        logInfo("Model updated")
+      }
+      logInfo("Current model: weights, %s".format(model.weights.toString))
+      logInfo("Current model: intercept, %s".format(model.intercept.toString))
--- End diff --

Ok, good points, agreed it's safer. I'll make sure there's a note about 
this.
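The trainOn pattern in the diff (skip empty batches, warm-start each refit from the current weights) can be illustrated outside Spark with a toy least-squares SGD; all names, the learning rate, and the scalar model are illustrative assumptions, not MLlib code:

```python
def sgd_step(weights, batch, lr=0.1):
    """One pass of least-squares SGD over a batch of (x, y) pairs."""
    w = weights
    for x, y in batch:
        w = w - lr * (w * x - y) * x   # gradient of 0.5 * (w*x - y)^2
    return w


def train_on(batches, weights=0.0):
    """Mimic trainOn: skip empty batches, warm-start from current model."""
    for batch in batches:
        if batch:                      # analogous to rdd.count() > 0
            weights = sgd_step(weights, batch)
    return weights


# Stream of mini-batches drawn from y = 2x; the empty batch is skipped,
# just as trainOn skips empty RDDs.
batches = [[(1.0, 2.0), (2.0, 4.0)], [], [(1.0, 2.0), (3.0, 6.0)]] * 20
w = train_on(batches)  # converges toward the true slope 2.0
```

The key property mirrored here is that each batch update starts from the weights produced by the previous batch, so the model accumulates information across the stream rather than refitting from scratch.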




[GitHub] spark pull request: [SPARK-2781] Check resolution of LogicalPlans ...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1706#issuecomment-50849459
  
@staple can you add `[SQL]` to the title of this PR? That way it gets 
filtered properly by our internal sorting tools.




[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1698#issuecomment-50849409
  
Your implementation calls `reduceByKey` and `cartesian`. Those are not 
cheap operations. `map(x => (1, x)).reduceByKey` is the same as 
`reduce`, except that it reduces to some executor instead of the driver. Then 
`cartesian` is the same as `broadcast`, but `broadcast` is more efficient with 
TorrentBroadcast. You can compare the performance and see the difference. 
`OnlineSummarizer` also uses a more accurate approach to compute the variance.
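The "more accurate approach" is a numerically stable one-pass update in the style of Welford's algorithm, which avoids the cancellation error of the naive E[x^2] - E[x]^2 formula. A minimal sketch of the idea (not MLlib's actual API):

```python
class OnlineSummarizer:
    """Welford-style one-pass mean/variance for a single column.

    A sketch of the idea behind a streaming summarizer; per-partition
    instances of this kind of accumulator can also be merged, which is
    what makes the single-pass distributed computation possible.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance; zero for fewer than two samples.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


s = OnlineSummarizer()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    s.add(x)
```

Because the update tracks deviations from the running mean rather than raw sums of squares, it stays accurate even when the variance is tiny relative to the mean.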




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682457
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
--- End diff --

We should remove close from the interface, and make it private to this 
class btw.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50849206
  
Sandy - I took a pass on this. Mostly minor comments, but I did propose 
lowering the default frequency from 2 seconds. Overall this is looking in good 
shape.




[GitHub] spark pull request: [SPARK-2777][MLLIB] change ALS factors storage...

2014-07-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1700




[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1710




[GitHub] spark pull request: SPARK-2766: ScalaReflectionSuite throw an lleg...

2014-07-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1683




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682412
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -147,28 +147,36 @@ private[spark] class DiskBlockObjectWriter(
 
   override def isOpen: Boolean = objOut != null
 
-  override def commit(): Long = {
+  override def commitAndClose(): Unit = {
     if (initialized) {
       // NOTE: Because Kryo doesn't flush the underlying stream we explicitly flush both the
       //       serializer stream and the lower level stream.
       objOut.flush()
       bs.flush()
-      val prevPos = lastValidPosition
-      lastValidPosition = channel.position()
-      lastValidPosition - prevPos
-    } else {
-      // lastValidPosition is zero if stream is uninitialized
-      lastValidPosition
+      close()
     }
+    finalPosition = file.length()
   }
 
-  override def revertPartialWrites() {
-    if (initialized) {
-      // Discard current writes. We do this by flushing the outstanding writes and
-      // truncate the file to the last valid position.
-      objOut.flush()
-      bs.flush()
-      channel.truncate(lastValidPosition)
+  // Discard current writes. We do this by flushing the outstanding writes and then
+  // truncating the file to its initial position.
+  override def revertPartialWritesAndClose() {
+    try {
+      if (initialized) {
+        objOut.flush()
+        bs.flush()
+        close()
+      }
+
+      val truncateStream = new FileOutputStream(file, true)
+      try {
+        truncateStream.getChannel.truncate(initialPosition)
+      } finally {
+        truncateStream.close()
+      }
+    } catch {
+      case e: Exception =>
+        logError("Uncaught exception while reverting partial writes to file " + file, e)
--- End diff --

In the use of writers in HashShuffleWriter, it is possible for a closed 
stream to be reverted (if some other stream's close failed, for example).
In that case, the above will leave this file with leftover data; I am not sure 
what the impact of that would be.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15682407
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -171,28 +155,70 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
         (Some(e.toErrorString), None)
     }
 
+      if (!metrics.isEmpty) {
+        val oldMetrics = stageData.taskData.get(info.taskId).flatMap(_.taskMetrics)
+        updateAggregateMetrics(stageData, info.executorId, metrics.get, oldMetrics)
+      }
 
-      val taskRunTime = metrics.map(_.executorRunTime).getOrElse(0L)
-      stageData.executorRunTime += taskRunTime
-      val inputBytes = metrics.flatMap(_.inputMetrics).map(_.bytesRead).getOrElse(0L)
-      stageData.inputBytes += inputBytes
-
-      val shuffleRead = metrics.flatMap(_.shuffleReadMetrics).map(_.remoteBytesRead).getOrElse(0L)
-      stageData.shuffleReadBytes += shuffleRead
-
-      val shuffleWrite =
-        metrics.flatMap(_.shuffleWriteMetrics).map(_.shuffleBytesWritten).getOrElse(0L)
-      stageData.shuffleWriteBytes += shuffleWrite
-
-      val memoryBytesSpilled = metrics.map(_.memoryBytesSpilled).getOrElse(0L)
-      stageData.memoryBytesSpilled += memoryBytesSpilled
+      val taskData = stageData.taskData.getOrElseUpdate(info.taskId, new TaskUIData(info))
+      taskData.taskInfo = info
+      taskData.taskMetrics = metrics
+      taskData.errorMessage = errorMessage
+    }
+  }
 
-      val diskBytesSpilled = metrics.map(_.diskBytesSpilled).getOrElse(0L)
-      stageData.diskBytesSpilled += diskBytesSpilled
+  def updateAggregateMetrics(
--- End diff --

Could you add a javadoc for this?
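For context, the kind of behavior such a doc would describe can be sketched as follows. This is a hypothetical, heavily simplified stand-in for `updateAggregateMetrics` (the name, signature, and doc wording are illustrative, not the PR's actual API): stage totals are updated with the *delta* between a task's newly reported and previously reported metrics, so periodic heartbeats don't double-count a running task.

```scala
/**
 * Adds the change in a single task metric to the corresponding stage-level total.
 *
 * Heartbeats report cumulative per-task values, so only the difference between
 * the latest and the previously seen value is added to the stage aggregate.
 *
 * @param stageTotal the stage's current aggregate value
 * @param newValue   the metric's latest reported value for the task
 * @param oldValue   the metric's previously reported value, if any
 * @return the updated stage-level total
 */
def updateAggregate(stageTotal: Long, newValue: Long, oldValue: Option[Long]): Long =
  stageTotal + (newValue - oldValue.getOrElse(0L))
```

A second heartbeat for the same task then only contributes the increment since the first, rather than its full cumulative value again.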




[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50849016
  
QA tests have started for PR 1679. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17644/consoleFull




[GitHub] spark pull request: SPARK-2711. Create a ShuffleMemoryManager to t...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1707#issuecomment-50849019
  
QA tests have started for PR 1707. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50849017
  
QA tests have started for PR 1056. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17646/consoleFull




[GitHub] spark pull request: SPARK-2532: Minimal shuffle consolidation fixe...

2014-07-31 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1678#discussion_r15682389
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/HashShuffleWriter.scala ---
@@ -120,8 +121,7 @@ private[spark] class HashShuffleWriter[K, V](
   private def revertWrites(): Unit = {
     if (shuffle != null && shuffle.writers != null) {
       for (writer <- shuffle.writers) {
-        writer.revertPartialWrites()
-        writer.close()
+        writer.revertPartialWritesAndClose()
--- End diff --

revert can throw an exception, which will cause the other writers to not revert.
We need to wrap it in try/catch, log, and continue.
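The pattern being requested can be sketched as follows; `Writer` here is an illustrative stand-in for Spark's `BlockObjectWriter`, and the logging is simplified to `println`:

```scala
// Revert every writer even if some of them fail: wrap each revert in
// try/catch, log the failure, and continue with the remaining writers.
trait Writer {
  def revertPartialWritesAndClose(): Unit
}

def revertAll(writers: Seq[Writer]): Unit = {
  for (writer <- writers) {
    try {
      writer.revertPartialWritesAndClose()
    } catch {
      case e: Exception =>
        // Log and keep going so one failure doesn't block the other reverts.
        println(s"Failed to revert a writer: ${e.getMessage}")
    }
  }
}
```

With this shape, an exception thrown by one writer no longer prevents the remaining writers from reverting.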




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15682370
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala ---
@@ -56,7 +56,7 @@ private[jobs] object UIData {
   }
 
   case class TaskUIData(
-  taskInfo: TaskInfo,
-  taskMetrics: Option[TaskMetrics] = None,
-  errorMessage: Option[String] = None)
+  var taskInfo: TaskInfo,
+  var taskMetrics: Option[TaskMetrics] = None,
+  var errorMessage: Option[String] = None)
--- End diff --

Actually scratch that (after a bit more thought). Let's keep it mutable but 
please put a comment explaining that the objects are re-used in order to avoid 
excessive allocation.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15682364
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala ---
@@ -56,7 +56,7 @@ private[jobs] object UIData {
   }
 
   case class TaskUIData(
-  taskInfo: TaskInfo,
-  taskMetrics: Option[TaskMetrics] = None,
-  errorMessage: Option[String] = None)
+  var taskInfo: TaskInfo,
+  var taskMetrics: Option[TaskMetrics] = None,
+  var errorMessage: Option[String] = None)
--- End diff --

In general we strongly prefer immutable data structures because it's much
easier to reason about their state. On this one I'd actually propose keeping it
immutable, and we can see if we need to adjust that in the future. I'd also
recommend decreasing the heartbeat interval to something like 10 seconds (or
maybe less frequent) anyway, to avoid this and other problems.




[GitHub] spark pull request: SPARK-2711. Create a ShuffleMemoryManager to t...

2014-07-31 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1707#issuecomment-50848891
  
test this please




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50848855
  
test this please




[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-07-31 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50848836
  
test this please




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50848722
  
@bgreeven The filename
`mllib/src/main/scala/org/apache/spark/mllib/ann/GeneralizedSteepestDescendAlgorithm`
doesn't have a `.scala` extension.




[GitHub] spark pull request: [SPARK-2179][SQL] A minor refactoring Java dat...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1712#issuecomment-50848654
  
QA tests have started for PR 1712. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17643/consoleFull




[GitHub] spark pull request: [MLLIB] SPARK-2311: Added additional GLMs (Poi...

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1237#issuecomment-50848559
  
Sorry, I'm still working on it and will put the design doc on JIRA soon.
But unfortunately, it may not make the v1.1 release.




[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1710#issuecomment-50848506
  
QA results for PR 1710:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17637/consoleFull




[GitHub] spark pull request: [SPARK-2179][SQL] A minor refactoring Java dat...

2014-07-31 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/1712

[SPARK-2179][SQL] A minor refactoring Java data type APIs (2179 follow-up).

It is a follow-up PR of SPARK-2179 
(https://issues.apache.org/jira/browse/SPARK-2179). It makes package names of 
data type APIs more consistent across languages (Scala: `org.apache.spark.sql`, 
Java: `org.apache.spark.sql.api.java`, Python: `pyspark.sql`).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark javaDataType

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1712.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1712


commit add4bcbfd1748a06ec63852d1644dcc70d9c8636
Author: Yin Huai 
Date:   2014-08-01T04:32:20Z

Make the package names of data type classes consistent across languages by 
moving all Java data type classes to package sql.api.java.

commit 62eb705d0a7eb99476fc9b77a0c190cd22ebf10c
Author: Yin Huai 
Date:   2014-08-01T04:35:32Z

Move package-info.






[GitHub] spark pull request: [SPARK-2179][SQL] A minor refactoring Java dat...

2014-07-31 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1712#issuecomment-50848476
  
Jenkins, test this please.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50848434
  
QA results for PR 1056:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class SparkListenerExecutorMetricsUpdate(
  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17641/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15682191
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -350,4 +353,47 @@ private[spark] class Executor(
   }
 }
   }
+
+  def stop() {
+isStopped = true
+threadPool.shutdown()
+  }
+
+  def startDriverHeartbeater() {
+val interval = conf.getInt("spark.executor.heartbeatInterval", 2000)
--- End diff --

should we turn this down to something like 10 seconds by default? Unlike in
Hadoop, we don't rely on this in order for tasks to start or finish. It might
be good to stay conservative here to make sure performance is not an issue. The
main value I see for this feature in general is to deal with long-running tasks
anyway.
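A minimal sketch of the suggested change follows. The property name `spark.executor.heartbeatInterval` comes from the diff; the 10-second default and the `lookup` function (a stand-in for `SparkConf`) are illustrative:

```scala
// Resolve the heartbeat interval (ms) with a conservative 10s default
// instead of the 2s default in the diff.
def heartbeatIntervalMs(lookup: String => Option[String]): Int =
  lookup("spark.executor.heartbeatInterval").map(_.toInt).getOrElse(10000)
```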




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50848263
  
QA tests have started for PR 1635. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17642/consoleFull




[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-31 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50848123
  
test this please




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15682056
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -320,6 +323,26 @@ private[spark] class TaskSchedulerImpl(
     }
   }
 
+  /**
+   * Update metrics for in-progress tasks and let the master know that the BlockManager is still
+   * alive. Return true if the driver knows about the given block manager. Otherwise, return false,
+   * indicating that the block manager should re-register.
+   */
+  override def executorHeartbeatReceived(
+      execId: String,
+      taskMetrics: Array[(Long, TaskMetrics)], // taskId -> TaskMetrics
+      blockManagerId: BlockManagerId): Boolean = {
+    val metricsWithStageIds = taskMetrics.flatMap {
+      case (id, metrics) => {
+        taskIdToTaskSetId.get(id)
--- End diff --

I think there is an unlikely race here where (a) a task heartbeat gets 
enqueued to be sent (b) the task actually finishes and that message is sent, 
then taskIdToTaskSetId is cleared (c) the heartbeat arrives. This is possible 
since the heartbeater and the task execution itself are in different threads. 
Then you'd get an NPE here. Though extremely unlikely, it might be good to just 
log a warning and pass if the task set is not found.
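The defensive lookup being suggested can be sketched like this; the maps, the `Metrics` case class, and the function name are illustrative stand-ins for the scheduler's internal state, and logging is simplified to `println`:

```scala
// Drop heartbeat entries whose task is no longer known instead of throwing:
// the task may have finished between when the heartbeat was enqueued and
// when it arrived, so its taskIdToTaskSetId entry can already be gone.
case class Metrics(runTime: Long)

def metricsWithStageIds(
    taskMetrics: Seq[(Long, Metrics)], // taskId -> metrics
    taskIdToTaskSetId: Map[Long, String],
    taskSetIdToStageId: Map[String, Int]): Seq[(Long, Int, Metrics)] = {
  taskMetrics.flatMap { case (taskId, metrics) =>
    taskIdToTaskSetId.get(taskId).flatMap(taskSetIdToStageId.get) match {
      case Some(stageId) => Some((taskId, stageId, metrics))
      case None =>
        // Warn and skip rather than hit an NPE on a stale heartbeat.
        println(s"Ignoring heartbeat for unknown task $taskId")
        None
    }
  }
}
```

A heartbeat for an unknown task is then logged and dropped, while entries for known tasks are resolved to their stage as before.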




[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1710#issuecomment-50847713
  
LGTM. Merged into master. Thanks!




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15681944
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -38,8 +37,10 @@ import org.apache.spark._
 import org.apache.spark.executor.TaskMetrics
 import org.apache.spark.partial.{ApproximateActionListener, ApproximateEvaluator, PartialResult}
 import org.apache.spark.rdd.RDD
+import org.apache.spark.storage._
 import org.apache.spark.storage.{BlockId, BlockManager, BlockManagerMaster, RDDBlockId}
--- End diff --

@sryza yeah this line should be removed if you are adding the catch-all 
import




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50847534
  
QA results for PR 1369:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17640/consoleFull




[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1710#issuecomment-50847511
  
QA results for PR 1710:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17635/consoleFull




[GitHub] spark pull request: [SPARK-1812] mllib - upgrade to breeze 0.8.1

2014-07-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1703#issuecomment-50847433
  
This was discussed in https://github.com/apache/spark/pull/940 . Do you 
mind closing this PR? Thanks!




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15681898
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -991,6 +994,9 @@ class SparkContext(config: SparkConf) extends Logging {
 dagScheduler = null
 if (dagSchedulerCopy != null) {
   metadataCleaner.cancel()
+  if (heartbeatReceiver != null) {
--- End diff --

I don't see an execution code path where this could possibly be null. Is 
there one?




[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50847382
  
QA results for PR 1679:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17636/consoleFull




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50847358
  
QA tests have started for PR 1369. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17640/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-50847355
  
QA tests have started for PR 1056. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17641/consoleFull




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50847300
  
QA results for PR 1369:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17639/consoleFull




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-31 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r15681834
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -50,6 +50,7 @@ import org.apache.spark.scheduler.local.LocalBackend
 import org.apache.spark.storage.{BlockManagerSource, RDDInfo, StorageStatus, StorageUtils}
 import org.apache.spark.ui.SparkUI
 import org.apache.spark.util.{CallSite, ClosureCleaner, MetadataCleaner, MetadataCleanerType, TimeStampedWeakValueHashMap, Utils}
+import akka.actor.Props
--- End diff --

minor, but I think this belongs with the other non-Spark imports




[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1369#issuecomment-50847196
  
QA tests have started for PR 1369. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17639/consoleFull



