[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/3633

[SPARK-4759] Avoid using empty string as default preferred location

See JIRA for reproduction.

Our use of empty string as default preferred location in 
`CoalescedRDDPartition` causes the `TaskSetManager` to schedule the 
corresponding task on host `` (empty string). The intended semantics here, 
however, is that the partition does not have a preferred location, and the TSM 
should schedule the corresponding task accordingly.

I tested this on master and 1.1.
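
A minimal sketch of the idea, assuming the partition's preferred location is modeled as an `Option[String]` (the class and field names below are illustrative, not the exact Spark code):

    // Illustrative only: a coalesced partition whose preferred location is
    // optional rather than defaulting to "".
    case class ExamplePartition(index: Int, preferredLocation: Option[String] = None) {
      // An empty Seq tells the scheduler "no preference", whereas Seq("") would
      // ask it to schedule on a host literally named "" (the empty string).
      def preferredLocations: Seq[String] = preferredLocation.toSeq
    }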

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark coalesce-preferred-loc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3633.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3633


commit 2f7dfb603c000a204831748f1fbaa53ef52531c8
Author: Andrew Or and...@databricks.com
Date:   2014-12-08T07:53:15Z

Avoid using empty string as default preferred location

This is causing the TaskSetManager to try to schedule certain
tasks on the host  (empty string). The intended semantics here,
however, is that the partition does not have a preferred location,
and the TSM should schedule the corresponding task accordingly.







[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66035505
  
  [Test build #24219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24219/consoleFull) for PR 3633 at commit [`2f7dfb6`](https://github.com/apache/spark/commit/2f7dfb603c000a204831748f1fbaa53ef52531c8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/3634

[SPARK-3154][STREAMING] Replace ConcurrentHashMap with mutable.HashMap and 
remove @volatile from 'stopped'

Since `sequenceNumberToProcessor` and `stopped` are both protected by the 
lock `sequenceNumberToProcessor`, `ConcurrentHashMap` and `volatile` are 
unnecessary, so this PR updates them accordingly.
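
A minimal sketch of the locking pattern described above, with a placeholder value type standing in for `TransactionProcessor`: both fields are only touched while synchronized on the map itself, so a plain `mutable.HashMap` and a non-volatile flag are safe.

    import scala.collection.mutable

    class CallbackHandlerSketch {
      // Both fields are protected by synchronizing on `sequenceNumberToProcessor`.
      private val sequenceNumberToProcessor = mutable.HashMap[CharSequence, String]()
      private var stopped = false

      def register(seq: CharSequence, processor: String): Unit =
        sequenceNumberToProcessor.synchronized {
          if (!stopped) sequenceNumberToProcessor(seq) = processor
        }

      def shutdown(): Unit = sequenceNumberToProcessor.synchronized {
        stopped = true
        sequenceNumberToProcessor.clear()
      }
    }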

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-3154

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3634.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3634


commit 0d087ac6ae18ed7766d08dc630aeb12279dbb4e7
Author: zsxwing zsxw...@gmail.com
Date:   2014-12-08T08:02:14Z

Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 
'stopped'







[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66036411
  
  [Test build #24220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24220/consoleFull) for PR 3634 at commit [`0d087ac`](https://github.com/apache/spark/commit/0d087ac6ae18ed7766d08dc630aeb12279dbb4e7).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66040668
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24219/
Test FAILed.





[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66040660
  
  [Test build #24219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24219/consoleFull) for PR 3633 at commit [`2f7dfb6`](https://github.com/apache/spark/commit/2f7dfb603c000a204831748f1fbaa53ef52531c8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Add error message when making local dir unsucc...

2014-12-08 Thread XuTingjun
GitHub user XuTingjun opened a pull request:

https://github.com/apache/spark/pull/3635

Add error message when making local dir unsuccessfully



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/XuTingjun/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3635


commit 1c51a0c78c8477f4aae83ec18212c773aed57701
Author: meiyoula 1039320...@qq.com
Date:   2014-12-08T09:11:09Z

Update DiskBlockManager.scala







[GitHub] spark pull request: Add error message when making local dir unsucc...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3635#issuecomment-66041481
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3640] [Streaming] [Kinesis] Allow users...

2014-12-08 Thread aniketbhatnagar
Github user aniketbhatnagar commented on the pull request:

https://github.com/apache/spark/pull/3092#issuecomment-66043882
  
@cfregly, unfortunately, I have been stuck with some other work and haven't 
been able to test this yet. I will find time this week. Sorry for the delay.





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66043962
  
  [Test build #24220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24220/consoleFull) for PR 3634 at commit [`0d087ac`](https://github.com/apache/spark/commit/0d087ac6ae18ed7766d08dc630aeb12279dbb4e7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66043975
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24220/
Test PASSed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66109601
  
(1) Users implementing their own regularizers

OK. I'd prefer to make all the regularizer methods `private[mllib]`.

(2) Regular and Robust in the same class

I understand what dynamic polymorphism is. Unfortunately, the getNewTheta() 
methods have different parameters in the robust and non-robust classes.

What's more significant, the user has to know which class the returned instance 
belongs to -- robust or non-robust. Without this knowledge one would have to cast 
the returned parameter (e.g. from type `DocumentParameters` to type 
`RobustDocumentParametrs`) in order to access the `noise` field (see the sketch 
at the end of this message). That's why I see no way to provide the user with a 
single facade class.

And thank you for mentioning visibility -- my fault.

(3) PLSA and RobustPLSA code duplication

Thank you very much for reading the code.

(4) Float vs. Double and linear algebra operations

OK. I'll use `Array[Array[Float]]` then. But you've mentioned it'd be nice 
to extract all the linear algebra code to `mllib/linalg/`. Could you please 
point at the parts of my code implementing linear algebra operations that 
should be moved to `mllib/linalg/`? BTW I'm not sure it's possible, because 
`mllib/linalg/` relies on `trait Matrix` while my code relies on 
`Array[Array[Float]]`.

(5) You've also said `Enumerator` should be private. I can definitely make 
it private and change the `TopicModel.infer()` method so that it consumes 
`RDD[Seq[String]]` instead of `RDD[Documents]` and calls `Enumerator` inside 
the method.

But what if one wants to train ten models one after another (in order to choose 
the best parameters)? Enumeration will be performed 10 times. Isn't that a waste?
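
A minimal sketch of the casting issue described in (2), using hypothetical class shapes (the actual classes live in the PR and may differ):

    // Hypothetical class shapes, only to illustrate the downcast.
    class DocumentParameters(val theta: Array[Float])
    class RobustDocumentParameters(theta: Array[Float], val noise: Array[Float])
      extends DocumentParameters(theta)

    object CastExample {
      def noiseSize(params: DocumentParameters): Int = params match {
        // A caller holding only the base type must downcast (or pattern-match)
        // to reach the robust-only `noise` field.
        case robust: RobustDocumentParameters => robust.noise.length
        case _ => 0
      }
    }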








[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66110750
  
LGTM





[GitHub] spark pull request: Add error message when making local dir unsucc...

2014-12-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3635#discussion_r21449898
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -67,11 +67,14 @@ private[spark] class DiskBlockManager(blockManager: 
BlockManager, conf: SparkCon
 if (subDir == null) {
   subDir = subDirs(dirId).synchronized {
 val old = subDirs(dirId)(subDirId)
-if (old != null) {
+if (old != null && old.exists()) {
   old
 } else {
  val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
-  newDir.mkdir()
+  val foundLocalDir = newDir.mkdir()
+   if (!foundLocalDir) {
--- End diff --

Indent has one too many spaces. The message should probably be a warning. 
It says "ignoring this directory", but the directory doesn't seem to be ignored? 
You also changed the semantics of the condition, to replace a value that was a 
non-existent dir. That seems reasonable, but this can replace it with a 
directory that can't be created for some reason. Is this not an exception 
condition?
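
For illustration, one hedged way the failure could be surfaced as a hard error rather than a silently logged message (the method name and message wording below are assumptions, not the final patch):

    import java.io.{File, IOException}

    object LocalDirSketch {
      def getOrCreateSubDir(parent: File, subDirId: Int): File = {
        val newDir = new File(parent, "%02x".format(subDirId))
        // mkdir() returns false both when creation fails and when the directory
        // already exists, so check exists() before treating false as an error.
        if (!newDir.exists() && !newDir.mkdir()) {
          throw new IOException(s"Failed to create local dir in $newDir.")
        }
        newDir
      }
    }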





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66111941
  
  [Test build #24221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24221/consoleFull) for PR 1269 at commit [`24b11a5`](https://github.com/apache/spark/commit/24b11a57bdd18bdeb0409000cb836235227e6d25).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66112025
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24221/
Test FAILed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66112020
  
  [Test build #24221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24221/consoleFull) for PR 1269 at commit [`24b11a5`](https://github.com/apache/spark/commit/24b11a57bdd18bdeb0409000cb836235227e6d25).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66112914
  
  [Test build #24222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24222/consoleFull) for PR 1269 at commit [`4a4a4f8`](https://github.com/apache/spark/commit/4a4a4f84da1954f585f2474ab3ee06c5b998c990).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66113672
  
  [Test build #24222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24222/consoleFull) for PR 1269 at commit [`4a4a4f8`](https://github.com/apache/spark/commit/4a4a4f84da1954f585f2474ab3ee06c5b998c990).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66113681
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24222/
Test FAILed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66114412
  
QA tests have started for PR 1269. This patch DID NOT merge cleanly!
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24223/consoleFull





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread Lewuathe
GitHub user Lewuathe opened a pull request:

https://github.com/apache/spark/pull/3636

[SPARK-3382] GradientDescent convergence tolerance

GradientDescent can receive a convergence tolerance value. The default value is 
0.0.
When the loss value becomes less than the tolerance set by the user, the 
iteration is terminated.
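
A standalone sketch of the termination rule as described above (the `lossAt` callback stands in for one gradient-descent step; names and structure are illustrative, not MLlib's actual GradientDescent):

    object ConvergenceSketch {
      // Stops as soon as the loss drops below the user-set tolerance; with the
      // default tolerance of 0.0 the loop always runs the full maxIterations.
      def run(lossAt: Int => Double, maxIterations: Int, tolerance: Double = 0.0): Int = {
        var i = 0
        var loss = Double.MaxValue
        while (i < maxIterations && loss >= tolerance) {
          loss = lossAt(i) // stand-in for computing one step and its loss
          i += 1
        }
        i // number of iterations actually performed
      }
    }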

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Lewuathe/spark gd-convergence-tolerance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3636.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3636


commit 5433f71a3822b0fb16b910f64dc53ede8d539ebe
Author: lewuathe lewua...@me.com
Date:   2014-12-08T13:19:21Z

[SPARK-3382] GradientDescent convergence tolerance







[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3636#issuecomment-66115272
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4001][MLlib] adding apriori and fp-grow...

2014-12-08 Thread denmoroz
Github user denmoroz commented on the pull request:

https://github.com/apache/spark/pull/2847#issuecomment-66125004
  
Maybe it is better to use RDD[BitSet] as the transactions RDD? Then you can add 
a preprocessor trait and support any transformation from a source RDD to an RDD 
of BitSets, for example a transformation of RDD[Array[String]] to RDD[BitSet].
It seems to me that BitSet is a much better representation of transactions 
than Array[String], Array[Int], or anything else.
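
A rough sketch of the kind of preprocessing being suggested (the trait name and the item-index dictionary are made up for illustration):

    import scala.collection.immutable.BitSet
    import org.apache.spark.rdd.RDD

    // Hypothetical preprocessor: any transformation from a source RDD to
    // transactions encoded as BitSets over a global item index.
    trait TransactionPreprocessor[T] {
      def toBitSets(input: RDD[T]): RDD[BitSet]
    }

    class StringArrayPreprocessor(itemIndex: Map[String, Int])
      extends TransactionPreprocessor[Array[String]] {
      override def toBitSets(input: RDD[Array[String]]): RDD[BitSet] = {
        val index = itemIndex // local copy so the closure does not capture `this`
        // Each transaction becomes the set of indices of its known items.
        input.map(items => BitSet(items.flatMap(item => index.get(item)): _*))
      }
    }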





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66126236
  
QA results for PR 1269:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24223/consoleFull





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66126247
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24223/
Test FAILed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66127106
  
@jkbradley, could you please have a look at the logs -- I have no idea why the 
PySpark tests failed. 





[GitHub] spark pull request: [SPARK-4001][MLlib] adding apriori and fp-grow...

2014-12-08 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/2847#issuecomment-66129552
  
As long as itemset mining is under consideration, has anybody tried a Spark 
implementation of Logical Itemset Mining:
http://cvit.iiit.ac.in/papers/Chandrashekar2012Logical.pdf






[GitHub] spark pull request: [SPARK-4001][MLlib] adding apriori and fp-grow...

2014-12-08 Thread denmoroz
Github user denmoroz commented on the pull request:

https://github.com/apache/spark/pull/2847#issuecomment-66130537
  
Do you use the SON algorithm for the parallel Apriori implementation?
(http://importantfish.com/limited-pass-algorithms/)





[GitHub] spark pull request: [SPARK-4764] Ensure that files are fetched ato...

2014-12-08 Thread preaudc
Github user preaudc commented on the pull request:

https://github.com/apache/spark/pull/2855#issuecomment-66131633
  
Thanks for the review, @JoshRosen, I've created a new JIRA as requested.





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-08 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/3409#issuecomment-66134213
  
It's a matter of what's more obvious to the user who doesn't necessarily read 
the documentation. Adding in `clientmode` hopefully helps the user realize this 
config only does something in yarn-client mode. 





[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-08 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/3215#issuecomment-66135534
  
Seems like we are pretty close on the RC. I'm good with merging this. 
@andrewor14, any objections at this point?





[GitHub] spark pull request: SPARK-4770. [DOC] [YARN] spark.scheduler.minRe...

2014-12-08 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/3624#issuecomment-66139557
  
+1. Thanks Sandy!





[GitHub] spark pull request: SPARK-4770. [DOC] [YARN] spark.scheduler.minRe...

2014-12-08 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/3624#issuecomment-66140307
  
@pwendell  is it ok to pull this doc change into 1.2?





[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

2014-12-08 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/3518#issuecomment-66144913
  
Hi @JoshRosen - can I please get this run through Jenkins? Thanks!





[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3518#issuecomment-66145968
  
  [Test build #24224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24224/consoleFull) for PR 3518 at commit [`ef3dd39`](https://github.com/apache/spark/commit/ef3dd39109aca93e899affef8716655aa7669ce0).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3518#issuecomment-66145560
  
Jenkins, this is ok to test.





[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3518#issuecomment-66158875
  
  [Test build #24224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24224/consoleFull) for PR 3518 at commit [`ef3dd39`](https://github.com/apache/spark/commit/ef3dd39109aca93e899affef8716655aa7669ce0).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3518#issuecomment-66158890
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24224/
Test PASSed.





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3574#discussion_r21473307
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1010,7 +1010,10 @@ private[spark] class BlockManager(
   info.synchronized {
 // required ? As of now, this will be invoked only for blocks 
which are ready
 // But in case this changes in future, adding for consistency sake.
-if (!info.waitForReady()) {
+if (blockInfo.get(blockId).isEmpty) {
+  logWarning(s"Block $blockId was already dropped.")
+  return None
+} else if(!info.waitForReady()) {
--- End diff --

Minor style nit: this needs a space after the `if` and before the open 
paren: `if (!info...`.





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3574#issuecomment-66162338
  
Jenkins, this is ok to test.





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3574#issuecomment-66163112
  
  [Test build #24225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24225/consoleFull) for PR 3574 at commit [`55fa4ba`](https://github.com/apache/spark/commit/55fa4ba1e41eb36b1c4f867efbdd35c9b8a4f131).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4298][Core] - The spark-submit cannot r...

2014-12-08 Thread brennonyork
Github user brennonyork commented on the pull request:

https://github.com/apache/spark/pull/3561#issuecomment-66163505
  
@JoshRosen I'm pretty sure we can support the `hdfs://` URI 
model. I'll look and see if, given an `hdfs://` URI, Spark would already have 
some sort of Hadoop `Configuration` object representing the connection made; 
if not, we can always make one.

Also, can you help me understand why the tests failed? I'm seeing:

`[error] (streaming/test:test) sbt.TestsFailedException: Tests unsuccessful`

But that isn't really that helpful and, as with all the talk on the dev 
distro, I'm just wondering whether it's the patch that fails or a timing / 
sync issue (`./dev/run-tests` finishes without fail on my OSX machine).





[GitHub] spark pull request: [SPARK-4616][Core] - SPARK_CONF_DIR is not eff...

2014-12-08 Thread brennonyork
Github user brennonyork commented on the pull request:

https://github.com/apache/spark/pull/3559#issuecomment-66163851
  
@JoshRosen Is there anything else needed for this patch to be pushed in? 
Any feedback / review would be great as well!





[GitHub] spark pull request: Add example that reads a local file, writes to...

2014-12-08 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/3347#issuecomment-66167709
  
@andrewor14 Could you take a second look when you get a chance?  Thanks!





[GitHub] spark pull request: [SPARK-4298][Core] - The spark-submit cannot r...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3561#issuecomment-66168555
  
Hmm, it looks like there's already a JIRA for that particular test's 
flakiness: [SPARK-1600](https://issues.apache.org/jira/browse/SPARK-1600).





[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...

2014-12-08 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3375#discussion_r21477718
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala
 ---
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
 
+  @transient private[this] lazy val joinedRow = new JoinedRow()
--- End diff --

I believe that it is working now, but my objection is primarily to having 
mutable state stored inside of the task instead of local to a single execution. 
If we decide to be more clever about sharing task metadata in the future, this 
could break in very subtle ways. Also, the cost of accessing a lazy val is 
almost certainly higher than accessing a local stack variable.
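
A hedged sketch of the distinction being drawn: a `@transient lazy val` is shared by every execution of the operator instance (and each access goes through the lazy-val initialization check), while a value created inside the per-partition code is local to one execution. Class and method names here are illustrative, not the actual HashOuterJoin code.

    class JoinOperatorSketch extends Serializable {
      // Variant being objected to: state shared by every execution of this
      // operator instance on an executor, reached through a lazy-val check.
      @transient private[this] lazy val sharedRow = new StringBuilder

      def executeShared(rows: Iterator[String]): Iterator[String] =
        rows.map(row => { sharedRow.setLength(0); sharedRow.append(row).toString })

      // Preferred variant: the buffer is local to a single execution.
      def executeLocal(rows: Iterator[String]): Iterator[String] = {
        val localRow = new StringBuilder
        rows.map(row => { localRow.setLength(0); localRow.append(row).toString })
      }
    }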





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3409#discussion_r21478911
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -358,6 +358,21 @@ private[spark] trait ClientBase extends Logging {
   if (libraryPaths.nonEmpty) {
 prefixEnv = Some(Utils.libraryPathEnvPrefix(libraryPaths))
   }
+} else {
+  // Validate and include yarn am specific java options in yarn-client 
mode.
+  val amOptsKey = "spark.yarn.clientmode.am.extraJavaOptions"
+  val amOpts = sparkConf.getOption(amOptsKey)
+  amOpts.map { javaOpts =>
--- End diff --

I'd just simplify this as:

    sparkConf.getOption(amOptsKey).foreach { opts =>
  // validate
  // javaOpts += opts
}

Hint: `map()` is more expensive than `foreach()` in general (because it 
returns something, unlike foreach).





[GitHub] spark pull request: [SPARK-4764] Ensure that files are fetched ato...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2855#issuecomment-66173949
  
Thanks for creating the new JIRA.  This looks good to me, so I'm going to 
merge it into `master` and `branch-1.1` for now (I've added a `backport-needed` 
label to the JIRA so that we remember to merge this into `branch-1.2` after the 
1.2.0 vote ends).  Thanks!





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-08 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/3409#issuecomment-66174011
  
LGTM aside from minor style issue.





[GitHub] spark pull request: [SPARK-4764] Ensure that files are fetched ato...

2014-12-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2855





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66175119
  
+1. Looks good!





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/3634#discussion_r21479585
  
--- Diff: 
external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
 ---
@@ -47,8 +47,8 @@ private[flume] class SparkAvroCallbackHandler(val 
threads: Int, val channel: Cha
   val transactionExecutorOpt = Option(Executors.newFixedThreadPool(threads,
 new ThreadFactoryBuilder().setDaemon(true)
  .setNameFormat("Spark Sink Processor Thread - %d").build()))
-  private val sequenceNumberToProcessor =
-new ConcurrentHashMap[CharSequence, TransactionProcessor]()
+  // Protected by `sequenceNumberToProcessor`
--- End diff --

Could use the `@GuardedBy("sequenceNumberToProcessor")` javax annotation.
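
A tiny sketch of what that might look like, assuming the JSR-305 `javax.annotation.concurrent.GuardedBy` annotation is on the classpath (the field types are placeholders):

    import javax.annotation.concurrent.GuardedBy
    import scala.collection.mutable

    class AnnotatedHandlerSketch {
      private val sequenceNumberToProcessor = mutable.HashMap[CharSequence, String]()

      // Documents (for readers and static-analysis tools) that this field must
      // only be accessed while synchronized on `sequenceNumberToProcessor`.
      @GuardedBy("sequenceNumberToProcessor")
      private var stopped = false
    }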





[GitHub] spark pull request: [SPARK-3154][STREAMING] Replace ConcurrentHash...

2014-12-08 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/3634#issuecomment-66175687
  
LGTM too, at your discretion you could replace the comment with the 
annotation or not. Will merge when addressed.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/3637

[SPARK-4789] [mllib] Standardize ML Prediction APIs

This is part (1) of the updates from the WIP PR in 
[https://github.com/apache/spark/pull/3427]

Abstract classes for learning algorithms:
* Classifier
* Regressor
* Predictor

Traits for learning algorithms
* ProbabilisticClassificationModel

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)

Concrete classes: other
* LabeledPoint (adding weight to the old LabeledPoint)

Other updates:
* Modified ParamMap to sort parameters in toString

Test Suites:
* LabeledPointSuite
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: @mengxr  @etrain  @shivaram 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark ml-api-part1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3637.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3637


commit de1e3b4c39b42757e56345a6bab2bdeefaa3ca25
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-11-24T07:18:52Z

Added lots of classes for new ML API:

Abstract classes for learning algorithms:
* Classifier
* Regressor
* Predictor

Traits for learning algorithms
* HasDefaultEstimator
* IterativeEstimator
* IterativeSolver
* ProbabilisticClassificationModel
* WeakLearner

Concrete classes: learning algorithms
* AdaBoost (partly implemented)
* NaiveBayes (rough implementation)
* LinearRegression
* LogisticRegression (updated to use new abstract classes)

Concrete classes: evaluation
* ClassificationEvaluator
* RegressionEvaluator
* PredictionEvaluator

Concrete classes: other
* LabeledPoint (adding weight to the old LabeledPoint)

commit 6551244b96d8f70f1daacd0415318cf81fd5111a
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-11-24T07:30:31Z

fixed compilation issues, but have not added tests yet

commit 25b643d4b367fea5a3ba1b91564374c2b1b7a0f1
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-01T18:31:41Z

removing everything except for simple class hierarchy for classification

commit e61e2738dcb2494be25cec2bd798c3e6e5156b73
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-04T21:37:29Z

Added LinearRegression and Regressor back from ml-api branch

commit 272e62fb41fc8778f3a13f812d4262d9558a772b
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-05T00:11:02Z

Modified ParamMap to sort parameters in toString.  Cleaned up classes in 
class hierarchy, before implementing tests and examples.

commit cc13d61f2a277b101f7422af240afa64dfb10236
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-05T01:11:22Z

Fixed bug from last commit (sorting paramMap by parameter names in 
toString).  Fixed bug in persisting logreg data.  Added threshold_internal to 
logreg for faster test-time prediction (avoiding map lookup).

commit 09fb85fb7502a64a661c5f8ae4c941971ff861c8
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-05T18:22:10Z

Fixed issue with logreg threshold being set correctly

commit a0faf022792524c5a33a20d7cb591a91a7ac160b
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-05T18:43:14Z

Updated docs.  Added LabeledPointSuite to spark.ml

commit 3e961cb6616906940fd646639f818c58d29c04f6
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-05T23:15:48Z

* Changed semantics of Predictor.train() to merge the given paramMap with 
the embedded paramMap.
* remove threshold_internal from logreg
* Added Predictor.copy()
* Extended LogisticRegressionSuite

commit 8922966757e7b5d7588613f5dfc11cee267de1b4
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-06T01:32:14Z

added train() to Predictor subclasses which does not take a ParamMap.

commit 0c45756e3614c027d662d70dfa11d736690dc837
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-06T03:57:12Z

* fixed LinearRegression train() to use embedded paramMap
* added Predictor.predict(RDD[Vector]) method
* updated Linear/LogisticRegressionSuites

commit 6be36c16484478bdb9d847fd343d6b7319759b21
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-06T06:18:30Z

Added JavaLabeledPointSuite.java for spark.ml, and added constructor to 
LabeledPoint which defaults weight to 1.0

commit d8eaf7099a9be6157f90b11f82917ca5b604e1bd
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-08T19:09:03Z

Added methods:
* Classifier: batch predictRaw()
   

[GitHub] spark pull request: [MLLIB] [WIP] [SPARK-3702] Standardizing abstr...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3427#issuecomment-66177125
  
I just submitted the first part of this PR: 
[https://github.com/apache/spark/pull/3637/files]





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66177658
  
  [Test build #24226 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24226/consoleFull)
 for   PR 3637 at commit 
[`1e46094`](https://github.com/apache/spark/commit/1e46094fbf2534ff022cb843a811b3fbd7fb9d64).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66177711
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24226/
Test FAILed.





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3574#issuecomment-66177706
  
  [Test build #24225 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24225/consoleFull)
 for   PR 3574 at commit 
[`55fa4ba`](https://github.com/apache/spark/commit/55fa4ba1e41eb36b1c4f867efbdd35c9b8a4f131).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66177709
  
  [Test build #24226 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24226/consoleFull)
 for   PR 3637 at commit 
[`1e46094`](https://github.com/apache/spark/commit/1e46094fbf2534ff022cb843a811b3fbd7fb9d64).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LabeledPoint(label: Double, features: Vector, weight: 
Double) `






[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3574#issuecomment-66177717
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24225/
Test PASSed.





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3636#discussion_r21480864
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala 
---
@@ -27,6 +27,8 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.linalg.{Vectors, Vector}
 import org.apache.spark.mllib.rdd.RDDFunctions._
 
+import scala.util.control.Breaks
--- End diff --

Please organize imports (Scala/Java, then non-Spark imports, then Spark)
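For reference, the requested grouping for this file would look roughly like this (the breeze and annotation imports are assumptions based on the `BDV` alias and annotations used elsewhere in the file; only the grouping matters):

    import scala.util.control.Breaks

    import breeze.linalg.{DenseVector => BDV}

    import org.apache.spark.annotation.DeveloperApi
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.rdd.RDDFunctions._
    import org.apache.spark.rdd.RDD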





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3636#discussion_r21480867
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala 
---
@@ -39,6 +41,7 @@ class GradientDescent private[mllib] (private var 
gradient: Gradient, private va
   private var numIterations: Int = 100
   private var regParam: Double = 0.0
   private var miniBatchFraction: Double = 1.0
+  private var convergenceTolerance: Double = 0.0
--- End diff --

I feel like the default should be > 0.0.  Something small like 0.001 (a 
value pulled from libsvm 
[https://github.com/cjlin1/libsvm/blob/master/python/svm.py]) might be 
reasonable.  Basically, I think that convergence tolerance is generally a 
better stopping criterion than numIterations, and having it > 0.0 will give it 
a chance of taking effect before numIterations.





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3636#discussion_r21480907
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala 
---
@@ -182,34 +195,38 @@ object GradientDescent extends Logging {
     var regVal = updater.compute(
       weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2
 
-    for (i <- 1 to numIterations) {
-      val bcWeights = data.context.broadcast(weights)
-      // Sample a subset (fraction miniBatchFraction) of the total data
-      // compute and sum up the subgradients on this subset (this is one map-reduce)
-      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
-        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
-          seqOp = (c, v) => {
-            // c: (grad, loss, count), v: (label, features)
-            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
-            (c._1, c._2 + l, c._3 + 1)
-          },
-          combOp = (c1, c2) => {
-            // c: (grad, loss, count)
-            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
-          })
-
-      if (miniBatchSize > 0) {
-        /**
-         * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
-         * and regVal is the regularization value computed in the previous iteration as well.
-         */
-        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
-        val update = updater.compute(
-          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
-        weights = update._1
-        regVal = update._2
-      } else {
-        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
+    val b = new Breaks
+    b.breakable {
+      for (i <- 1 to numIterations) {
+        val bcWeights = data.context.broadcast(weights)
+        // Sample a subset (fraction miniBatchFraction) of the total data
+        // compute and sum up the subgradients on this subset (this is one map-reduce)
+        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
+          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
+            seqOp = (c, v) => {
+              // c: (grad, loss, count), v: (label, features)
+              val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
+              (c._1, c._2 + l, c._3 + 1)
+            },
+            combOp = (c1, c2) => {
+              // c: (grad, loss, count)
+              (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
+            })
+
+        if (miniBatchSize > 0) {
+          /**
+           * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
+           * and regVal is the regularization value computed in the previous iteration as well.
+           */
+          stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
+          val update = updater.compute(
+            weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam)
+          weights = update._1
+          regVal = update._2
+          if (stochasticLossHistory.last < convergenceTolerance) b.break
--- End diff --

This is comparing convergenceTolerance with the objective from the last 
iteration.  It should compare with the absolute value of the difference between 
the objective from the last iteration and the objective from the iteration 
before that.
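A minimal sketch of the intended check, assuming `stochasticLossHistory` gets the objective appended on every iteration and `b` is the `Breaks` instance from the diff above:

    if (stochasticLossHistory.length > 1) {
      val currentLoss = stochasticLossHistory(stochasticLossHistory.length - 1)
      val previousLoss = stochasticLossHistory(stochasticLossHistory.length - 2)
      // Break on the change in objective between consecutive iterations,
      // not on the objective value itself.
      if (math.abs(currentLoss - previousLoss) < convergenceTolerance) b.break
    }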





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3636#discussion_r21480898
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala 
---
@@ -77,6 +80,14 @@ class GradientDescent private[mllib] (private var 
gradient: Gradient, private va
   }
 
   /**
+   * Set the convergence tolerance. Default 0.0
--- End diff --

It would be good to note what convergence tolerance is.  In particular, can 
you please note that it is compared with the change in the objective between 
consecutive iterations?
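For example, something along these lines (wording and setter name are only a suggestion):

    /**
     * Set the convergence tolerance. Default 0.0 (disabled).
     * The optimizer stops early once the change in the objective (loss + regularization)
     * between two consecutive iterations falls below this value.
     */
    def setConvergenceTolerance(tolerance: Double): this.type = {
      this.convergenceTolerance = tolerance
      this
    }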





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3636#discussion_r21480909
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala 
---
@@ -219,4 +236,17 @@ object GradientDescent extends Logging {
 (weights, stochasticLossHistory.toArray)
 
   }
+
+  def runMiniBatchSGD(
--- End diff --

It is odd to have an API with 2 different argument orders.  Can this please 
be fixed in 1 of these 2 ways:
(1) Keep the old argument order, and have convergenceTolerance come after 
initialWeights.
(2) Remove this old method call completely, and update the code base where 
relevant.
I vote for (1) for consistency.
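In other words, option (1) would keep the existing signature and only append the new parameter; a sketch (parameter list reconstructed from the current method, so treat the details as assumptions):

    def runMiniBatchSGD(
        data: RDD[(Double, Vector)],
        gradient: Gradient,
        updater: Updater,
        stepSize: Double,
        numIterations: Int,
        regParam: Double,
        miniBatchFraction: Double,
        initialWeights: Vector,
        convergenceTolerance: Double): (Vector, Array[Double]) = ???  // body as before, plus the new stopping criterion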





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3636#issuecomment-66178359
  
@Lewuathe Thanks for the PR!  I added some inline comments.  One more 
general comment: when using subsampling (miniBatchFraction < 1.0), testing 
against a convergenceTolerance can be dangerous because of the stochasticity.  
It would be good to add a check at the beginning of optimization to see whether 
miniBatchFraction < 1.0 && convergenceTolerance > 0.0.  If that is the case, 
then we should print a warning.
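Something like the following at the top of runMiniBatchSGD would be enough (a sketch, not the PR's code):

    if (miniBatchFraction < 1.0 && convergenceTolerance > 0.0) {
      logWarning("Testing against a convergence tolerance when using miniBatchFraction < 1.0 " +
        "can be unstable because the objective is only estimated on a random subsample.")
    }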

Let me know when I should make another pass over the PR.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66179960
  
  [Test build #24227 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24227/consoleFull)
 for   PR 3637 at commit 
[`83109eb`](https://github.com/apache/spark/commit/83109ebef2fca4b6d28a83bf405c2edf1e5075db).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread mccheah
GitHub user mccheah opened a pull request:

https://github.com/apache/spark/pull/3638

[SPARK-4737] Task set manager properly handles serialization errors

Dealing with [SPARK-4737], the handling of serialization errors should not 
be the DAGScheduler's responsibility. The task set manager now catches the 
error and aborts the stage.

If the TaskSetManager throws a TaskNotSerializableException, the 
TaskSchedulerImpl will return an empty list of task descriptions, because no 
tasks were started. The scheduler should abort the stage gracefully.

Note that I'm not too familiar with this part of the codebase and its place 
in the overall architecture of the Spark stack. If implementing it this way 
would have any adverse side effects, please voice that loudly.
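To make the intent concrete, the flow described above is roughly the following sketch (names such as `ser`, `task` and `taskId` are placeholders; `TaskNotSerializableException` and the stage abort are what this PR introduces):

    import scala.util.control.NonFatal

    // Inside the task set manager, while building a task description:
    val serializedTask =
      try {
        ser.serialize(task)
      } catch {
        case NonFatal(e) =>
          // Abort the stage instead of letting the exception crash the scheduler.
          abort(s"Failed to serialize task $taskId, not attempting to retry it: $e")
          throw new TaskNotSerializableException(e)
      }
    // TaskSchedulerImpl catches TaskNotSerializableException and returns an empty
    // list of task descriptions for the offer, since no tasks were launched.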

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mccheah/spark 
task-set-manager-properly-handle-ser-err

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3638


commit 097e7a21e15d3adf45687bd58ff095088f0282f7
Author: mcheah mch...@palantir.com
Date:   2014-12-06T01:45:41Z

[SPARK-4737] Catching task serialization exception in TaskSetManager

Our previous attempt at handling un-serializable tasks involved
selectively sampling a task from a task set, and attempting to serialize
it. If the serialization was successful, we assumed that all tasks in
the task set would also be serializable.

Unfortunately, this is not always the case. For example,
ParallelCollectionRDD may have both empty and non-empty partitions, and
the empty partitions would be serializable while the non-empty
partitions actually contain non-serializable objects. This is one of
many examples where sampling task serialization breaks.

When task serialization exceptions occurred in the TaskSchedulerImpl and
TaskSetManager, the result was that the exception was not caught and the
entire scheduler would crash. It would restart, but in a bad state.

There's no reason why the stage should not be aborted if any
serialization error occurs when submitting a task set. If any task in a
task set throws an exception upon serialization, the task set manager
informs the DAGScheduler that the stage failed and aborts the stage. The
TaskSchedulerImpl needs to return a set of task descriptions that were
successfully submitted, but the set will be empty in the case of a
serialization error.

commit bf5e706918d92c761fa537a88bc15ec2c4cc7838
Author: mcheah mch...@palantir.com
Date:   2014-12-08T20:39:45Z

Fixing indentation.







[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66184682
  
  [Test build #24228 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24228/consoleFull)
 for   PR 3638 at commit 
[`bf5e706`](https://github.com/apache/spark/commit/bf5e706918d92c761fa537a88bc15ec2c4cc7838).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66184722
  
  [Test build #24228 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24228/consoleFull)
 for   PR 3638 at commit 
[`bf5e706`](https://github.com/apache/spark/commit/bf5e706918d92c761fa537a88bc15ec2c4cc7838).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TaskNotSerializableException(error: Throwable) extends 
Exception(error)`






[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66184723
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24228/
Test FAILed.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66186961
  
  [Test build #24229 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24229/consoleFull)
 for   PR 3638 at commit 
[`5f486f4`](https://github.com/apache/spark/commit/5f486f462233ae63987aa483e6d6eab342feef96).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66187144
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24229/
Test FAILed.





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66187140
  
  [Test build #24229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24229/consoleFull)
 for   PR 3638 at commit 
[`5f486f4`](https://github.com/apache/spark/commit/5f486f462233ae63987aa483e6d6eab342feef96).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TaskNotSerializableException(error: Throwable) extends 
Exception(error)`






[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66188975
  
  [Test build #24230 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24230/consoleFull)
 for   PR 3638 at commit 
[`94844d7`](https://github.com/apache/spark/commit/94844d736ed0d8322e2e0dda762961a9170d6a1d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-6615
  
  [Test build #24227 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24227/consoleFull)
 for   PR 3637 at commit 
[`83109eb`](https://github.com/apache/spark/commit/83109ebef2fca4b6d28a83bf405c2edf1e5075db).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LabeledPoint(label: Double, features: Vector, weight: 
Double) `






[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66188897
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24227/
Test FAILed.





[GitHub] spark pull request: SPARK-3996: Shade Jetty in Spark deliverables.

2014-12-08 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/3130#issuecomment-66189849
  
Wanted to follow up on this - the priority of getting this done was just 
increased for us.





[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66192578
  
  [Test build #24231 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24231/consoleFull)
 for   PR 3633 at commit 
[`f370a4e`](https://github.com/apache/spark/commit/f370a4e710b1ff29a5749944a1557de233223dc6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-08 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66192930
  
@avulanov I did a couple of performance tunings in the MLOR gradient calculation 
in my company's proprietary implementation, which runs 4x faster than the 
open-source one on GitHub that you tested. I'm trying to make it open source and 
merge it into Spark soon. (P.S. Simple polynomial expansion with MLOR can increase 
the mnist8m accuracy from 86% to 94% in my experiment. See Prof. CJ Lin's talk: 
https://www.youtube.com/watch?v=GCIJP0cLSmU )





[GitHub] spark pull request: [WIP] SPARK-2450 Adds exeuctor log links to We...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3486#discussion_r21489595
  
--- Diff: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 ---
@@ -50,10 +50,16 @@ private[spark] class CoarseGrainedExecutorBackend(
   override def preStart() {
     logInfo("Connecting to driver: " + driverUrl)
     driver = context.actorSelection(driverUrl)
-    driver ! RegisterExecutor(executorId, hostPort, cores)
+    driver ! RegisterExecutor(executorId, hostPort, cores, extractLogUrls)
     context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
   }
 
+  def extractLogUrls : Map[String, String] = {
+    val prefix = "SPARK_LOG_URL_"
--- End diff --

On a related note, I added proper command line parsing to 
CoarseGrainedExecutorBackend over in #3233, which could be a nicer alternative 
to env variables.
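For what it's worth, an env-variable based `extractLogUrls` can stay very small; a sketch, assuming the log URLs are exported as `SPARK_LOG_URL_*` variables as the prefix above suggests:

    def extractLogUrls: Map[String, String] = {
      val prefix = "SPARK_LOG_URL_"
      sys.env.filterKeys(_.startsWith(prefix))
        .map { case (k, v) => (k.substring(prefix.length).toLowerCase, v) }
    }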





[GitHub] spark pull request: [WIP] SPARK-2450 Adds exeuctor log links to We...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3486#discussion_r21490039
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -183,6 +193,16 @@ trait SparkListener {
* Called when the driver receives task metrics from an executor in a 
heartbeat.
*/
   def onExecutorMetricsUpdate(executorMetricsUpdate: 
SparkListenerExecutorMetricsUpdate) { }
+
+  /**
+   * Called when the driver registers a new executor.
+   */
+  def onExecutorAdded(executorAdded: SparkListenerExecutorAdded) { }
--- End diff --

Hmmm. This is going to be one of those cases where it breaks existing code 
that extends this class. Not sure if there's a good workaround (even though it 
is marked as `@DeveloperApi`). :-/





[GitHub] spark pull request: [WIP] SPARK-2450 Adds exeuctor log links to We...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3486#discussion_r21490459
  
--- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala 
---
@@ -79,6 +80,7 @@ private[ui] class ExecutorsPage(
           Shuffle Write
         </span>
       </th>
+      <th class="sorttable_nosort">Logs</th>
--- End diff --

Should this be conditioned on whether logs actually exist?





[GitHub] spark pull request: Cdh5

2014-12-08 Thread orenmazor
GitHub user orenmazor opened a pull request:

https://github.com/apache/spark/pull/3639

Cdh5

https://github.com/Shopify/dataops/issues/2

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Shopify/spark cdh5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3639


commit 422de4cc2a823e16b86fd22095e35d1ebe842a12
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-15T01:29:43Z

Add compile script for packserv

commit 4ffa04cc6cc7bb8086a422a94d4f2e4105a69786
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-15T02:18:29Z

Don't compile streaming when assembling cause it doesn't build against 
CDH4.4.0

commit b7bf08171e8eb796d86408ce5712175d781e0f8d
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-15T02:22:10Z

Make script compile executable

commit 65033e665c75f4e82b56c8113c99308f8b419704
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-15T02:22:57Z

Make script compile bash

commit 9e6fc96f461864f4ffdd6c8aefaa53b6fd8c4ae0
Author: Mark Cooper mcoo...@quantcast.com
Date:   2013-11-20T22:26:42Z

Add a environment variable that allows for configuring a different path to 
Spark binaries when running Spark from a different location locally

commit fdb0ce298048832f75b24b464fdf59fb791f869f
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-15T19:24:05Z

Add fixed conf file with proper master and remote spark home

commit a837356d7d84641ab504522e74cedc4b5d865aa3
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T01:50:22Z

Copy in hadoop core-site.xml so local clones know where to find hdfs.

commit 4d0f3682e0931c21ba6e5b01fc42ee33a44453e1
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T02:12:36Z

Update the given spark env to actually work, and only if a custom master 
isn't provided.

commit 01cf4c51f2c3c3089ee91dd64d6cab32dd17aa70
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T04:22:11Z

Allow controlling the number of cores pyspark uses using the `-c` option, 
like spark-shell.
 
 - Turns out there isn't actually a way right now to control the number of 
cores an interactive pyspark session uses, which is annoying if more than one 
person is trying to work on a cluster interactively at once. 
 - Use the python 2.7 stdlib argparse library to pull out the -c option
 - This requires changing the bin/pyspark shell script to pass all 
arguments to the python script instead of allowing the python interpreter 
program to parse any of them.

commit 91ddfb4c43a88a4cf0082e445e2e82bcde069969
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T04:22:34Z

Merge branch 'pyspark_cores'

commit 0b44511492131b60f744527eee467fd147e4f4c0
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T14:51:05Z

Revert Merge branch 'pyspark_cores'

This reverts commit 91ddfb4c43a88a4cf0082e445e2e82bcde069969, reversing
changes made to 4d0f3682e0931c21ba6e5b01fc42ee33a44453e1.

commit b4c5ff7e7d6d550743e3aa97710fa514744b0c6e
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T15:22:09Z

Auto setup python and warn if the vpn isn't connected

commit 4c2c45eaf14197b79cf5949bb370a74c52a38ff0
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T15:34:05Z

Add an applescript to the spark conf file that autoconnects the VPN if it 
can't find the interface the VPN should create

commit 712b8856e4b14f88d34da569505c59884d8e8155
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T19:24:21Z

Check to see if Viscosity is a thing before trying to tell it to connect in 
spark env setup

commit 986a60c0b9a880ee0fd7242e53458efbebfab73e
Author: Harry Brundage harry.brund...@gmail.com
Date:   2014-01-21T21:21:18Z

Merge pull request #1 from Shopify/autoconnect_vpn

Autoconnect VPN

commit b944fc6ff5f5866b93e937f7d7629370c24944f0
Author: Dana Klassen klassen.d...@gmail.com
Date:   2014-01-23T01:36:53Z

change configuration to be set through environment variable

commit 5eace91604360da5b446a96582f141b09ab109c1
Author: Erik Selin erik.se...@jadedpixel.com
Date:   2014-01-23T04:14:17Z

apply pr 494 and 496

commit 42dc1708daec21a3ba302f61f473afa57fb5c12c
Author: Dana Klassen klassen.d...@gmail.com
Date:   2014-01-23T12:15:27Z

Merge pull request #2 from Shopify/config_hdfs

Config hdfs

commit 2b6c170b50f58ccdbe1e2faaf4ff3439bdf9e01e
Author: Erik Selin tyr...@gmail.com
Date:   2014-01-23T15:28:32Z

Merge pull request #3 from Shopify/apply_494_and_496

apply pr 494 and 496

commit 25c5a0d90c5926133b32e43f5e6a8d1a58c0685c
Author: Patrick Wendell pwend...@gmail.com

[GitHub] spark pull request: Cdh5

2014-12-08 Thread orenmazor
Github user orenmazor closed the pull request at:

https://github.com/apache/spark/pull/3639





[GitHub] spark pull request: [WIP] SPARK-2450 Adds exeuctor log links to We...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3486#discussion_r21490919
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -183,6 +193,16 @@ trait SparkListener {
* Called when the driver receives task metrics from an executor in a 
heartbeat.
*/
   def onExecutorMetricsUpdate(executorMetricsUpdate: 
SparkListenerExecutorMetricsUpdate) { }
+
+  /**
+   * Called when the driver registers a new executor.
+   */
+  def onExecutorAdded(executorAdded: SparkListenerExecutorAdded) { }
--- End diff --

BTW doesn't this break the build? There are a few listeners in Spark code 
itself (e.g. `EventLoggingListener`) which should have broken because of this.

(BTW fixing that listener means you'll probably need to touch 
`JsonProtocol` to serialize these new events to the event log... and you'll 
need to be careful not to keep the log URLs in the replayed UIs since they'll 
most probably be broken links at that point. Meaning that probably the UI 
listener should nuke the log URLs when the executor removed message is 
handled.)





[GitHub] spark pull request: [WIP] SPARK-2450 Adds exeuctor log links to We...

2014-12-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3486#discussion_r21491005
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -183,6 +193,16 @@ trait SparkListener {
* Called when the driver receives task metrics from an executor in a 
heartbeat.
*/
   def onExecutorMetricsUpdate(executorMetricsUpdate: 
SparkListenerExecutorMetricsUpdate) { }
+
+  /**
+   * Called when the driver registers a new executor.
+   */
+  def onExecutorAdded(executorAdded: SparkListenerExecutorAdded) { }
--- End diff --

Ah wait. I see. These methods have default implementations, so they'll only 
affect people extending `SparkListener` from Java. Still, we should probably 
save these events to the log for replay later.
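In other words, existing Scala listeners keep compiling because the new callback defaults to a no-op; only code that cares about the event needs to override it (illustrative only):

    class MyListener extends SparkListener {
      // Other callbacks fall back to the empty default implementations in the trait.
      override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
        println(s"Executor added: $executorAdded")
      }
    }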





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3574#discussion_r21491375
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1010,7 +1010,10 @@ private[spark] class BlockManager(
   info.synchronized {
 // required ? As of now, this will be invoked only for blocks 
which are ready
--- End diff --

This comment actually refers to the `!info.waitForReady()` case, so I'd 
like to either move the comment or swap the order of these checks so that we 
check for `blockInfo.get(blockId).isEmpty` in the `else if` clause instead.
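Concretely, the suggestion is something along these lines (a sketch only; the surrounding method is abbreviated):

    info.synchronized {
      if (!info.waitForReady()) {
        // required? As of now, this will be invoked only for blocks which are ready
        logWarning(s"Block $blockId was marked as failure. Nothing to drop")
      } else if (blockInfo.get(blockId).isEmpty) {
        logWarning(s"Block $blockId was already dropped.")
      } else {
        // proceed with dropping the block
      }
    }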





[GitHub] spark pull request: [SPARK-4714][CORE]: Add checking info is null ...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3574#issuecomment-66199309
  
Left one minor code organization comment; aside from that, this looks good 
to me and should be ready to merge after you fix that up (I can do it if you 
don't have time, though; just let me know).

There are a couple of edits that I'd like to make to the commit title / 
description before merging this, but I can do it myself on merge.

Thanks for the careful analysis and for catching this issue!





[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66201371
  
  [Test build #24230 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24230/consoleFull)
 for   PR 3638 at commit 
[`94844d7`](https://github.com/apache/spark/commit/94844d736ed0d8322e2e0dda762961a9170d6a1d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TaskNotSerializableException(error: Throwable) extends 
Exception(error)`






[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3638#issuecomment-66201380
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24230/
Test PASSed.





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3409#issuecomment-66202544
  
  [Test build #24232 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24232/consoleFull)
 for   PR 3409 at commit 
[`e3f9abe`](https://github.com/apache/spark/commit/e3f9abeaa82018835cd9a7055adba0dabc451a24).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66203211
  
The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121 
in schemaFor) where it gets confused if the case class includes multiple 
constructors.  The default behavior should probably be to take the constructor 
with the most arguments, but I'll consult others about this.  This PR may be on 
temporary hold...
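For context, the pattern that trips up `schemaFor` is a case class with an auxiliary constructor, e.g. the `LabeledPoint` in this PR (sketched; the auxiliary constructor defaults weight to 1.0 as described in the commit log):

    case class LabeledPoint(label: Double, features: Vector, weight: Double) {
      def this(label: Double, features: Vector) = this(label, features, 1.0)
    }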





[GitHub] spark pull request: [SPARK-3382] GradientDescent convergence toler...

2014-12-08 Thread Lewuathe
Github user Lewuathe commented on the pull request:

https://github.com/apache/spark/pull/3636#issuecomment-66203442
  
@jkbradley Thank you for reviewing. I'll update these points soon.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2014-12-08 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r21493950
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala 
---
@@ -54,8 +46,25 @@ private[spark] class ClientArguments(args: 
Array[String], sparkConf: SparkConf)
   loadEnvironmentArgs()
   validateArgs()
 
+  // Additional memory to allocate to containers
+  // For now, use driver's memory overhead as our AM container's memory 
overhead
--- End diff --

This comment is no longer true





[GitHub] spark pull request: SPARK-4770. [DOC] [YARN] spark.scheduler.minRe...

2014-12-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3624#issuecomment-66204351
  
@tgravescs It should be fine to pull docs-only changes into `branch-1.2`.  
We're trying to hold off on merging code changes that aren't addressing 1.2.0 
release blockers because we don't want to risk introducing new regressions and 
having to call new votes.  If you do want to merge a code change that should 
eventually be backported into `branch-1.2`, just merge it into the other 
branches, leave its JIRA open with 1.2.1 listed in Target Version/s and not Fix 
Version/s, then add the `backport-needed` label to the issue so that we 
remember to come back to it after 1.2.0 is released.





[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66204818
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24231/
Test PASSed.





[GitHub] spark pull request: [SPARK-4759] Avoid using empty string as defau...

2014-12-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3633#issuecomment-66204811
  
  [Test build #24231 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24231/consoleFull)
 for   PR 3633 at commit 
[`f370a4e`](https://github.com/apache/spark/commit/f370a4e710b1ff29a5749944a1557de233223dc6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-08 Thread Lewuathe
Github user Lewuathe commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r21494740
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.beans.BeanInfo
+
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.mllib.linalg.Vector
+
+/**
+ * :: AlphaComponent ::
+ * Class that represents an instance (data point) for prediction tasks.
+ *
+ * @param label Label to predict
+ * @param features List of features describing this instance
+ * @param weight Instance weight
+ */
+@AlphaComponent
+@BeanInfo
+case class LabeledPoint(label: Double, features: Vector, weight: Double) {
--- End diff --

Why is the label of `LabeledPoint` assumed to be only a `Double`? I think there 
are some cases where the label is not a `Double`, such as one-of-k encoding. It seems 
better not to restrict it to the `Double` type. If I missed some alternatives, sorry 
for that and please let me know. 




