Re: when running the same job, the time spark uses is very different from shark.

2014-03-07 Thread Mayur Rustagi
So there are static costs associated with parsing the queries and structuring
the operators, but they should not be that much.
Another thing is that in Shark all the data is passed through a parser,
serialized & passed through the filter & sent to the driver.
In Spark the data is simply read as text, run through contains & the matching
data is returned to the driver.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Mar 6, 2014 at 7:39 PM, qingyang li liqingyang1...@gmail.com wrote:

 Hi, community, I have set up a 3-node spark cluster using standalone mode;
 each machine's memory is 16G and each has 4 cores.

 when i run

 val file = sc.textFile("/user/hive/warehouse/b/test.txt")
 file.filter(line => line.contains("2013-")).count()

 it costs 2.7s,

 but when i run "select count(*) from b;" using shark, it costs 15.81s.

 So, why does shark use more time than spark?

 other info:

 1. i have set export SPARK_MEM=10g in shark-env.sh
 2. test.txt is 4.21G and exists in the directory /user/hive/warehouse/b/ on
 each machine, and test.txt has been loaded into memory.
 3. there are 38532979 lines in test.txt



[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36976663
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36976664
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13039/




[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-07 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/18#issuecomment-36977010
  
Is MLI-2 not a good JIRA issue to use for this?




[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/18#issuecomment-36977058
  
Merged build started.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-36977336
  
Merged build started.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/96

[SPARK-1194] Fix the same-RDD rule for cache replacement

SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194

In the current implementation, when selecting candidate blocks to be
swapped out, once we find a block from the same RDD that the block to be stored
belongs to, cache eviction fails and aborts.

In this PR, we keep selecting blocks *not* from the RDD that the block to
be stored belongs to until either enough free space can be ensured (cache
eviction succeeds) or all such blocks are checked (cache eviction fails).
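A minimal self-contained sketch of the revised selection logic (a simplified
model for illustration; the names mirror MemoryStore's but this is not the
PR's actual code):

```scala
// Simplified model of MemoryStore's candidate selection: keep scanning blocks
// that do NOT belong to the RDD being stored until enough space is freed
// (success) or the entries are exhausted (failure).
def selectBlocksToEvict(
    entries: Seq[(String, Long)],    // (blockId, size), in LRU order
    space: Long,                     // bytes needed for the new block
    rddToAdd: Option[Int],           // RDD id of the block being stored
    getRddId: String => Option[Int]  // RDD id a cached block belongs to
  ): Option[Seq[String]] = {
  var selectedMemory = 0L
  val selected = scala.collection.mutable.ArrayBuffer.empty[String]
  val iterator = entries.iterator
  while (selectedMemory < space && iterator.hasNext) {
    val (blockId, size) = iterator.next()
    // Same-RDD rule: skip blocks of the RDD we are storing, instead of aborting.
    if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
      selected += blockId
      selectedMemory += size
    }
  }
  if (selectedMemory >= space) Some(selected) else None // None = eviction fails
}
```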

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark fix-spark-1194

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/96.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #96


commit 62c92ac7b8e616529bdaa52b73eb70e50bc01b47
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-03-07T08:32:47Z

Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194






Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-07 Thread DB Tsai
Hi Xiangrui,

I think it doesn't matter whether we use Fortran/Breeze/RISO for
optimizers since optimization only takes < 1% of the time. Most of the
time is in the gradientSum and lossSum parallel computation.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/


On Thu, Mar 6, 2014 at 7:10 PM, Xiangrui Meng men...@gmail.com wrote:
 Hi DB,

 Thanks for doing the comparison! What were the running times for
 fortran/breeze/riso?

 Best,
 Xiangrui

 On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote:
 Hi David,

 I can converge to the same result with your breeze LBFGS and Fortran
 implementations now. Probably I made some mistakes when I tried
 breeze before. I apologize for claiming it's not stable.

 See the test case in BreezeLBFGSSuite.scala
 https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS

 This is training multinomial logistic regression against the iris dataset,
 and both optimizers can train the models with 98% training accuracy.

 There are two issues with using Breeze in Spark:

 1) When the gradientSum and lossSum are computed distributively in a
 custom-defined DiffFunction which is passed into your optimizer,
 Spark will complain that the LBFGS class is not serializable. In
 BreezeLBFGS.scala, I had to convert the RDD to an array to make it work
 locally. It should be easy to fix by just having LBFGS implement
 Serializable. (A minimal local sketch of driving Breeze's LBFGS follows
 the logs below.)

 2) Breeze computes redundant gradients and losses. See the following logs
 from both the Fortran and Breeze implementations.

 Thanks.

 Fortran:
 Iteration -1: loss 1.3862943611198926, diff 1.0
 Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
 Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
 Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
 Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
 Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
 Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
 Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
 Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
 Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
 Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
 Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627

 Breeze:
 Iteration -1: loss 1.3862943611198926, diff 1.0
 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
 WARNING: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
 WARNING: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Iteration 0: loss 1.3862943611198926, diff 0.0
 Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
 Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
 Iteration 3: loss 1.1242501524477688, diff 0.0
 Iteration 4: loss 1.1242501524477688, diff 0.0
 Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
 Iteration 6: loss 1.0930151243303563, diff 0.0
 Iteration 7: loss 1.0930151243303563, diff 0.0
 Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
 Iteration 9: loss 1.054036932835569, diff 0.0
 Iteration 10: loss 1.054036932835569, diff 0.0
 Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
 Iteration 12: loss 0.9907956302751622, diff 0.0
 Iteration 13: loss 0.9907956302751622, diff 0.0
 Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761
 Iteration 15: loss 0.9184205380342829, diff 0.0
 Iteration 16: loss 0.9184205380342829, diff 0.0
 Iteration 17: loss 0.8259870936519939, diff 0.1006438117513297
 Iteration 18: loss 0.8259870936519939, diff 0.0
 Iteration 19: loss 0.8259870936519939, diff 0.0
 Iteration 20: loss 0.6327447552109576, diff 0.233952934583647
 Iteration 21: loss 0.6327447552109576, diff 0.0
 Iteration 22: loss 0.6327447552109576, diff 0.0
 Iteration 23: loss 0.5534101162436362, diff 0.12538154276652747
 Iteration 24: loss 0.5534101162436362, diff 0.0
 Iteration 25: loss 0.5534101162436362, diff 0.0
 Iteration 26: loss 0.40450200866125635, diff 0.2690732137675816
 Iteration 27: loss 0.40450200866125635, diff 0.0
 Iteration 28: loss 0.40450200866125635, diff 0.0
 Iteration 29: loss 0.30788249908237314, diff 0.23885980452569502
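For reference, a minimal local sketch of driving Breeze's LBFGS with a custom
DiffFunction (a toy quadratic objective, not the thread's multinomial logistic
regression; in the distributed case the loss and gradient inside calculate
would be the RDD-backed sums discussed above):

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective: f(w) = ||w - b||^2 with gradient 2 * (w - b).
val b = DenseVector(1.0, 2.0, 3.0)
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = w - b
    (diff dot diff, diff * 2.0) // (loss, gradient)
  }
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
val wOpt = lbfgs.minimize(f, DenseVector.zeros[Double](3)) // converges to b
```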

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/


 On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote:
 On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:

 Hi David,

 On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
  I'm happy to help fix any problems. I've verified at points that the
  implementation gives the exact same sequence of iterates for a few different
  functions (with a particular line search) as the c port of lbfgs. So I'm a
  little surprised it fails where Fortran succeeds... but only a little. This
  was 

[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-36980467
  
Merged build started.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-36980466
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-36980445
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13041/




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-36980547
  
 Merged build triggered.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-36980553
  
Merged build started.




[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/18#issuecomment-36980443
  
Merged build finished.




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-36980442
  
Merged build finished.




[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/42#issuecomment-36983520
  
 Build triggered.




Re: special case of custom partitioning

2014-03-07 Thread Manoj Awasthi
Thanks Mayur - based on the doc-comments in the source it looks like this will
work for the case. I will confirm.
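A minimal sketch of the co-location idea (an illustration under the assumption
that partitioning both RDDs with the same partitioner addresses the case; this
is not PartitionerAwareUnionRDD itself):

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions

val sc = new SparkContext("local", "copartition-example")
val p = new HashPartitioner(3)

// Sharing the same partitioner guarantees that key 'a' maps to the same
// partition index in both RDDs.
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(p)
val rdd2 = sc.parallelize(Seq(("a", 2), ("b", 7), ("c", 9))).partitionBy(p)

// join() sees matching partitioners and avoids reshuffling either side.
val joined = rdd1.join(rdd2)
```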


the dreamers of the day are dangerous men, for they may act their dream
with open eyes, and make it possible


On Fri, Mar 7, 2014 at 2:21 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 How about PartitionerAwareUnionRDD?

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Mar 6, 2014 at 9:42 AM, Evan Chan e...@ooyala.com wrote:

  I would love to hear the answer to this as well.
 
  On Thu, Mar 6, 2014 at 4:09 AM, Manoj Awasthi awasthi.ma...@gmail.com
  wrote:
   Hi All,
  
  
    I have a three-machine cluster. I have two RDDs, each consisting of (K,V)
    pairs. The RDDs have just three keys: 'a', 'b' and 'c'.
  
    // list1 -> List(('a',1), ('b',2), ...
    val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3))

    // list2 -> List(('a',2), ('b',7), ...
    val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3))
  
    By using a HashPartitioner with 3 partitions I can achieve that each of the
    keys ('a', 'b' and 'c') in each RDD gets partitioned onto a different
    machine in the cluster (based on the hashCode).

    The problem is that I cannot deterministically do the same allocation for
    the second RDD (all 'a's from rdd2 going to the same machine where the 'a's
    from the first RDD went).
  
   Is there a way to achieve this?
  
   Manoj
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 



[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-37012316
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37013190
  
 Merged build triggered.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread ScrapCodes
GitHub user ScrapCodes opened a pull request:

https://github.com/apache/spark/pull/97

Spark 1162 Implemented takeOrdered in pyspark.

Since python does not have a library for a max heap, and the usual tricks like
inverting values etc. do not work for all cases, the best thing I could think
of is to modify heapq itself.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ScrapCodes/spark-1 
SPARK-1162/pyspark-top-takeOrdered2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #97


commit 3e7a57506ce139af804f89f16a3404624d784f7e
Author: Prashant Sharma prashan...@imaginea.com
Date:   2014-03-06T12:12:16Z

Added top in python.

commit 3bedad7dfe3b18ee9f64cc376627d3d7489a0e9f
Author: Prashant Sharma prashan...@imaginea.com
Date:   2014-03-07T10:35:31Z

Added takeOrdered






[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37013191
  
Merged build started.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37016128
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37016129
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13045/




[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread guojc
GitHub user guojc opened a pull request:

https://github.com/apache/spark/pull/98

Add timeout for fetch file

Currently, when fetching a file, the connection's connect timeout
and read timeout are based on the default JVM settings. In this change, I
change them to use spark.worker.timeout. This can be useful when the
connection status between workers is not perfect, and prevents
prematurely removing a task set.
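A minimal sketch of the underlying mechanism (an assumed illustration, not the
PR's actual code; `timeoutMs` is a placeholder that would be derived from
spark.worker.timeout, and the URL is hypothetical):

```scala
import java.net.URL

// Bound both phases of a fetch instead of relying on the JVM defaults,
// which can block for a long time when a worker connection is flaky.
val timeoutMs = 60 * 1000
val conn = new URL("http://worker-host:8080/some/file").openConnection()
conn.setConnectTimeout(timeoutMs) // fail fast if the worker is unreachable
conn.setReadTimeout(timeoutMs)    // fail fast if the transfer stalls
val in = conn.getInputStream()    // throws SocketTimeoutException on timeout
```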

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guojc/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/98.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #98


commit 2a37c34b0f6399142f8bc093439e983313884eeb
Author: Jiacheng Guo guoj...@gmail.com
Date:   2014-03-07T15:24:05Z

Add timeout for fetch file

Currently, when fetching a file, the connection's connect timeout
and read timeout are based on the default JVM settings. In this change, I
change them to use spark.worker.timeout. This can be useful when the
connection status between workers is not perfect, and prevents
prematurely removing a task set.






[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/98#issuecomment-37033983
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10386811
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
-          return false
+        // Apply the same-RDD rule for cache replacement. Quoted from the
+        // original RDD paper:
+        //
+        //   When a new RDD partition is computed but there is not enough
+        //   space to store it, we evict a partition from the least recently
+        //   accessed RDD, unless this is the same RDD as the one with the
+        //   new partition. In that case, we keep the old partition in memory
+        //   to prevent cycling partitions from the same RDD in and out.
+        //
+        // TODO implement LRU eviction
--- End diff --

entries is already a LinkedHashMap, so you iterate in LRU order: you can remove
the comment.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10388297
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
-          return false
+        // Apply the same-RDD rule for cache replacement. Quoted from the
+        // original RDD paper:
+        //
+        //   When a new RDD partition is computed but there is not enough
+        //   space to store it, we evict a partition from the least recently
+        //   accessed RDD, unless this is the same RDD as the one with the
+        //   new partition. In that case, we keep the old partition in memory
+        //   to prevent cycling partitions from the same RDD in and out.
+        //
+        // TODO implement LRU eviction
--- End diff --

I see




Re: ALS solve.solvePositive

2014-03-07 Thread Debasish Das
Hi Xiangrui,

I used lambda = 0.1... It is possible that 2 users ranked movies in a
very similar way...

I agree that increasing lambda will solve the problem, but you'll agree this is
not a solution... lambda should be tuned based on sparsity / other criteria,
and not to make a linearly dependent Hessian matrix linearly
independent...

Thanks.
Deb





On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote:

 If the matrix is very ill-conditioned, then A^T A becomes numerically
 rank deficient. However, if you use a reasonably large positive
 regularization constant (lambda), A^T A + lambda I should be still
 positive definite. What was the regularization constant (lambda) you
 set? Could you test whether the error still happens when you use a
 large lambda?

 Best,
 Xiangrui
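For reference, the regularized least-squares step under discussion, in standard
notation (this formulation is implied by the thread rather than quoted from it):

```latex
\min_{x}\ \|Ax - b\|_2^2 + \lambda \|x\|_2^2
\quad\Longrightarrow\quad
(A^\top A + \lambda I)\, x = A^\top b
```

For any lambda > 0, A^T A + lambda I is positive definite even when A^T A is
rank deficient, since its smallest eigenvalue is at least lambda.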



[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10388411
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
--- End diff --

Agree, thanks.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37040297
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13046/




[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/94#issuecomment-37040303
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13047/




[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/94#issuecomment-37040302
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37041120
  
Merged build started.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37041118
  
 Merged build triggered.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10390021
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
-          return false
+        // Apply the same-RDD rule for cache replacement. Quoted from the
+        // original RDD paper:
+        //
+        //   When a new RDD partition is computed but there is not enough
+        //   space to store it, we evict a partition from the least recently
+        //   accessed RDD, unless this is the same RDD as the one with the
+        //   new partition. In that case, we keep the old partition in memory
+        //   to prevent cycling partitions from the same RDD in and out.
+        //
+        // TODO implement LRU eviction
+        rddToAdd match {
+          case Some(rddId) if rddId == getRddId(blockId) =>
--- End diff --

Made a mistake here, `rddId: Int == getRddId(blockId): Option[Int]` never 
holds...




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37046661
  
Merged build finished.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37046789
  
 Merged build triggered.




[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/98#issuecomment-37052716
  
Merged build started.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37052691
  
Merged build finished.




[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/98#issuecomment-37052715
  
 Merged build triggered.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37052692
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13049/




[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/98#issuecomment-37052776
  
@guojc hey I'm wondering - if the default is -1 (unlimited, no timeout)
then why is it removing your task set due to failure? If there is no timeout,
won't it just wait indefinitely until the connection comes back?




[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/94#issuecomment-37053143
  
LGTM thanks for improving the existing code here.




[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/99#issuecomment-37053200
  
 Merged build triggered.




[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/99#issuecomment-37053201
  
Merged build started.




[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM

2014-03-07 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/99

SPARK-929: Fully deprecate usage of SPARK_MEM

(Continued from old repo, prior discussion at 
https://github.com/apache/incubator-spark/pull/615)

This patch cements our deprecation of the SPARK_MEM environment variable by 
replacing it with three more specialized variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely set 
driver/job memory without accidentally setting the executor memory. Neither is 
public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within 
SparkContext). The proper way of configuring executor memory is through the 
spark.executor.memory property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used
by jobs launched by spark-class, without possibly affecting executor memory.

Other memory considerations:
- The repl's memory can be set through the --drivermem command-line
option, which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples'
memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally
overridden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class 
(the first argument is supposed to be the main class name, not java options), 
as well as a bug in the Windows spark-class2.cmd. I have not yet tested this 
patch on either Windows or Mesos, however.
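A minimal sketch of the property-based configuration mentioned above (assuming
the 0.9-era SparkConf API; the master URL and memory size are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure executor memory through the spark.executor.memory property
// instead of the deprecated SPARK_MEM environment variable.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("example")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
```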


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark sparkmem

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/99.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #99


commit 9df4c68262dac0edde1ec5bdd1fd065d2bf34e00
Author: Aaron Davidson aa...@databricks.com
Date:   2014-02-17T23:09:51Z

SPARK-929: Fully deprecate usage of SPARK_MEM

This patch cements our deprecation of the SPARK_MEM environment variable
by replacing it with case-specific variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely
set driver/job memory without accidentally setting the executor memory.
Neither is public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set
within SparkContext). The proper way of configuring executor memory
is through the spark.executor.memory property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory
used by jobs launched by spark-class, without possibly affecting executor
memory.

Other memory considerations:
- The repl's memory can be set through the --drivermem command-line option,
  which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples'
  memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally
  overridden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class
(the first argument is supposed to be the main class name, not java
options).






[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/94#issuecomment-37053538
  
thanks tom, merged this into master.




[GitHub] spark pull request: Add timeout for fetch file

2014-03-07 Thread guojc
Github user guojc commented on the pull request:

https://github.com/apache/spark/pull/98#issuecomment-37054016
  
I'm not sure about the behavior of the default -1, as
http://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html#setReadTimeout%28int%29
says 0 is for infinity. But we do observe some connection errors related to the
fetcher. We want to set the value to a comfortable zone.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37054161
  
@ScrapCodes I think the original scaladoc explains that this performs a 
shuffle, but you didn't copy this code in any of the python/java docs. Would 
you mind adding that? It's sort of important because otherwise people could 
think this is a cheap operation.
```
  /**
   * Return the intersection of this RDD and another one. The output will 
not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * Note that this method performs a shuffle internally.
   */
```





[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10394468
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,18 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
-          return false
+        // Apply the same-RDD rule for cache replacement. Quoted from the
+        // original RDD paper:
+        //
+        //   When a new RDD partition is computed but there is not enough
--- End diff --

Hey @liancheng I think it's okay to remove this quote. If you look at the
scaladoc it already explains the intended policy wrt partitions in the same
RDD - so I think that is sufficient. The scaladoc says "which leads to a
wasteful cyclic replacement pattern for RDDs that don't fit into memory" that
we want to avoid.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/96#discussion_r10394826
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -236,13 +236,18 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
       while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
         val pair = iterator.next()
         val blockId = pair.getKey
-        if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) {
-          logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " +
-            "block from the same RDD")
-          return false
+        // Apply the same-RDD rule for cache replacement. Quoted from the
+        // original RDD paper:
+        //
+        //   When a new RDD partition is computed but there is not enough
--- End diff --

Thanks, removed :)




[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...

2014-03-07 Thread berngp
Github user berngp commented on the pull request:

https://github.com/apache/spark/pull/84#issuecomment-37055758
  
@pwendell, @aarondav, @sryza - a couple of questions.
1. Based on [SPARK-929], would it make sense to also include
--spark-daemon-memory as an optional argument?
2. Should I rebase my changes taking into account [SPARK-929]? I assume I
should.
3. It might make sense to have a ./bin/_functions.sh to share bash functions
across scripts, mainly used by spark-shell and spark-submit (based on
[SPARK-1126]); e.g. such functions could be INFO, WARN, ERROR messages.




Spark 0.9.0 and log4j

2014-03-07 Thread Evan Chan
Hey guys,

This is a follow-up to this semi-recent thread:
http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html

0.9.0 final is causing issues for us as well because we use Logback as
our backend and Spark requires Log4j now.

I see Patrick has a PR #560 to incubator-spark, was that merged in or
left out?

Also I see references to a new PR that might fix this, but I can't
seem to find it in the github open PR page.   Anybody have a link?

As a last resort we can switch to Log4j, but would rather not have to
do that if possible.

thanks,
Evan

-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...

2014-03-07 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/42#issuecomment-37057167
  
Jenkins, retest this please




[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/99#issuecomment-37058576
  
Merged build finished.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37058765
  
Merged build started.




[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/42#issuecomment-37058828
  
 Build triggered.




[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/42#issuecomment-37058830
  
Build started.




[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/42#issuecomment-37064296
  
Build finished.




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37064310
  
Merged build finished.




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/86#discussion_r10399046
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.io.File
+import java.net.URL
+import java.net.URLClassLoader
+
+import scala.collection.mutable.ArrayBuffer
+
+object SparkSubmit {
+  val YARN = 1
+  val STANDALONE = 2
+  val MESOS = 4
+  val LOCAL = 8
+  val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL
+
+  var clusterManager: Int = LOCAL
+
+  def main(args: Array[String]) {
+    val appArgs = new SparkSubmitArguments(args)
+
+    if (appArgs.master != null) {
+      if (appArgs.master.startsWith("yarn")) {
+        clusterManager = YARN
+      } else if (appArgs.master.startsWith("spark")) {
+        clusterManager = STANDALONE
+      } else if (appArgs.master.startsWith("mesos")) {
+        clusterManager = MESOS
+      } else if (appArgs.master.startsWith("local")) {
+        clusterManager = LOCAL
+      } else {
+        System.err.println("master must start with yarn, mesos, spark, or local")
+        System.exit(1)
+      }
+    }
+
+    val deployOnCluster = appArgs.deployMode == "cluster"
--- End diff --

On the other hand, it might make more sense to move towards consistency
between YARN and standalone/mesos, where --master only specifies the cluster
manager and not the application's deploy mode. For this, we would allow
passing just yarn as --master to spark-submit, and yarn-client vs.
yarn-standalone would be inferred from --deploy-mode.
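
A minimal sketch of the inference being proposed (assumed, not the actual
patch; appArgs.master and appArgs.deployMode are the fields from the diff
above):

    // Hypothetical: derive the effective YARN mode from --deploy-mode when
    // the user passes just "yarn" as --master.
    val effectiveMaster = (appArgs.master, appArgs.deployMode) match {
      case ("yarn", "cluster") => "yarn-standalone" // driver runs in the cluster
      case ("yarn", _)         => "yarn-client"     // driver runs locally
      case (master, _)         => master            // spark://, mesos://, local[*]
    }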




[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/100#issuecomment-37079388
  
 Merged build triggered.




[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread pwendell
GitHub user pwendell opened a pull request:

https://github.com/apache/spark/pull/100

SPARK-782 Clean up for ASM dependency.

This makes two changes.

1) Spark uses the shaded version of asm that is (conveniently) published
   with Kryo.
2) Existing exclude rules around asm are updated to reflect the new groupId
   of `org.ow2.asm`. The old rules no longer matched the newer Hadoop
   versions that pull in new asm versions (see the sketch after this list).
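
A hedged sbt sketch of what the updated rule looks like (the hadoop-client
version here is illustrative, not taken from the patch):

    // The pre-rename "asm" % "asm" coordinates no longer match the artifacts
    // that newer Hadoop versions pull in under the org.ow2.asm groupId.
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0" excludeAll(
      ExclusionRule(organization = "org.ow2.asm"),
      ExclusionRule(organization = "asm") // legacy groupId, for older Hadoop
    )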

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pwendell/spark asm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #100


commit 660420f6ee08e4cf17d515f517ea0561b1e9636c
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-03-07T23:09:07Z

SPARK-782 Clean up for ASM dependency.

This makes two changes.

1) Spark uses the shaded version of asm that is (conveniently) published
   with Kryo.
2) Existing exclude rules around asm are updated to reflect the new groupId
   of `org.ow2.asm`. The old rules no longer matched the newer Hadoop
   versions that pull in new asm versions.






[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/100#issuecomment-37079425
  
Come to think of it, we may want to stop excluding asm now since we don't 
directly use it anymore (therefore there can be no conflicts w/ Spark). 




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/86#discussion_r10405655
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.io.File
+import java.net.URL
+import java.net.URLClassLoader
+
+import scala.collection.mutable.ArrayBuffer
+
+object SparkSubmit {
+  val YARN = 1
+  val STANDALONE = 2
+  val MESOS = 4
+  val LOCAL = 8
+  val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL
+
+  var clusterManager: Int = LOCAL
+
+  def main(args: Array[String]) {
+    val appArgs = new SparkSubmitArguments(args)
+
+    if (appArgs.master != null) {
+      if (appArgs.master.startsWith("yarn")) {
+        clusterManager = YARN
+      } else if (appArgs.master.startsWith("spark")) {
+        clusterManager = STANDALONE
+      } else if (appArgs.master.startsWith("mesos")) {
+        clusterManager = MESOS
+      } else if (appArgs.master.startsWith("local")) {
+        clusterManager = LOCAL
+      } else {
+        System.err.println("master must start with yarn, mesos, spark, or local")
+        System.exit(1)
+      }
+    }
+
+    val deployOnCluster = appArgs.deployMode == "cluster"
+    val childClasspath = new ArrayBuffer[String]()
+    val childArgs = new ArrayBuffer[String]()
+    var childMainClass = ""
+
+    if (clusterManager == MESOS && deployOnCluster) {
+      System.err.println("Mesos does not support running the driver on the cluster")
+      System.exit(1)
+    }
+
+    if (deployOnCluster && clusterManager == STANDALONE) {
+      childMainClass = "org.apache.spark.deploy.Client"
+      childArgs += "launch"
+      childArgs += (appArgs.master, appArgs.primaryResource, appArgs.mainClass)
+    } else if (deployOnCluster && clusterManager == YARN) {
+      childMainClass = "org.apache.spark.deploy.yarn.Client"
+      childArgs += ("--jar", appArgs.primaryResource)
+      childArgs += ("--class", appArgs.mainClass)
+    } else {
+      childMainClass = appArgs.mainClass
+      childClasspath += appArgs.primaryResource
+    }
+
+    val options = List[OptionAssigner](
+      new OptionAssigner(appArgs.driverMemory, YARN, true, clOption = "--master-memory"),
+      new OptionAssigner(appArgs.name, YARN, true, clOption = "--name"),
+      new OptionAssigner(appArgs.queue, YARN, true, clOption = "--queue"),
+      new OptionAssigner(appArgs.queue, YARN, false, sysProp = "spark.yarn.queue"),
+      new OptionAssigner(appArgs.numExecutors, YARN, true, clOption = "--num-workers"),
+      new OptionAssigner(appArgs.numExecutors, YARN, false, sysProp = "spark.worker.instances"),
+      new OptionAssigner(appArgs.executorMemory, YARN, false, clOption = "--worker-memory"),
+      new OptionAssigner(appArgs.executorMemory, STANDALONE, true, clOption = "--memory"),
+      new OptionAssigner(appArgs.executorMemory, STANDALONE | MESOS | YARN, false,
+        sysProp = "spark.executor.memory"),
+      new OptionAssigner(appArgs.executorCores, YARN, true, clOption = "--worker-cores"),
+      new OptionAssigner(appArgs.executorCores, STANDALONE, true, clOption = "--cores"),
+      new OptionAssigner(appArgs.executorCores, STANDALONE | MESOS | YARN, false,
+        sysProp = "spark.cores.max"),
+      new OptionAssigner(appArgs.files, YARN, false, sysProp = "spark.yarn.dist.files"),
+      new OptionAssigner(appArgs.files, YARN, true, clOption = "--files"),
+      new OptionAssigner(appArgs.archives, YARN, false, sysProp = "spark.yarn.dist.archives"),
+      new OptionAssigner(appArgs.archives, YARN, true, clOption = "--archives"),
+      new OptionAssigner(appArgs.moreJars, YARN, true, clOption = "--addJars")
+    )
+
+    // more jars
+    if (appArgs.moreJars != null && !deployOnCluster) {
+      childClasspath += appArgs.moreJars
+    }
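
For readers skimming the diff, a minimal sketch of the OptionAssigner
pattern used above (field names assumed from the diff, not the actual
class): each assigner routes one user-supplied value either to a
command-line option of the child process or to a system property, gated on
cluster manager and deploy mode.

    private case class OptionAssigner(
        value: String,
        clusterManager: Int,
        deployOnCluster: Boolean,
        clOption: String = null,
        sysProp: String = null)

    // Roughly how the list would be applied (sysProps assumed to be a
    // mutable map of system properties for the child process):
    for (opt <- options) {
      if (opt.value != null && deployOnCluster == opt.deployOnCluster &&
          (clusterManager & opt.clusterManager) != 0) {
        if (opt.clOption != null) {
          childArgs += (opt.clOption, opt.value)
        } else if (opt.sysProp != null) {
          sysProps(opt.sysProp) = opt.value
        }
      }
    }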
  

JAVA Cassandra Test example

2014-03-07 Thread sateesh
import java.io.IOException;
import java.io.Serializable;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.cassandra.db.Column;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;
import scala.Tuple3;

public class CassandraSparkConnectionTest implements Serializable {

    public static void main(String[] args) throws IOException {
        new CassandraSparkConnectionTest().process();
    }

    @SuppressWarnings({ "unchecked", "serial" })
    public void process() throws IOException {
        String host = "localhost";
        String port = "9160";

        JavaSparkContext sparkContext = new JavaSparkContext("local",
                "cassandraSparkConnectionTest", System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(CassandraSparkConnectionTest.class));
        Job job = new Job();
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        ConfigHelper.setInputInitialAddress(job.getConfiguration(), host);
        ConfigHelper.setInputRpcPort(job.getConfiguration(), port);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), host);
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), port);
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "casDemo", "Words");
        // ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "casDemo", "WordCount");

        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        // ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");

        SlicePredicate predicate = new SlicePredicate();
        SliceRange sliceRange = new SliceRange(toByteBuffer(""), toByteBuffer(""),
                false, 20);
        predicate.setSlice_range(sliceRange);
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        Map<ByteBuffer, Column> valueClass = new TreeMap<ByteBuffer, Column>();

        JavaPairRDD<ByteBuffer, TreeMap<ByteBuffer, Column>> rdd = sparkContext
                .newAPIHadoopRDD(job.getConfiguration(),
                        ColumnFamilyInputFormat.class
                                .asSubclass(org.apache.hadoop.mapreduce.InputFormat.class),
                        ByteBuffer.class, valueClass.getClass());

        JavaPairRDD<ByteBuffer, Column> pair = rdd.map(
                new PairFunction<Tuple2<ByteBuffer, TreeMap<ByteBuffer, Column>>,
                        ByteBuffer, Column>() {
            @Override
            public Tuple2<ByteBuffer, Column> call(
                    Tuple2<ByteBuffer, TreeMap<ByteBuffer, Column>> paramT)
                    throws Exception {
                // Print the row key, then each column name and value.
                System.out.println(ByteBufferUtil.string(paramT._1()));
                Set<ByteBuffer> keys = paramT._2().keySet();
                for (ByteBuffer key : keys) {
                    System.out.println("\t" + ByteBufferUtil.string(key));
                    Column col = paramT._2().get(key);
                    System.out.println("\t" + ByteBufferUtil.string(col.value()));
                }
                return null; // Add code
            }
        });

        pair.collect();

        System.out.println("Done.");
    }

    public static Tuple3<String, String, String> extractKey(String s) {
        // The regex was left undefined in the original post (Matcher m = null
        // would throw a NullPointerException); an Apache access-log style
        // pattern with groups (1)=ip, (3)=user, (5)=query is assumed here so
        // the method runs.
        Pattern p = Pattern.compile("^([\\d.]+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(.+?)\"");
        Matcher m = p.matcher(s);
        List<String> key = Collections.emptyList();
        if (m.find()) {
            String ip = m.group(1);
            String user = m.group(3);
            String query = m.group(5);
            if (!user.equalsIgnoreCase("-")) {
                return new Tuple3<String, String, String>(ip, user, query);
            }
        }
        return new Tuple3<String, String, String>(null, null, null);
    }

    public static ByteBuffer toByteBuffer(String value)
            throws UnsupportedEncodingException {
        if (value == null) {
            return null;
        }
        // Reconstructed ending (the archived message was truncated here):
        // standard UTF-8 encoding, as in the Cassandra wiki examples.
        return ByteBuffer.wrap(value.getBytes("UTF-8"));
    }
}
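For readers following along in Scala, a rough equivalent of the RDD
construction above (a sketch under the same assumptions: keyspace casDemo,
column family Words, a job configured as in the Java code, and sc being a
plain SparkContext):

    // Hypothetical Scala counterpart of the newAPIHadoopRDD call above.
    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[ColumnFamilyInputFormat],
      classOf[java.nio.ByteBuffer],
      classOf[java.util.SortedMap[java.nio.ByteBuffer, Column]])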

[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37082403
  
Merged build started.




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/spark/pull/86#discussion_r10406142
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class SparkSubmitArguments(args: Array[String]) {
--- End diff --

Please add a class comment to explain why this class exists and how it is
used by or relates to other classes.
A few months from now it will be much easier to understand how this class
fits into the overall picture by just reading the class summary than by
having to search for usages with an IDE in the source repo =)
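
For example, something along these lines (wording is just a suggestion):

    /**
     * Parses and encapsulates the arguments passed to the spark-submit
     * script, exposing them as fields consumed by SparkSubmit when it
     * builds the invocation of the child process.
     */
    private[spark] class SparkSubmitArguments(args: Array[String]) {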





[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/100#issuecomment-37084570
  
Will this also work on Java 8?




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/97#discussion_r10406957
  
--- Diff: python/pyspark/maxheapq.py ---
@@ -0,0 +1,115 @@
+# -*- coding: latin-1 -*-
+
+Heap queue algorithm (a.k.a. priority queue).
+
+# Original code by Kevin O'Connor, augmented by Tim Peters and Raymond 
Hettinger
--- End diff --

What license was this under? Not sure we can just include it.




[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/100#issuecomment-37086194
  
ah got it, thanks. so asm 3.x will be on the classpath whether we like it or
not, and we remove all other asm dependencies here, except for the kryo version.

will chill serialization still work this way? will it somehow find the kryo
asm?





[GitHub] spark pull request: Spark-1163, Added missing Python RDD functions

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/92#issuecomment-37086543
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13057/




[GitHub] spark pull request: Spark-1163, Added missing Python RDD functions

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/92#issuecomment-37086542
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37086581
  
 Merged build triggered.




[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37086647
  
 Merged build triggered.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37086649
  
 Merged build triggered.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37086650
  
Merged build started.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37086967
  
Thanks, merging this.




[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.

2014-03-07 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/100#issuecomment-37087149
  
@koertkuipers so I looked at chill and they don't use ASM except inside of
the ClosureCleaner (which they actually borrowed from Spark). Since we don't
use chill's ClosureCleaner, things should be alright at runtime. I did create
a PR for chill to do the same thing that we are doing in Spark [1], but we
don't depend on that for things to work.

[1] https://github.com/twitter/chill/pull/175




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/80




[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37087798
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37087805
  
 Merged build triggered.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37087803
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13060/




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37087802
  
Merged build finished.




[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/80#issuecomment-37087801
  
Merged build finished.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37087806
  
Merged build started.




[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37088861
  
 Merged build triggered.




[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37088862
  
Merged build started.




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37088856
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13061/




[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/97#issuecomment-37088855
  
Merged build finished.




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/86#issuecomment-37088913
  
Merged build started.




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/86#issuecomment-37088937
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13063/




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/86#issuecomment-37088936
  
Merged build finished.




[GitHub] spark pull request: SPARK-1126. spark-app preliminary

2014-03-07 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/86#issuecomment-37088962
  
Newest patch includes tests and doc.  @pwendell, do you have a link to the 
addJar patch?  If it's definitely going to happen, I'll take out the 
classloader stuff here.




[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls

2014-03-07 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/91#issuecomment-37089032
  
Upmerged




[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...

2014-03-07 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/96#issuecomment-37089290
  
@pwendell Regression test case added, also ensured that the old 
implementation fails on this test case.




[GitHub] spark pull request: SPARK-1064

2014-03-07 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/102

SPARK-1064

This reopens PR 649 from incubator-spark against the new repo.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-1064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #102


commit 552fc04009e15547b315ff8eabbec5c4b1659002
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-02-19T00:30:06Z

SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in 
Spark assembly

commit 4380ad5b24096f4977bd2d97ff3fde808da4660f
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-03-08T04:58:14Z

sbt change
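
A hedged sketch of the idea, inferred from the JIRA title rather than from
the patch itself: mark the Hadoop dependency as provided so the assembly no
longer bundles it and the cluster's own Hadoop jars are used instead.

    // Hypothetical sbt fragment; hadoopVersion is a placeholder.
    libraryDependencies +=
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"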






[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/101#issuecomment-37089804
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13062/




[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/91#issuecomment-37089817
  
Merged build started.




[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/91#issuecomment-37089816
  
 Merged build triggered.



