Re: when running the same job, the time that Spark used is very different from Shark.
So there are static costs associated with parsing the query and structuring the operators, but they should not be that large. Another difference is that in Shark all the data is passed through a parser, serialized, passed through the filter, and sent to the driver; in Spark the data is simply read as text, run through contains, and the result returned to the driver. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Mar 6, 2014 at 7:39 PM, qingyang li liqingyang1...@gmail.com wrote: Hi, community, I have set up a 3-node Spark cluster using standalone mode; each machine's memory is 16G and it has 4 cores. When I run val file = sc.textFile("/user/hive/warehouse/b/test.txt") file.filter(line => line.contains("2013-")).count() it costs 2.7s, but when I run select count(*) from b; using Shark, it costs 15.81s. So, why does Shark take more time than Spark? Other info: 1. I have set export SPARK_MEM=10g in shark-env.sh. 2. test.txt is 4.21G and exists in each machine's directory /user/hive/warehouse/b/, and test.txt has been loaded into memory. 3. There are 38532979 lines in test.txt.
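Mayur's point about per-row overhead can be sketched with a toy benchmark (plain Python, illustrative only; this models a parse-then-filter path, not Shark's actual execution engine):

```python
import time

# 100k synthetic rows; every row's first column starts with "2013-".
rows = ["2013-01-%02d,user%d,3.5" % (i % 28 + 1, i) for i in range(100000)]

# Spark-style path: a raw substring test on each line.
t0 = time.perf_counter()
spark_count = sum(1 for line in rows if "2013-" in line)
spark_time = time.perf_counter() - t0

# Shark-style path (toy model): parse each row into columns first,
# then apply the predicate -- the per-row parsing is the extra cost.
t0 = time.perf_counter()
shark_count = sum(1 for line in rows if "2013-" in line.split(",")[0])
shark_time = time.perf_counter() - t0

print(spark_count == shark_count)  # True: same answer, different per-row cost
```

Both paths return the same count; the parsed path simply does more work per row, which is the shape of the gap reported in the thread (though the real gap also includes serialization and query planning).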
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36976663 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36976664 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13039/
[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-36977010 Is MLI-2 not a good JIRA issue to use for this?
[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-36977058 Merged build started.
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-36977336 Merged build started.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/96 [SPARK-1194] Fix the same-RDD rule for cache replacement SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194 In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD that the block to be stored belongs to, cache eviction fails and aborts. In this PR, we keep selecting blocks *not* from the RDD that the block to be stored belongs to, until either enough free space can be ensured (cache eviction succeeds) or all such blocks have been checked (cache eviction fails). You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark fix-spark-1194 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/96.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #96 commit 62c92ac7b8e616529bdaa52b73eb70e50bc01b47 Author: Cheng Lian lian.cs@gmail.com Date: 2014-03-07T08:32:47Z Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194
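The selection loop this PR describes can be sketched as follows (a hypothetical Python model, not Spark's actual MemoryStore code; `select_blocks_to_evict` and its arguments are invented for illustration):

```python
def select_blocks_to_evict(blocks, space_needed, rdd_to_add):
    # blocks: insertion-ordered dict of block_id -> (rdd_id, size),
    # standing in for the LinkedHashMap iterated in MemoryStore.
    selected, freed = [], 0
    for block_id, (rdd_id, size) in blocks.items():
        if freed >= space_needed:
            break
        if rdd_to_add is not None and rdd_id == rdd_to_add:
            continue  # same-RDD rule: never evict blocks of the incoming RDD
        selected.append(block_id)
        freed += size
    # Succeed only if enough space was found; otherwise eviction fails.
    return selected if freed >= space_needed else None

blocks = {"rdd_1_0": (1, 40), "rdd_2_0": (2, 30), "rdd_2_1": (2, 30)}
print(select_blocks_to_evict(blocks, 50, rdd_to_add=1))  # ['rdd_2_0', 'rdd_2_1']
```

The key change versus the behavior the PR fixes: hitting a same-RDD block merely skips it and keeps scanning, rather than aborting the whole eviction.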
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
Hi Xiangrui, I think it doesn't matter whether we use Fortran/Breeze/RISO for the optimizer, since optimization only takes 1% of the time. Most of the time is in the gradientSum and lossSum parallel computation. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Thu, Mar 6, 2014 at 7:10 PM, Xiangrui Meng men...@gmail.com wrote: Hi DB, Thanks for doing the comparison! What were the running times for fortran/breeze/riso? Best, Xiangrui On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, I can converge to the same result with your breeze LBFGS and Fortran implementations now. Probably I made some mistakes when I tried breeze before; I apologize for claiming it's not stable. See the test case in BreezeLBFGSSuite.scala https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS This trains multinomial logistic regression against the iris dataset, and both optimizers can train the models with 98% training accuracy. There are two issues with using Breeze in Spark: 1) When the gradientSum and lossSum are computed distributively in a custom-defined DiffFunction which is passed into your optimizer, Spark complains that the LBFGS class is not serializable. In BreezeLBFGS.scala, I had to convert the RDD to an array to make it work locally. It should be easy to fix by just having LBFGS implement Serializable. 2) Breeze computes redundant gradients and losses. See the following log from both the Fortran and Breeze implementations. Thanks. 
Fortran: Iteration -1: loss 1.3862943611198926, diff 1.0 Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352 Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126 Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336 Iteration 3: loss 1.054036932835569, diff 0.03566113127440601 Iteration 4: loss 0.9907956302751622, diff 0.0507649459571 Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761 Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982 Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716 Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277 Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075 Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627 Breeze: Iteration -1: loss 1.3862943611198926, diff 1.0 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Iteration 0: loss 1.3862943611198926, diff 0.0 Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352 Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126 Iteration 3: loss 1.1242501524477688, diff 0.0 Iteration 4: loss 1.1242501524477688, diff 0.0 Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336 Iteration 6: loss 1.0930151243303563, diff 0.0 Iteration 7: loss 1.0930151243303563, diff 0.0 Iteration 8: loss 1.054036932835569, diff 0.03566113127440601 Iteration 9: loss 1.054036932835569, diff 0.0 Iteration 10: loss 1.054036932835569, diff 0.0 Iteration 11: loss 0.9907956302751622, diff 0.0507649459571 Iteration 12: loss 0.9907956302751622, diff 0.0 Iteration 13: loss 0.9907956302751622, diff 0.0 Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761 Iteration 15: loss 0.9184205380342829, diff 0.0 Iteration 16: loss 0.9184205380342829, diff 0.0 Iteration 
17: loss 0.8259870936519939, diff 0.1006438117513297 Iteration 18: loss 0.8259870936519939, diff 0.0 Iteration 19: loss 0.8259870936519939, diff 0.0 Iteration 20: loss 0.6327447552109576, diff 0.233952934583647 Iteration 21: loss 0.6327447552109576, diff 0.0 Iteration 22: loss 0.6327447552109576, diff 0.0 Iteration 23: loss 0.5534101162436362, diff 0.12538154276652747 Iteration 24: loss 0.5534101162436362, diff 0.0 Iteration 25: loss 0.5534101162436362, diff 0.0 Iteration 26: loss 0.40450200866125635, diff 0.2690732137675816 Iteration 27: loss 0.40450200866125635, diff 0.0 Iteration 28: loss 0.40450200866125635, diff 0.0 Iteration 29: loss 0.30788249908237314, diff 0.23885980452569502 Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote: On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote: I'm happy to help fix any problems. I've verified at points that the implementation gives the exact same sequence of iterates for a few different functions (with a particular line search) as the c port of lbfgs. So I'm a little surprised it fails where Fortran succeeds... but only a little. This was
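The repeated identical losses in the Breeze log above are consistent with the optimizer (for example, its line search) re-querying the objective at the same point. One common remedy is to memoize the last (loss, gradient) pair; a hedged sketch, not Breeze's or Spark's actual code:

```python
import numpy as np

class CachedDiffFunction:
    # Wraps an expensive (loss, gradient) function and caches the result at
    # the last queried point, so repeated evaluations there do not trigger
    # another expensive (in Spark: distributed) pass.
    def __init__(self, f):
        self.f = f
        self.calls = 0          # number of *real* evaluations
        self._last_x = None
        self._last_val = None

    def __call__(self, x):
        if self._last_x is not None and np.array_equal(x, self._last_x):
            return self._last_val      # cache hit: no recomputation
        self.calls += 1                # cache miss: expensive computation
        self._last_x = np.array(x, copy=True)
        self._last_val = self.f(x)
        return self._last_val

def loss_grad(x):
    return float(x @ x), 2 * x         # toy quadratic objective

f = CachedDiffFunction(loss_grad)
x = np.ones(3)
f(x); f(x); f(x)                       # three queries at the same point
print(f.calls)                         # 1: only one real evaluation
```

In the distributed setting this matters because each redundant evaluation of gradientSum/lossSum is a full pass over the data.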
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-36980467 Merged build started.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-36980466 Merged build triggered.
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-36980445 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13041/
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-36980547 Merged build triggered.
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-36980553 Merged build started.
[GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/18#issuecomment-36980443 Merged build finished.
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-36980442 Merged build finished.
[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/42#issuecomment-36983520 Build triggered.
Re: special case of custom partitioning
Thanks Mayur. Based on the doc comments in the source, it looks like this will work for the case; I will confirm. the dreamers of the day are dangerous men, for they may act their dream with open eyes, and make it possible On Fri, Mar 7, 2014 at 2:21 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: How about PartitionerAwareUnionRDD? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Mar 6, 2014 at 9:42 AM, Evan Chan e...@ooyala.com wrote: I would love to hear the answer to this as well. On Thu, Mar 6, 2014 at 4:09 AM, Manoj Awasthi awasthi.ma...@gmail.com wrote: Hi All, I have a three-machine cluster and two RDDs, each consisting of (K,V) pairs. The RDDs have just three keys: 'a', 'b' and 'c'. // list1 - List(('a',1), ('b',2), val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3)) // list2 - List(('a',2), ('b',7), val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3)) By using a HashPartitioner with 3 partitions I can ensure that each of the keys ('a', 'b' and 'c') in each RDD gets partitioned to a different machine on the cluster (based on the hashCode). The problem is that I cannot deterministically do the same allocation for the second RDD (all 'a's from rdd2 going to the same machine where the 'a's from the first RDD went). Is there a way to achieve this? Manoj -- -- Evan Chan Staff Engineer e...@ooyala.com
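The property Manoj is after, identical keys landing on identical partitions across RDDs, follows from partition assignment depending only on the key and the partitioner, not on which RDD the key came from. A minimal sketch (plain Python; the toy `partition_of` stands in for HashPartitioner's hashCode-modulo rule):

```python
NUM_PARTITIONS = 3

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for HashPartitioner: a deterministic key hash modulo the
    # partition count. (Python's built-in hash() is randomized per process,
    # so a fixed toy hash keeps the demo deterministic.)
    h = sum(ord(c) for c in key)
    return h % num_partitions

rdd1_keys = ["a", "b", "c"]
rdd2_keys = ["a", "b", "c"]

placement1 = {k: partition_of(k) for k in rdd1_keys}
placement2 = {k: partition_of(k) for k in rdd2_keys}
print(placement1 == placement2)  # True: same key -> same partition in both RDDs
```

The remaining operational question, which partition executes on which machine, is about co-location of partitions, which is what constructs like PartitionerAwareUnionRDD address.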
[GitHub] spark pull request: MLI-1 Decision Trees
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-37012316 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37013190 Merged build triggered.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
GitHub user ScrapCodes opened a pull request: https://github.com/apache/spark/pull/97 Spark 1162 Implemented takeOrdered in pyspark. Since Python does not have a library for a max heap, and the usual tricks like inverting values do not work in all cases, the best thing I could think of was to modify heapq itself. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ScrapCodes/spark-1 SPARK-1162/pyspark-top-takeOrdered2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/97.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #97 commit 3e7a57506ce139af804f89f16a3404624d784f7e Author: Prashant Sharma prashan...@imaginea.com Date: 2014-03-06T12:12:16Z Added top in python. commit 3bedad7dfe3b18ee9f64cc376627d3d7489a0e9f Author: Prashant Sharma prashan...@imaginea.com Date: 2014-03-07T10:35:31Z Added takeOrdered
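The semantics of takeOrdered can be sketched with heapq.nsmallest, which maintains a bounded heap internally; this is an illustrative model of the per-partition/merge structure, not the PR's modified-heapq implementation:

```python
import heapq

def take_ordered(partitions, num, key=None):
    # Per-partition step: nsmallest keeps a bounded heap of size num,
    # so memory stays O(num) even for large partitions.
    per_partition = [heapq.nsmallest(num, part, key=key) for part in partitions]
    # Driver-side step: merge the per-partition candidates and trim again.
    merged = [x for part in per_partition for x in part]
    return heapq.nsmallest(num, merged, key=key)

partitions = [[9, 1, 7], [4, 8, 2], [6, 3, 5]]
print(take_ordered(partitions, 3))                    # [1, 2, 3]
print(take_ordered(partitions, 3, key=lambda x: -x))  # [9, 8, 7]
```

Note how the key= parameter gives descending order without mutating the values, which is one way around the value-inversion trick the PR description mentions.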
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37013191 Merged build started.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37016128 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37016129 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13045/
[GitHub] spark pull request: Add timeout for fetch file
GitHub user guojc opened a pull request: https://github.com/apache/spark/pull/98 Add timeout for fetch file Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings; in this change, I change it to use spark.worker.timeout. This can be useful when the connection between workers is not perfect, and it prevents prematurely removing a task set. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guojc/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/98.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #98 commit 2a37c34b0f6399142f8bc093439e983313884eeb Author: Jiacheng Guo guoj...@gmail.com Date: 2014-03-07T15:24:05Z Add timeout for fetch file Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings; in this change, I change it to use spark.worker.timeout. This can be useful when the connection between workers is not perfect, and it prevents prematurely removing a task set.
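The intent of the patch can be sketched in Python (illustrative only; the helper names are invented and Spark's real fetch path is on the JVM): read the timeout from configuration with a fallback, and apply it explicitly to the fetch instead of inheriting the platform default.

```python
import socket
import urllib.request

def fetch_timeout_secs(conf, default_secs=60):
    # Read spark.worker.timeout (seconds) from a config dict, with a fallback,
    # rather than inheriting the JVM's default (often unbounded) socket timeouts.
    return int(conf.get("spark.worker.timeout", default_secs))

def fetch_file(url, conf):
    # Fail fast after the configured timeout instead of hanging indefinitely.
    timeout = fetch_timeout_secs(conf)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        raise RuntimeError("fetch of %s timed out after %ss" % (url, timeout))

print(fetch_timeout_secs({}))                             # 60
print(fetch_timeout_secs({"spark.worker.timeout": "5"}))  # 5
```

Failing fast here is what lets the caller retry or report the slow worker instead of the task set being removed prematurely.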
[GitHub] spark pull request: Add timeout for fetch file
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/98#issuecomment-37033983 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10386811 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") - return false + // Apply the same-RDD rule for cache replacement. Quoted from the + // original RDD paper: + // + // When a new RDD partition is computed but there is not enough + // space to store it, we evict a partition from the least recently + // accessed RDD, unless this is the same RDD as the one with the + // new partition. In that case, we keep the old partition in memory + // to prevent cycling partitions from the same RDD in and out. + // + // TODO implement LRU eviction --- End diff -- entries is already a LinkedHashMap, so you iterate in LRU order: you can remove the comment.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10388297 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") - return false + // Apply the same-RDD rule for cache replacement. Quoted from the + // original RDD paper: + // + // When a new RDD partition is computed but there is not enough + // space to store it, we evict a partition from the least recently + // accessed RDD, unless this is the same RDD as the one with the + // new partition. In that case, we keep the old partition in memory + // to prevent cycling partitions from the same RDD in and out. + // + // TODO implement LRU eviction --- End diff -- I see
Re: ALS solve.solvePositive
Hi Xiangrui, I used lambda = 0.1... It is possible that 2 users ranked movies in a very similar way... I agree that increasing lambda will solve the problem, but you'll agree this is not a solution: lambda should be tuned based on sparsity and other criteria, not to make a linearly dependent Hessian matrix linearly independent... Thanks. Deb On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote: If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should still be positive definite. What was the regularization constant (lambda) you set? Could you test whether the error still happens when you use a large lambda? Best, Xiangrui
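Xiangrui's point can be checked numerically: two users with identical ratings make A^T A rank deficient, while A^T A + lambda*I has every eigenvalue shifted up by lambda, so it is positive definite for any lambda > 0. A minimal numpy sketch (illustrative only, not the ALS solver code):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, 2.0],   # a second user with identical ratings
              [0.0, 0.0]])
gram = A.T @ A              # rank 1, so singular

def min_eigenvalue(M):
    return float(np.linalg.eigvalsh(M).min())

lam = 0.1
print(min_eigenvalue(gram) < 1e-8)               # True: effectively singular
print(min_eigenvalue(gram + lam * np.eye(2)) > 0.05)  # True: positive definite
```

This also illustrates Deb's objection: the shift fixes the numerics regardless of lambda's statistical suitability, which is why lambda chosen purely to repair a singular Hessian is not a principled tuning.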
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10388411 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") --- End diff -- Agree, thanks.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37040297 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13046/
[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/94#issuecomment-37040303 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13047/
[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/94#issuecomment-37040302 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37041120 Merged build started.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37041118 Merged build triggered.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10390021 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,23 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") - return false + // Apply the same-RDD rule for cache replacement. Quoted from the + // original RDD paper: + // + // When a new RDD partition is computed but there is not enough + // space to store it, we evict a partition from the least recently + // accessed RDD, unless this is the same RDD as the one with the + // new partition. In that case, we keep the old partition in memory + // to prevent cycling partitions from the same RDD in and out. + // + // TODO implement LRU eviction + rddToAdd match { + case Some(rddId) if rddId == getRddId(blockId) => --- End diff -- Made a mistake here, `rddId: Int == getRddId(blockId): Option[Int]` never holds...
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37046661 Merged build finished.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37046789 Merged build triggered.
[GitHub] spark pull request: Add timeout for fetch file
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/98#issuecomment-37052716 Merged build started.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37052691 Merged build finished.
[GitHub] spark pull request: Add timeout for fetch file
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/98#issuecomment-37052715 Merged build triggered.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37052692 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13049/
[GitHub] spark pull request: Add timeout for fetch file
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/98#issuecomment-37052776 @guojc hey I'm wondering - if the default is -1 (unlimited, no timeout), then why is it removing your task set due to failure? If there is no timeout, won't it just wait indefinitely until the connection comes back?
[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/94#issuecomment-37053143 LGTM thanks for improving the existing code here.
[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/99#issuecomment-37053200 Merged build triggered.
[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/99#issuecomment-37053201 Merged build started.
[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM
GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/99 SPARK-929: Fully deprecate usage of SPARK_MEM (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615) This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY. The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public. SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the spark.executor.memory property. SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched by spark-class, without possibly affecting executor memory. Other memory considerations: - The repl's memory can be set through the --drivermem command-line option, which really just sets SPARK_DRIVER_MEMORY. - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class). This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/aarondav/spark sparkmem Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/99.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #99 commit 9df4c68262dac0edde1ec5bdd1fd065d2bf34e00 Author: Aaron Davidson aa...@databricks.com Date: 2014-02-17T23:09:51Z SPARK-929: Fully deprecate usage of SPARK_MEM This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with case-specific variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY. The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public. SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the spark.executor.memory property. SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched by spark-class, without possibly affecting executor memory. Other memory considerations: - The repl's memory can be set through the --drivermem command-line option, which really just sets SPARK_DRIVER_MEMORY. - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class). This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options).
[GitHub] spark pull request: SPARK-1195: set map_input_file environment var...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/94#issuecomment-37053538 Thanks Tom, merged this into master.
[GitHub] spark pull request: Add timeout for fetch file
Github user guojc commented on the pull request: https://github.com/apache/spark/pull/98#issuecomment-37054016 I'm not sure about the behavior of the default -1; the Javadoc at http://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html#setReadTimeout%28int%29 says 0 is for infinity. But we do observe some connection errors related to the fetcher, and we want to set the value to a comfortable zone.
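For reference, the JDK behavior under discussion can be checked without any network traffic, since `URLConnection.openConnection()` does not actually connect; a minimal sketch (the URL is just a placeholder):

```java
import java.net.URL;
import java.net.URLConnection;

public class FetchTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // openConnection() only creates the connection object; no socket
        // is opened yet, so we can inspect and set timeouts offline.
        URLConnection conn = new URL("http://example.com/file").openConnection();

        // Per the URLConnection Javadoc, 0 (the default) means an infinite
        // timeout, and negative values are rejected with an
        // IllegalArgumentException.
        System.out.println(conn.getReadTimeout()); // 0

        conn.setConnectTimeout(10_000); // fail if connecting takes > 10 s
        conn.setReadTimeout(60_000);    // fail if a read stalls for > 60 s
        System.out.println(conn.getReadTimeout()); // 60000
    }
}
```

Note that because `setReadTimeout` rejects negative values, a configured default of -1 would have to be mapped to 0 (or skipped) before being passed to the connection.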
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37054161 @ScrapCodes I think the original scaladoc explains that this performs a shuffle, but you didn't copy this comment into any of the python/java docs. Would you mind adding that? It's sort of important because otherwise people could think this is a cheap operation.

```
/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * Note that this method performs a shuffle internally.
 */
```
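The deduplication the scaladoc describes can be illustrated outside Spark with plain Java collections; this sketch shows only the semantics, whereas in Spark the same operation additionally shuffles data across the network to co-locate equal elements:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IntersectionSemantics {
    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 1, 2, 3, 3);
        List<Integer> right = Arrays.asList(3, 3, 4, 1);

        // Like RDD.intersection, the result contains no duplicates even
        // though both inputs contain repeated elements.
        Set<Integer> result = new LinkedHashSet<>(left);
        result.retainAll(new LinkedHashSet<>(right));
        System.out.println(result); // [1, 3]
    }
}
```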
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10394468 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,18 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") - return false + // Apply the same-RDD rule for cache replacement. Quoted from the + // original RDD paper: + // + // When a new RDD partition is computed but there is not enough --- End diff -- Hey @liancheng I think it's okay to remove this quote. If you look at the scaladoc it already explains the intended policy wrt to partitions in the same RDD - so I think that is sufficient. The scaladoc says "which leads to a wasteful cyclic replacement pattern for RDDs that don't fit into memory that we want to avoid".
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/96#discussion_r10394826 --- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala --- @@ -236,13 +236,18 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long) while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) { val pair = iterator.next() val blockId = pair.getKey - if (rddToAdd.isDefined && rddToAdd == getRddId(blockId)) { - logInfo("Will not store " + blockIdToAdd + " as it would require dropping another " + - "block from the same RDD") - return false + // Apply the same-RDD rule for cache replacement. Quoted from the + // original RDD paper: + // + // When a new RDD partition is computed but there is not enough --- End diff -- Thanks, removed :)
[GitHub] spark pull request: [SPARK-1186] : Enrich the Spark Shell to suppo...
Github user berngp commented on the pull request: https://github.com/apache/spark/pull/84#issuecomment-37055758 @pwendell, @aarondav, @sryza a couple of questions. 1. Based on [SPARK-929], would it make sense to also include --spark-daemon-memory as an optional argument? 2. Should I rebase my changes taking into account [SPARK-929]? I assume I should. 3. It might make sense to have a ./bin/_functions.sh to share bash functions across scripts, mainly used by spark-shell and spark-submit (based on [SPARK-1126]); examples of such functions could be INFO, WARN, and ERROR messages.
Spark 0.9.0 and log4j
Hey guys, This is a follow-up to this semi-recent thread: http://apache-spark-developers-list.1001551.n3.nabble.com/0-9-0-forces-log4j-usage-td532.html 0.9.0 final is causing issues for us as well because we use Logback as our backend and Spark requires Log4j now. I see Patrick has PR #560 against incubator-spark; was that merged in or left out? Also, I see references to a new PR that might fix this, but I can't seem to find it on the GitHub open PR page. Anybody have a link? As a last resort we can switch to Log4j, but we would rather not do that if possible. thanks, Evan -- -- Evan Chan Staff Engineer e...@ooyala.com |
[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/42#issuecomment-37057167 Jenkins, retest this please
[GitHub] spark pull request: SPARK-929: Fully deprecate usage of SPARK_MEM
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/99#issuecomment-37058576 Merged build finished.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37058765 Merged build started.
[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/42#issuecomment-37058828 Build triggered.
[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/42#issuecomment-37058830 Build started.
[GitHub] spark pull request: [SPARK-1132] Persisting Web UI through refacto...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/42#issuecomment-37064296 Build finished.
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37064310 Merged build finished.
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/86#discussion_r10399046 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy + +import java.io.File +import java.net.URL +import java.net.URLClassLoader + +import scala.collection.mutable.ArrayBuffer + +object SparkSubmit { + val YARN = 1 + val STANDALONE = 2 + val MESOS = 4 + val LOCAL = 8 + val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL + + var clusterManager: Int = LOCAL + + def main(args: Array[String]) { +val appArgs = new SparkSubmitArguments(args) + +if (appArgs.master != null) { + if (appArgs.master.startsWith("yarn")) { +clusterManager = YARN + } else if (appArgs.master.startsWith("spark")) { +clusterManager = STANDALONE + } else if (appArgs.master.startsWith("mesos")) { +clusterManager = MESOS + } else if (appArgs.master.startsWith("local")) { +clusterManager = LOCAL + } else { +System.err.println("master must start with yarn, mesos, spark, or local") +System.exit(1) + } +} + +val deployOnCluster = appArgs.deployMode == "cluster" --- End diff -- On the other hand, it might make more sense to move towards consistency between yarn and standalone/mesos, for which MASTER only specifies the cluster manager, and not the application's deploy mode. For this, we would allow just giving --master to spark-submit as yarn, and yarn-client vs. yarn-standalone would be inferred depending on --deploy-mode.
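The prefix dispatch in the snippet above can be paraphrased as a standalone Java sketch (the names here are illustrative; SparkSubmit itself is Scala):

```java
public class ClusterManagerDispatch {
    enum ClusterManager { YARN, STANDALONE, MESOS, LOCAL }

    // Mirrors the SparkSubmit logic: the scheme prefix of the --master URL
    // selects the cluster manager.
    static ClusterManager fromMaster(String master) {
        if (master.startsWith("yarn")) return ClusterManager.YARN;
        if (master.startsWith("spark")) return ClusterManager.STANDALONE;
        if (master.startsWith("mesos")) return ClusterManager.MESOS;
        if (master.startsWith("local")) return ClusterManager.LOCAL;
        throw new IllegalArgumentException(
            "master must start with yarn, mesos, spark, or local: " + master);
    }

    public static void main(String[] args) {
        System.out.println(fromMaster("spark://host:7077")); // STANDALONE
        System.out.println(fromMaster("local[4]"));          // LOCAL
        System.out.println(fromMaster("yarn-client"));       // YARN
    }
}
```

Note that a prefix check like this is exactly why `yarn-client` and `yarn-standalone` both map to YARN, which is the ambiguity sryza's comment proposes resolving via --deploy-mode.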
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/100#issuecomment-37079388 Merged build triggered.
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
GitHub user pwendell opened a pull request: https://github.com/apache/spark/pull/100 SPARK-782 Clean up for ASM dependency. This makes two changes. 1) Spark uses the shaded version of asm that is (conveniently) published with Kryo. 2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`. This made all of the old rules not work with newer Hadoop versions that pull in new asm versions. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pwendell/spark asm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/100.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #100 commit 660420f6ee08e4cf17d515f517ea0561b1e9636c Author: Patrick Wendell pwend...@gmail.com Date: 2014-03-07T23:09:07Z SPARK-782 Clean up for ASM dependency. This makes two changes. 1) Spark uses the shaded version of asm that is (conveniently) published with Kryo. 2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`. This made all of the old rules not work with newer Hadoop versions that pull in new asm versions.
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/100#issuecomment-37079425 Come to think of it, we may want to stop excluding asm now since we don't directly use it anymore (therefore there can be no conflicts w/ Spark).
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/86#discussion_r10405655 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.deploy + +import java.io.File +import java.net.URL +import java.net.URLClassLoader + +import scala.collection.mutable.ArrayBuffer + +object SparkSubmit { + val YARN = 1 + val STANDALONE = 2 + val MESOS = 4 + val LOCAL = 8 + val ALL_CLUSTER_MGRS = YARN | STANDALONE | MESOS | LOCAL + + var clusterManager: Int = LOCAL + + def main(args: Array[String]) { +val appArgs = new SparkSubmitArguments(args) + +if (appArgs.master != null) { + if (appArgs.master.startsWith("yarn")) { +clusterManager = YARN + } else if (appArgs.master.startsWith("spark")) { +clusterManager = STANDALONE + } else if (appArgs.master.startsWith("mesos")) { +clusterManager = MESOS + } else if (appArgs.master.startsWith("local")) { +clusterManager = LOCAL + } else { +System.err.println("master must start with yarn, mesos, spark, or local") +System.exit(1) + } +} + +val deployOnCluster = appArgs.deployMode == "cluster" +val childClasspath = new ArrayBuffer[String]() +val childArgs = new ArrayBuffer[String]() +var childMainClass = "" + +if (clusterManager == MESOS && deployOnCluster) { + System.err.println("Mesos does not support running the driver on the cluster") + System.exit(1) +} + +if (deployOnCluster && clusterManager == STANDALONE) { + childMainClass = "org.apache.spark.deploy.Client" + childArgs += "launch" + childArgs += (appArgs.master, appArgs.primaryResource, appArgs.mainClass) +} else if (deployOnCluster && clusterManager == YARN) { + childMainClass = "org.apache.spark.deploy.yarn.Client" + childArgs += ("--jar", appArgs.primaryResource) + childArgs += ("--class", appArgs.mainClass) +} else { + childMainClass = appArgs.mainClass + childClasspath += appArgs.primaryResource +} + +val options = List[OptionAssigner]( + new OptionAssigner(appArgs.driverMemory, YARN, true, clOption = "--master-memory"), + new OptionAssigner(appArgs.name, YARN, true, clOption = "--name"), + new OptionAssigner(appArgs.queue, YARN, true, clOption = "--queue"), + new OptionAssigner(appArgs.queue, YARN, false, sysProp = "spark.yarn.queue"), + new OptionAssigner(appArgs.numExecutors, YARN, true, clOption = "--num-workers"), + new OptionAssigner(appArgs.numExecutors, YARN, false, sysProp = "spark.worker.instances"), + new OptionAssigner(appArgs.executorMemory, YARN, false, clOption = "--worker-memory"), + new OptionAssigner(appArgs.executorMemory, STANDALONE, true, clOption = "--memory"), + new OptionAssigner(appArgs.executorMemory, STANDALONE | MESOS | YARN, false, sysProp = "spark.executor.memory"), + new OptionAssigner(appArgs.executorCores, YARN, true, clOption = "--worker-cores"), + new OptionAssigner(appArgs.executorCores, STANDALONE, true, clOption = "--cores"), + new OptionAssigner(appArgs.executorCores, STANDALONE | MESOS | YARN, false, sysProp = "spark.cores.max"), + new OptionAssigner(appArgs.files, YARN, false, sysProp = "spark.yarn.dist.files"), + new OptionAssigner(appArgs.files, YARN, true, clOption = "--files"), + new OptionAssigner(appArgs.archives, YARN, false, sysProp = "spark.yarn.dist.archives"), + new OptionAssigner(appArgs.archives, YARN, true, clOption = "--archives"), + new OptionAssigner(appArgs.moreJars, YARN, true, clOption = "--addJars") +) + +// more jars +if (appArgs.moreJars != null && !deployOnCluster) { + childClasspath += appArgs.moreJars +}
Java Cassandra Test example
import java.io.IOException;
import java.io.Serializable;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;

import org.apache.cassandra.db.Column;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;
import scala.Tuple3;

public class CassandraSparkConnectionTest implements Serializable {

  public static void main(String[] args) throws IOException {
    new CassandraSparkConnectionTest().process();
  }

  @SuppressWarnings({ "unchecked", "serial" })
  public void process() throws IOException {
    String host = "localhost";
    String port = "9160";
    JavaSparkContext sparkContext = new JavaSparkContext("local", "cassandraSparkConnectionTest",
        System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(CassandraSparkConnectivity.class));
    Job job = new Job();
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), host);
    ConfigHelper.setInputRpcPort(job.getConfiguration(), port);
    ConfigHelper.setOutputInitialAddress(job.getConfiguration(), host);
    ConfigHelper.setOutputRpcPort(job.getConfiguration(), port);
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), "casDemo", "Words");
    // ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "casDemo", "WordCount");
    ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
    // ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
    SlicePredicate predicate = new SlicePredicate();
    SliceRange sliceRange = new SliceRange(toByteBuffer(""), toByteBuffer(""), false, 20);
    predicate.setSlice_range(sliceRange);
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    Map<ByteBuffer, Column> valueClass = new TreeMap<ByteBuffer, Column>();
    JavaPairRDD<ByteBuffer, TreeMap<ByteBuffer, Column>> rdd = sparkContext
        .newAPIHadoopRDD(job.getConfiguration(),
            ColumnFamilyInputFormat.class.asSubclass(org.apache.hadoop.mapreduce.InputFormat.class),
            ByteBuffer.class, valueClass.getClass());
    JavaPairRDD<ByteBuffer, Column> pair = rdd.map(
        new PairFunction<Tuple2<ByteBuffer, TreeMap<ByteBuffer, Column>>, ByteBuffer, Column>() {
          @Override
          public Tuple2<ByteBuffer, Column> call(
              Tuple2<ByteBuffer, TreeMap<ByteBuffer, Column>> paramT) throws Exception {
            System.out.println(ByteBufferUtil.string(paramT._1()));
            Set<ByteBuffer> keys = paramT._2().keySet();
            for (ByteBuffer key : keys) {
              System.out.println("\t" + ByteBufferUtil.string(key));
              Column col = paramT._2().get(key);
              System.out.println("\t" + ByteBufferUtil.string(col.value()));
            }
            return null; // Add code
          }
        });
    pair.collect();
    System.out.println("Done.");
  }

  public static Tuple3<String, String, String> extractKey(String s) {
    Matcher m = null; // BUG in the original: m is never initialized, so m.find() below will throw a NullPointerException
    List<String> key = Collections.emptyList();
    if (m.find()) {
      String ip = m.group(1);
      String user = m.group(3);
      String query = m.group(5);
      if (!user.equalsIgnoreCase("-")) {
        return new Tuple3<String, String, String>(ip, user, query);
      }
    }
    return new Tuple3<String, String, String>(null, null, null);
  }

  public static ByteBuffer toByteBuffer(String value) throws UnsupportedEncodingException {
    if (value == null) {
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37082403 Merged build started.
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/spark/pull/86#discussion_r10406142

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import scala.collection.mutable.ArrayBuffer
+
+private[spark] class SparkSubmitArguments(args: Array[String]) {
--- End diff --

Please add a class comment to explain why this class exists and how it is used or relates to other classes. A few months from now it will make it easier to immediately understand how this class fits into the overall picture by just looking at the summary of the class, rather than having to search for usages with an IDE in the source repo =)
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/100#issuecomment-37084570 Will this also work on Java 8?
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/97#discussion_r10406957

--- Diff: python/pyspark/maxheapq.py ---
@@ -0,0 +1,115 @@
+# -*- coding: latin-1 -*-
+
+Heap queue algorithm (a.k.a. priority queue).
+
+# Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger
--- End diff --

What license was this under? Not sure we can just include it.
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/100#issuecomment-37086194 Ah, got it, thanks. So asm 3.x will be on the classpath whether we like it or not, and we remove all other asm dependencies here, except for a kryo version. Will chill serialization still work this way? Will it somehow find the kryo asm?
[GitHub] spark pull request: Spark-1163, Added missing Python RDD functions
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/92#issuecomment-37086543 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13057/
[GitHub] spark pull request: Spark-1163, Added missing Python RDD functions
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/92#issuecomment-37086542 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37086581 Merged build triggered.
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37086647 Merged build triggered.
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37086649 Merged build triggered.
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37086650 Merged build started.
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37086967 Thanks, merging this.
[GitHub] spark pull request: SPARK-782 Clean up for ASM dependency.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/100#issuecomment-37087149 @koertkuipers so I looked at chill and they don't use ASM except inside of the ClosureCleaner (which they actually borrowed from Spark). Since we don't use chill's ClosureCleaner, things should be alright at runtime. I did create a PR for chill to do the same thing that we are doing in Spark [1], but we don't depend on that for things to work. [1] https://github.com/twitter/chill/pull/175
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/80
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37087798 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37087805 Merged build triggered.
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37087803 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13060/
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37087802 Merged build finished.
[GitHub] spark pull request: Spark 1165 rdd.intersection in python and java
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/80#issuecomment-37087801 Merged build finished.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37087806 Merged build started.
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37088861 Merged build triggered.
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37088862 Merged build started.
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37088856 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13061/
[GitHub] spark pull request: Spark 1162 Implemented takeOrdered in pyspark.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/97#issuecomment-37088855 Merged build finished.
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/86#issuecomment-37088913 Merged build started.
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/86#issuecomment-37088937 One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13063/
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/86#issuecomment-37088936 Merged build finished.
[GitHub] spark pull request: SPARK-1126. spark-app preliminary
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/86#issuecomment-37088962 Newest patch includes tests and doc. @pwendell, do you have a link to the addJar patch? If it's definitely going to happen, I'll take out the classloader stuff here.
[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/91#issuecomment-37089032 Upmerged
[GitHub] spark pull request: [SPARK-1194] Fix the same-RDD rule for cache r...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/96#issuecomment-37089290 @pwendell Regression test case added, also ensured that the old implementation fails on this test case.
[GitHub] spark pull request: SPARK-1064
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/102

SPARK-1064

This reopens PR 649 from incubator-spark against the new repo

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-1064

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/102.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #102

commit 552fc04009e15547b315ff8eabbec5c4b1659002
Author: Sandy Ryza sa...@cloudera.com
Date: 2014-02-19T00:30:06Z

    SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly

commit 4380ad5b24096f4977bd2d97ff3fde808da4660f
Author: Sandy Ryza sa...@cloudera.com
Date: 2014-03-08T04:58:14Z

    sbt change
[GitHub] spark pull request: SPARK-1128: set hadoop task properties when co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/101#issuecomment-37089804 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13062/
[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/91#issuecomment-37089817 Merged build started.
[GitHub] spark pull request: SPARK-1193. Fix indentation in pom.xmls
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/91#issuecomment-37089816 Merged build triggered.