[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread tdas
GitHub user tdas opened a pull request:

https://github.com/apache/spark/pull/844

Updated scripts for auditing releases

- Added script to automatically generate change list CHANGES.txt
- Added test for verifying linking against maven distributions of 
`spark-sql` and `spark-hive`
- Added SBT projects for testing functionality of `spark-sql` and 
`spark-hive` 
- Fixed issues in existing tests that might have come up because of changes 
in Spark 1.0


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tdas/spark update-dev-scripts

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/844.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #844


commit e2e20b3fc4c5ca390e8c19f18f7a798c4b4b96c3
Author: Tathagata Das tathagata.das1...@gmail.com
Date:   2014-05-21T07:02:15Z

Updated tests for auditing releases.

commit 25090ba86833a38726b0bf00474929bdf90e8ac4
Author: Tathagata Das tathagata.das1...@gmail.com
Date:   2014-05-21T07:10:03Z

Added missing license




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43720269
  
@pwendell




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43720431
  
Merged build started. 




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43720416
  
 Merged build triggered. 




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43723792
  
LGTM - thanks TD this is great! Having SQL and Hive modules in there is 
awesome.




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43726777
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15114/




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43726776
  
Merged build finished. 




[GitHub] spark pull request: [Docs] Correct example of creating a new Spark...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/842#issuecomment-43726780
  
Thanks. I've merged this into master & branch-1.0.




[GitHub] spark pull request: [SPARK-1250] Fixed misleading comments in bin/...

2014-05-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/843




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43726813
  
Jenkins, test this again.




[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/841#issuecomment-43726915
  
He's on vacation this week so it might take a while for him to get back :)




[GitHub] spark pull request: [Minor] Move JdbcRDDSuite to the correct packa...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/839#issuecomment-43726958
  
Thanks. I've merged this into master & branch-1.0.




[GitHub] spark pull request: [Docs] Correct example of creating a new Spark...

2014-05-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/842




[GitHub] spark pull request: [SPARK-1880] [SQL] Eliminate unnecessary job e...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/825#issuecomment-43727197
  
We can easily add right outer join support to the hash join though. In 
general, the nested loop join performs very unfavorably compared with a hash 
join implementation. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43727297
  
Jenkins, add to whitelist.




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43727320
  
 Merged build triggered. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43727343
  
Merged build started. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43727532
  
Merged build finished. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43727534
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15115/




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43729433
  
 Merged build triggered. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43729450
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43731406
  
 Build triggered. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43731416
  
Build started. 




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43733545
  
`ensureFreeSpace` has 2 jobs: 1) iterate over the entries and select the blocks
to be dropped; 2) if the to-be-dropped blocks can free enough space, mark them
as dropping and return them to the caller.
`ensureFreeSpace` is called within putLock, so each thread will see the
dropping flag modifications (I will discuss flag resetting in exception handling
later) and thus get different to-be-dropped blocks. Block reading doesn't
need the dropping flag, so there is no conflict there. Let's consider block
removing and exception handling (resetting the dropping flag).
Job 1 of `ensureFreeSpace` (selecting) and removing are both synchronized on
`entries`, so they must run in turn.
If a block is removed first, then everything is OK.
If a block is removed after Job 2 of `ensureFreeSpace` (marking), which is
also synchronized on `entries` (in my modification), then the block will be
dropped to disk and managed by diskStore, which I think is OK.
If a block is removed between selecting and marking, the marking will check
whether the entry is null, so it's OK, too.
About exception handling: flag resetting is also synchronized on `entries`,
so it won't run during selecting or marking.
If resetting happens before selecting, then selecting will be able to
select these blocks and re-drop them.
If resetting happens after selecting, the selected to-be-dropped blocks won't
include the reset blocks, so there is no conflict.
Actually there are 3 places that write or read the dropping flag (selecting,
marking and resetting) and they are all synchronized on `entries`, so I think
we don't need to define the flag as volatile.
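
The three phases described above (selecting, marking, resetting) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the PR: `Entry`, `DropSketch`, `select`, `mark` and `reset` are hypothetical names, a plain `LinkedHashMap` stands in for the real `entries` map, and the outer putLock is left out. The point it shows is that every read or write of the `dropping` flag happens while holding the `entries` monitor.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a MemoryStore entry; only size and the dropping flag matter here.
final class Entry(val size: Long) { var dropping: Boolean = false }

final class DropSketch(maxMemory: Long) {
  // Insertion-ordered map; every read or write of `dropping` below holds its monitor.
  private val entries = mutable.LinkedHashMap.empty[String, Entry]

  def put(id: String, size: Long): Unit = entries.synchronized { entries(id) = new Entry(size) }
  def remove(id: String): Unit = entries.synchronized { entries.remove(id); () }

  // Job 1 (selecting): skip blocks already flagged, stop once enough space would be freed.
  def select(space: Long): Seq[String] = entries.synchronized {
    val currentMemory = entries.valuesIterator.map(_.size).sum
    val selected = mutable.ArrayBuffer.empty[String]
    var selectedMemory = 0L
    val it = entries.iterator
    while (maxMemory - (currentMemory - selectedMemory) < space && it.hasNext) {
      val (id, entry) = it.next()
      if (!entry.dropping) { selected += id; selectedMemory += entry.size }
    }
    if (maxMemory - (currentMemory - selectedMemory) >= space) selected.toSeq else Seq.empty
  }

  // Job 2 (marking): a block removed since selection is simply skipped.
  def mark(ids: Seq[String]): Seq[String] = entries.synchronized {
    ids.filter { id =>
      entries.get(id) match {
        case Some(entry) =>
          entry.dropping = true
          true
        case None => false
      }
    }
  }

  // Exception path: clear the flag so a later select can pick the blocks again.
  def reset(ids: Seq[String]): Unit = entries.synchronized {
    ids.foreach(id => entries.get(id).foreach(entry => entry.dropping = false))
  }
}
```

Because the flag is only touched inside `entries.synchronized`, the monitor's release/acquire ordering already makes writes visible to the next thread that takes the lock, which is the reasoning behind not declaring the flag volatile.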




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43734386
  
Build finished. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43734389
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15117/




[GitHub] spark pull request: [SPARK-1820][tools] Make GenerateMimaIgnore @D...

2014-05-21 Thread nikhils05
GitHub user nikhils05 opened a pull request:

https://github.com/apache/spark/pull/845

 [SPARK-1820][tools] Make GenerateMimaIgnore @DeveloperApi annotation aware.

Solution for :  Add all the classes with DeveloperApi annotation in Mima 
excludes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nikhils05/spark tools

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #845


commit f495620d0f11e34dee67853dd0912fe20602d24d
Author: nikhil7sh nikhilsharmalnm...@gmail.ccom
Date:   2014-05-19T06:04:02Z

(SPARK-1820) Make GenerateMimaIgnore @DeveloperApi annotation aware

commit 6a7201b3bdbf917ea0054049eeaded13bfcbfd72
Author: nikhil7sh nikhilsharmalnm...@gmail.ccom
Date:   2014-05-21T09:16:10Z

[SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware

commit 8fa02d2c67f16556d7477f515603d472d2679a21
Author: nikhil7sh nikhilsharmalnm...@gmail.ccom
Date:   2014-05-21T09:20:04Z

[SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware






[GitHub] spark pull request: [SPARK-1820][tools] Make GenerateMimaIgnore @D...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/845#issuecomment-43734997
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/791#discussion_r12888619
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -243,10 +250,13 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
         val iterator = entries.entrySet().iterator()
         while (maxMemory - (currentMemory - selectedMemory) < space && iterator.hasNext) {
           val pair = iterator.next()
-          val blockId = pair.getKey
-          if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
-            selectedBlocks += blockId
-            selectedMemory += pair.getValue.size
+          val entry = pair.getValue
+          if (!entry.dropping) {
+            val blockId = pair.getKey
+            if (rddToAdd.isEmpty || rddToAdd != getRddId(blockId)) {
+              selectedBlocks += blockId
+              selectedMemory += entry.size
--- End diff --

As mentioned in comments, I misread the variable - this is correct.




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43736593
  

- With the latest commit, the issue with the dropping flag is gone, which is
great.

- There is a change of behavior w.r.t. the earlier code.
Whether the earlier code was the way it was intentionally or accidentally,
I am not sure - will let @mateiz or others comment.

Essentially there are a few things here:

a) What happens if an existing block is re-added? Looks like this was probably
handled earlier also?
I went up the call tree a bit, and it did not look like this was prevented,
but maybe I missed it. Any comments @mateiz?

b) What happens if the same block is added in parallel by two threads?
If this was a supported use case, then the current PR breaks it: it is
possible for the first thread to add it, and the second to evict it from memory
in case it was not possible to host both copies in memory (according to the
free space computed).





[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43737610
  
Merged build finished. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43737611
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15116/




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43738827
  
@mridulm I checked the code of BlockManager#doPut. 

    val putBlockInfo = {
      val tinfo = new BlockInfo(level, tellMaster)
      // Do atomically !
      val oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)

      if (oldBlockOpt.isDefined) {
        if (oldBlockOpt.get.waitForReady()) {
          logWarning("Block " + blockId + " already exists on this machine; not re-adding it")
          return updatedBlocks
        }

        // TODO: So the block info exists - but previous attempt to load it (?) failed.
        // What do we do now ? Retry on it ?
        oldBlockOpt.get
      } else {
        tinfo
      }
    }

BlockManager will create a BlockInfo for the block to be added and call `val
oldBlockOpt = blockInfo.putIfAbsent(blockId, tinfo)`, so if multiple threads are
adding the same block, one thread will put the BlockInfo successfully and the
others will fail and stop putting.
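
The "only one writer wins" behaviour of `putIfAbsent` can be seen in isolation with a small sketch. It assumes a `scala.collection.concurrent.TrieMap` rather than the map type BlockManager actually uses, and `Info`, `PutOnce`, `tryRegister` and `ownerOf` are hypothetical names; the `waitForReady()` step from the quoted code is deliberately left out.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical, simplified stand-in for BlockInfo.
final case class Info(owner: String)

object PutOnce {
  private val blockInfo = TrieMap.empty[String, Info]

  // putIfAbsent returns None only for the caller whose value was actually stored,
  // so exactly one caller per blockId sees `true` here.
  def tryRegister(blockId: String, owner: String): Boolean =
    blockInfo.putIfAbsent(blockId, Info(owner)).isEmpty

  def ownerOf(blockId: String): Option[String] = blockInfo.get(blockId).map(_.owner)
}
```

A second `tryRegister` for the same id returns `false` and leaves the first registration untouched, which mirrors the branch in the quoted code that logs the warning and returns early.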




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43739167
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1880] [SQL] Eliminate unnecessary job e...

2014-05-21 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/825#issuecomment-43741634
  
@rxin Ah, you mean that we should add right/full outer join support in
addition to #734?
I agree about the unfavorable performance of the nested loop join, so we
should wait for that to be merged and then add right/full outer join support
in a separate issue.




[GitHub] spark pull request: [SPARK-1880] [SQL] Eliminate unnecessary job e...

2014-05-21 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/825#issuecomment-43742049
  
@rxin BTW, speaking of performance, could you please review the code in #836?
I think it is something of a blocker for the join strategy.




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43742247
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43742248
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15118/




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43744004
  
This seems really promising!! However, can you explain whether the 
following sequence of events is possible or not in `ensureFreeSpace`?

Both thread 1 and thread 2 want to insert blocks of 100 bytes. Existing
blocks include block A and block B of 100 bytes each, and the total capacity is
200 bytes. Next,

- Thread 1 selects block A (not marked yet) and exits the
`entries.synchronized { // select }`
- Thread 2 selects block A as well (not marked yet) and exits the
`entries.synchronized { // select }`
- Thread 1 enters `entries.synchronized { // mark }` and marks block A to
be dropped
- Thread 2 also enters `entries.synchronized { // mark }` and marks block
A to be dropped again (this seems to be possible since there is no double check
to see whether each block has already been marked or not)
- Thread 1 then drops block A to disk
- Thread 2 tries to drop block A to disk as well, but since it is already
dropped, no more action is taken.
- Both threads think that 100 bytes have been cleared. Hence 2 x 100 bytes
are inserted after dropping only 100 bytes.

Is this sequence possible?
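
The interleaving above can be replayed deterministically on one thread by calling the two phases in the order listed. The sketch below is an illustration only: `Block`, `RaceReplay`, `select`, `markBlindly` and `markIfUnclaimed` are hypothetical names, and whether the PR's marking step behaves like the blind variant is exactly the open question here.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a memory entry with a dropping flag.
final class Block(val size: Long) { var dropping: Boolean = false }

final class RaceReplay {
  private val entries =
    mutable.LinkedHashMap("A" -> new Block(100L), "B" -> new Block(100L))

  // Select phase: first block that is not flagged yet.
  def select(): Option[String] = entries.synchronized {
    entries.collectFirst { case (id, block) if !block.dropping => id }
  }

  // Mark phase without a double check: succeeds even if someone else marked first.
  def markBlindly(id: String): Boolean = entries.synchronized {
    entries.get(id).exists { block => block.dropping = true; true }
  }

  // Mark phase with a double check: only the first caller claims the block.
  def markIfUnclaimed(id: String): Boolean = entries.synchronized {
    entries.get(id).exists { block =>
      if (block.dropping) false else { block.dropping = true; true }
    }
  }
}
```

Calling `select()` twice before any mark returns block A both times. With `markBlindly` both callers then believe they own A, while with `markIfUnclaimed` the second caller gets `false` and would have to go back and select again.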





[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43744225
  
@cloud-fan there are multiple calls to memoryStore to directly put a block 
- not just from external addition.
So looking at only doPut might not help ?




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43744304
  
@tdas there is a dropping flag which prevents this.
Or did I misunderstand your query ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43744657
  
@tdas yes - thread 1 should set A's dropping to true; so thread 2 should 
not select it




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43744797
  
Is that so? Since selection and marking occur in different 
`entries.synchronized` blocks, selection and marking are not atomic together. 
So two threads can select the same block before either of them marks it.




[GitHub] spark pull request: Updated scripts for auditing releases

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/844#issuecomment-43744964
  
Jenkins, retest this.




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43749375
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43749356
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43751967
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43751952
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43752165
  
@tdas you missed an important thing. `tryToPut` calls `ensureFreeSpace` 
within the `putLock`, so one thread has to wait until the other has finished 
both selecting and marking.
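
The locking argument in this comment can be sketched as follows. This is a minimal illustration, not the actual `MemoryStore` code: `ToyStore`, `Entry`, and the field layout are hypothetical, and only `putLock`, `entries`, `tryToPut`, and `ensureFreeSpace` are names taken from the discussion. Selection and marking sit in two separate `entries.synchronized` blocks, yet no second thread can run between them because the caller already holds `putLock`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a cached block; `dropping` marks it as being evicted.
final class Entry(val size: Long) { var dropping: Boolean = false }

class ToyStore(maxMemory: Long) {
  private val putLock = new Object
  private val entries = mutable.LinkedHashMap.empty[String, Entry]
  private var currentMemory = 0L

  // Assumption: only ever called while the caller holds putLock.
  private def ensureFreeSpace(space: Long): Seq[String] = {
    // Step 1: select victims (first synchronized block).
    val selected = entries.synchronized {
      var freed = 0L
      val picked = mutable.ArrayBuffer.empty[String]
      val it = entries.iterator
      while (maxMemory - (currentMemory - freed) < space && it.hasNext) {
        val (id, e) = it.next()
        if (!e.dropping) { picked += id; freed += e.size }
      }
      picked.toList
    }
    // Step 2: mark victims (second synchronized block).
    entries.synchronized { selected.foreach(id => entries(id).dropping = true) }
    selected
  }

  // The outer putLock serializes the whole select-mark-drop-insert sequence.
  def tryToPut(id: String, size: Long): Boolean = putLock.synchronized {
    if (size > maxMemory) false
    else {
      val victims = ensureFreeSpace(size)
      entries.synchronized {
        victims.foreach { v => currentMemory -= entries(v).size; entries.remove(v) }
        if (maxMemory - currentMemory >= size) {
          entries(id) = new Entry(size); currentMemory += size; true
        } else false
      }
    }
  }

  def contains(id: String): Boolean = entries.synchronized(entries.contains(id))
  def used: Long = entries.synchronized(currentMemory)
}
```

If `ensureFreeSpace` were ever called without `putLock`, two threads could both finish step 1 before either reached step 2, which is the interleaving asked about earlier in the thread.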




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43753931
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15119/




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43753928
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43756438
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1776] Have Spark's SBT build read depen...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/772#issuecomment-43756441
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15120/




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43756945
  
@tdas as @cloud-fan stated, the code relies on the implementation detail that 
the private method is always called while holding the tryToPut lock, and is not 
called by anyone else. I don't like the fact that we have locking state spread 
out like this, but then this is how it already was, I guess ...
Maybe we should at least annotate the method? And possibly assert that it 
is within the tryToPut lock?
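
One way to turn that implicit contract into a checked one is a runtime assertion. A minimal sketch, assuming the lock is a plain JVM monitor object (as the `putLock.synchronized` usage in this thread suggests); `Thread.holdsLock` is a standard `java.lang.Thread` method, while `GuardedStore`, `unsafeCall`, and the method bodies are hypothetical:

```scala
class GuardedStore {
  private val putLock = new Object

  // Contract: callers must already hold putLock.
  private def ensureFreeSpace(space: Long): Boolean = {
    assert(Thread.holdsLock(putLock),
      "ensureFreeSpace must be called while holding putLock")
    true // placeholder for the real select-and-mark logic
  }

  // Correct caller: takes the lock before delegating.
  def tryToPut(size: Long): Boolean = putLock.synchronized { ensureFreeSpace(size) }

  // Deliberately wrong caller, kept only to show the assertion firing.
  def unsafeCall(size: Long): Boolean = ensureFreeSpace(size)
}
```

With this in place, a future caller that forgets the lock fails fast with an `AssertionError` instead of silently racing.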




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43775013
  
 Merged build triggered. 




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43775032
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43778259
  
@tdas @mridulm what about moving the `putLock.synchronized` into 
`ensureFreeSpace` and letting `tryToPut` call `ensureFreeSpace` directly? I think 
it will be clearer this way.
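
A rough sketch of that refactoring, under stated assumptions: `RefactoredStore` and its fields are hypothetical, and the thread does not say whether `tryToPut` keeps its own lock. This version keeps it so that the check and the update stay atomic, and relies on JVM monitors being reentrant.

```scala
class RefactoredStore(maxMemory: Long) {
  private val putLock = new Object
  private var used = 0L

  // The lock now lives next to the logic it protects.
  private def ensureFreeSpace(space: Long): Boolean = putLock.synchronized {
    maxMemory - used >= space
  }

  // JVM monitors are reentrant, so holding putLock here and re-acquiring it
  // inside ensureFreeSpace does not deadlock.
  def tryToPut(size: Long): Boolean = putLock.synchronized {
    if (ensureFreeSpace(size)) { used += size; true } else false
  }
}
```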




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43783643
  
@cloud-fan makes more sense.
Also, please rename it to something more appropriate (since it is no 
longer trying to put within that block!)

@tdas, can you also comment on the use cases/flows I mentioned above?




[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

2014-05-21 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/841#issuecomment-43786603
  
@rxin thanks for the heads up. I would appreciate help from anyone in burning 
down my open PRs, the oldest being over a month old.




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43788086
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15121/




[GitHub] spark pull request: add support for left semi join

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/837#issuecomment-43788084
  
Merged build finished. 




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-21 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43790944
  
@mateiz, @mengxr 
I am using [the code](https://github.com/witgo/spark/compare/cachePoint) to 
test ALS.
A brief description of the test:

| Item | Description |
| --- | --- |
| cluster | `3 servers`, `36 core cpus`, `2.5T HDD`, `120G memory` |
| data | `700 million` |
| code | `val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)` |
| time | `12.5 h` |
| shuffle write | `4.72T` |
| largest local dir | `200G` |




[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/791#issuecomment-43795376
  
@cloud-fan @mridulm Aaah, I get it now. I knew I was missing something! I 
agree with @mridulm that this is a tricky lock structure and needs to be 
cleaner. Putting the `putLock.synchronized` inside `ensureFreeSpace` is 
definitely better, as it co-locates the important locks in the code, so they 
are less likely to be missed, as I did. Maybe rename the lock to 
`ensureFreeSpaceLock`? Or how about synchronizing on `this` (that is, 
`def ensureFreeSpace(...): ReturnType = synchronized { ... }`)?

Also, please add a few more lines to the scaladoc of `ensureFreeSpace` 
explaining this lock structure and the high-level selection and marking 
steps. You could give the higher-level flow (select, mark, drop, exception 
handling) in the scaladoc of `MemoryStore`.

On a related note, have you run any long, rigorous test on this to make 
sure that 
(1) this new lock structure is not accidentally causing deadlocks (this has 
happened before and was found only by running a long test)?
(2) the memory limit is maintained at all times (to catch any race 
condition like the one I suggested, even if only remotely possible)?
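
As a starting point for such a test, here is a small concurrent check. It is only a sketch under stated assumptions: `BoundedStore` and `StressCheck` are hypothetical stand-ins rather than Spark classes, the store admits or rejects without evicting, and a real run would use far more threads, operations, and time. It covers both questions in miniature: a timeout on the latch turns a deadlock into a failure rather than a hang, and every successful put checks that the memory limit still holds.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical store: no eviction, just admit or reject under one lock.
class BoundedStore(val maxMemory: Long) {
  private val lock = new Object
  private var used = 0L
  def tryToPut(size: Long): Boolean = lock.synchronized {
    if (maxMemory - used >= size) { used += size; true } else false
  }
  def release(size: Long): Unit = lock.synchronized { used -= size }
  def currentMemory: Long = lock.synchronized(used)
}

object StressCheck {
  // True if all workers finished in time and the limit was never seen exceeded.
  def check(threads: Int, opsPerThread: Int): Boolean = {
    val store = new BoundedStore(1000L)
    val violated = new AtomicBoolean(false)
    val done = new CountDownLatch(threads)
    val pool = Executors.newFixedThreadPool(threads)
    for (t <- 0 until threads) pool.execute(new Runnable {
      def run(): Unit = try {
        for (i <- 0 until opsPerThread) {
          val size = 100L + ((t + i) % 5) * 50L
          if (store.tryToPut(size)) {
            if (store.currentMemory > store.maxMemory) violated.set(true)
            store.release(size)
          }
        }
      } finally done.countDown()
    })
    // A timeout turns a deadlock into a test failure instead of a hang.
    val finished = done.await(30, TimeUnit.SECONDS)
    pool.shutdownNow()
    finished && !violated.get
  }
}
```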




[GitHub] spark pull request: [SPARK-1896] Respect spark.master before MASTE...

2014-05-21 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/846

[SPARK-1896] Respect spark.master before MASTER in REPL

The hierarchy for the shell is as follows:
```
MASTER > --master > spark.master (spark-defaults.conf)
```
This is inconsistent with the way we run normal applications, which is:
```
--master > spark.master (spark-defaults.conf) > MASTER
```

I was trying to run a shell locally on a standalone cluster launched 
through the ec2 scripts, which automatically set `MASTER` in spark-env.sh. It 
was surprising to me that `--master` didn't take effect.
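
The ordering used for normal applications amounts to a first-defined-wins lookup. A minimal sketch of that rule, not the code in this pull request: `MasterResolution`, `resolve`, and the parameter names are hypothetical, and the `local[*]` fallback is an assumption added only so the function is total.

```scala
object MasterResolution {
  // Precedence: --master, then spark.master (spark-defaults.conf), then MASTER.
  def resolve(cliMaster: Option[String],
              confMaster: Option[String],
              envMaster: Option[String],
              fallback: String = "local[*]"): String =
    cliMaster.orElse(confMaster).orElse(envMaster).getOrElse(fallback)
}
```

For example, `MasterResolution.resolve(Some("spark://a:7077"), None, Some("spark://c:7077"))` picks the `--master` value, so a `MASTER` set in spark-env.sh no longer overrides it.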

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark shell-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/846.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #846


commit 2cb81c9ed313976e23ae169c20bc930efc259756
Author: Andrew Or andrewo...@gmail.com
Date:   2014-05-21T18:43:36Z

Respect spark.master before MASTER in REPL






[GitHub] spark pull request: [SPARK-1896] Respect spark.master before MASTE...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/846#issuecomment-43798409
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1896] Respect spark.master before MASTE...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/846#issuecomment-43798432
  
Merged build started. 




[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/847

[Typo] Stoped - Stopped



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark yarn-typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #847


commit c1906afd0b8946bf308b388f1928779e50d4fa5b
Author: Andrew Or andrewo...@gmail.com
Date:   2014-05-21T18:50:44Z

Stoped - Stopped






[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43798939
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43798959
  
Merged build started. 




[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/847#issuecomment-43798932
  
 Merged build triggered. 




[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/847#issuecomment-43798958
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread ahirreddy
Github user ahirreddy commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43799163
  
LGTM




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43799234
  
Thanks. I will merge once Travis returns.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/848

[SPARK-1870] Make spark-submit --jars work in yarn-cluster mode.

Send secondary jars to the distributed cache of all containers and add the 
cached jars to the classpath before executors start.

`spark-submit --jars` also works in standalone server and `yarn-client`. 
Thanks to @andrewor14 for testing!

I removed "Doesn't work for drivers in standalone mode with cluster 
deploy mode." from `spark-submit`'s help message, though we haven't tested 
mesos yet.

CC: @dbtsai @sryza

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark yarn-classpath

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/848.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #848


commit dc3c825934cbd62566d09d3f2b4334dcc444879a
Author: Xiangrui Meng m...@databricks.com
Date:   2014-05-21T17:51:43Z

add secondary jars to classpath in yarn

commit 3e7e1c4a2fe1a9d8512c19e56df91b34bea58108
Author: Xiangrui Meng m...@databricks.com
Date:   2014-05-21T18:21:09Z

use sparkConf instead of hadoop conf

commit 11e535434940d0809bd8c1380b2d4a92d87ebb6a
Author: Xiangrui Meng m...@databricks.com
Date:   2014-05-21T18:45:25Z

minor changes

commit 65e04ad8296969445e4ecfaa8921d55fe1e39c74
Author: Xiangrui Meng m...@databricks.com
Date:   2014-05-21T18:52:02Z

update spark-submit help message and add a comment for yarn-client






[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/847#issuecomment-43799605
  
Thanks. I've merged this into master and branch-1.0.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43800111
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

2014-05-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/841#discussion_r12916889
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -274,6 +274,10 @@ class SchemaRDD(
   seed: Long) =
 new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, 
logicalPlan))
 
+  override def count(): Long = {
--- End diff --

Do you mind adding javadoc for this? Just explain that, unlike RDD's 
count, SchemaRDD's count actually invokes the optimizer.




[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43803035
  

@ankurdave is this good now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43803060
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-1896] Respect spark.master before MASTE...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/846#issuecomment-43803106
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1896] Respect spark.master before MASTE...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/846#issuecomment-43803107
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15122/




[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43803413
  
 Build triggered. 




[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43803425
  
Build started. 




[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/847#issuecomment-43803493
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [Typo] Stoped - Stopped

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/847#issuecomment-43803495
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15123/




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43803492
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43804545
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15126/




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43804541
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15125/




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43804540
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/719#issuecomment-43804544
  
Build finished. 




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/697




[GitHub] spark pull request: [SPARK-1519] Support minPartitions param of wh...

2014-05-21 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/697#issuecomment-43810385
  
@rxin @ahirreddy, thanks for the quick response!




[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

2014-05-21 Thread kanzhang
Github user kanzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/841#discussion_r12921105
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -274,6 +274,10 @@ class SchemaRDD(
   seed: Long) =
 new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, logicalPlan))
 
+  override def count(): Long = {
--- End diff --

Sure, will do.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/848#discussion_r12921552
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -479,37 +485,24 @@ object ClientBase {
 
 extraClassPath.foreach(addClasspathEntry)
 
-addClasspathEntry(Environment.PWD.$())
+val cachedSecondaryJarLinks =
+  sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
 // Normally the users app.jar is last in case conflicts with spark jars
 if (sparkConf.get("spark.yarn.user.classpath.first", "false").toBoolean) {
--- End diff --

What's the difference between `spark.yarn.user.classpath.first` and 
`spark.files.userClassPathFirst`? To me, they seem to be the same thing 
under two different configuration names.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/848#discussion_r12921709
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -479,37 +485,24 @@ object ClientBase {
 
 extraClassPath.foreach(addClasspathEntry)
 
-addClasspathEntry(Environment.PWD.$())
+val cachedSecondaryJarLinks =
+  sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
 // Normally the users app.jar is last in case conflicts with spark jars
 if (sparkConf.get("spark.yarn.user.classpath.first", "false").toBoolean) {
--- End diff --

PS: line 47 reads `* 1. In standalone mode, it will launch an 
[[org.apache.spark.deploy.yarn.ApplicationMaster]]`. 
Should it say cluster mode now?




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43812877
  
Thanks. It looks great to me, and better than my patch.

`cachedSecondaryJarLinks.foreach(addPwdClasspathEntry)` is not needed since 
we have `addPwdClasspathEntry(*)`. But later, we may want to change the 
priority of the jars, which is possible because we add them explicitly.

This patch also works for me.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43814337
  
The symbolic links may not be under the PWD. That is why it didn't work 
before.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43814642
  
It worked on the driver before, so the major issue is that those files are 
not in the executors' distributed cache. But I like the idea of adding them 
explicitly so we won't miss anything.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/848#issuecomment-43815204
  
Yes, we can also control the ordering in this way.




[GitHub] spark pull request: [SPARK-1870] Make spark-submit --jars work in ...

2014-05-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/848#discussion_r12923791
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -479,37 +485,24 @@ object ClientBase {
 
 extraClassPath.foreach(addClasspathEntry)
 
-addClasspathEntry(Environment.PWD.$())
+val cachedSecondaryJarLinks =
+  sparkConf.getOption(CONF_SPARK_YARN_SECONDARY_JARS).getOrElse("").split(",")
 // Normally the users app.jar is last in case conflicts with spark jars
 if (sparkConf.get("spark.yarn.user.classpath.first", "false").toBoolean) {
--- End diff --

`spark.files.userClassPathFirst` is a global configuration that controls the 
ordering of dynamically added jars, while `spark.yarn.user.classpath.first` 
applies only to YARN. I agree it is a little confusing, but it is independent 
of this PR. We can create a new JIRA for it.
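
For readers following the ordering question, here is a minimal sketch in plain Scala (no Spark dependency) of how a user-first flag could change classpath order. The `buildClasspath` helper, the `Map`-based configuration, and the jar names are hypothetical and are not taken from `ClientBase`; only the key `spark.yarn.user.classpath.first` and its `false` default come from the diff above.

```scala
object ClasspathOrderSketch {
  // Hypothetical helper for illustration only; not the ClientBase code.
  // conf stands in for SparkConf and is just a key/value map here.
  def buildClasspath(
      conf: Map[String, String],
      sparkEntries: Seq[String],
      userEntries: Seq[String]): Seq[String] = {
    // Same key and default as in the diff: user jars go last unless asked otherwise.
    val userFirst =
      conf.getOrElse("spark.yarn.user.classpath.first", "false").toBoolean
    if (userFirst) userEntries ++ sparkEntries else sparkEntries ++ userEntries
  }

  def main(args: Array[String]): Unit = {
    val sparkEntries = Seq("spark-assembly.jar") // assumed name
    val userEntries = Seq("app.jar", "dep.jar")  // assumed names

    // Flag unset: Spark entries come first.
    println(buildClasspath(Map.empty, sparkEntries, userEntries).mkString(":"))
    // prints spark-assembly.jar:app.jar:dep.jar

    // Flag set: user entries come first.
    val userFirstConf = Map("spark.yarn.user.classpath.first" -> "true")
    println(buildClasspath(userFirstConf, sparkEntries, userEntries).mkString(":"))
    // prints app.jar:dep.jar:spark-assembly.jar
  }
}
```

Adding each jar explicitly, rather than relying only on a wildcard entry, is what makes this kind of reordering possible.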



