Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-21 Thread Haoyuan Li
Qingyang, Aha. Got it. 800MB of data is pretty small. Loading from Tachyon does have a bit of extra overhead, but it brings more benefit when the data size is larger. Also, if you store the table in Tachyon, you can have different Shark servers query the data at the same time. For more
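A minimal Scala sketch of the two storage levels being compared (Spark 1.x API; the Tachyon master URL and table path are made-up placeholders, and spark.tachyonStore.url is assumed to be the relevant config of that era):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("cache-comparison")
      .set("spark.tachyonStore.url", "tachyon://master:19998") // hypothetical master
    val sc = new SparkContext(conf)

    val table = sc.textFile("hdfs:///warehouse/mytable") // hypothetical path

    // In-process cache: fastest for small data, but private to this driver/server.
    val inMemory = table.persist(StorageLevel.MEMORY_ONLY)

    // Tachyon-backed cache: pays a per-read serialization/IPC cost, but the cached
    // bytes live outside the JVM and can be shared by several Shark servers.
    val offHeap = table.persist(StorageLevel.OFF_HEAP)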

Pull request for PLAT-911

2014-07-21 Thread aaronjosephs
I found the Spark JIRA ticket, https://issues.apache.org/jira/browse/SPARK-911, and noticed that no one had done it. It seemed like a useful and interesting feature. I implemented a pull request here (https://github.com/apache/spark/pull/1381) and was looking for some input on it (go easy on me

RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315: https://issues.apache.org/jira/browse/SPARK-2315 Supporting the drop method would make some operations convenient; however, it forces computation of >= 1 partition of the parent RDD, and so it would behave like a
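This is not Erik's implementation, just a rough sketch of the proposed semantics expressed with existing operations, which also illustrates the eagerness concern: zipWithIndex itself launches a job to count partition sizes.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Approximate rdd.drop(n): keep every element whose global index is >= n.
    def dropApprox[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] =
      rdd.zipWithIndex().filter { case (_, i) => i >= n }.map(_._1)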

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Andrew Ash
Personally I'd find the method useful -- I've often had a .csv file with a header row that I want to drop, so I filter it out, which touches all partitions anyway. I don't have any comments on the implementation quite yet though. On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com
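The filter workaround Andrew describes, as a minimal sketch (the file path is hypothetical; it assumes no data row is byte-identical to the header):

    val lines = sc.textFile("data.csv")   // hypothetical path
    val header = lines.first()            // small job: reads only the first element
    val rows = lines.filter(_ != header)  // lazy; evaluated across all partitions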

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - I too would like this feature. Erik's post makes sense. However, shouldn't the RDD also repartition itself after drop to effectively make use of cluster resources? My thinking is that in most use cases (*), one is dropping a small number of rows, and they are in

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but

Re: Examples have SparkContext improperly labeled?

2014-07-21 Thread RJ Nowling
Thanks for the clarification, Sandy. On Mon, Jul 21, 2014 at 12:52 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi RJ, Spark Shell instantiates a SparkContext for you named sc. In other apps, the user instantiates it themselves and can give the variable whatever name they want, e.g. spark.
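A minimal sketch of the distinction Sandy draws: outside the shell the context is just an ordinary value, so the name is up to the application author (app name and master below are placeholders).

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
    val spark = new SparkContext(conf)   // could just as well be named sc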

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformations.

spark.executor.memory is not applicable when running unit test in Jenkins?

2014-07-21 Thread Nan Zhu
Hi, all. I’m running some unit tests for my Spark applications in Jenkins. It seems that even if I set spark.executor.memory to 5g, the value I get with Runtime.getRuntime.maxMemory is still around 1G. Is this saying that Jenkins limits the process to use no more than 1G (by default)? How to change
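One likely explanation, offered as a hedged guess since the message is cut off: when tests run Spark in local mode, the "executor" is the test JVM itself, so spark.executor.memory never takes effect and the heap is whatever the build tool gave the forked test process. A sketch of sbt settings (build.sbt, sbt 0.13-era syntax) that would raise it under that assumption:

    // Fork the test JVM and give it a larger heap; in local mode this, not
    // spark.executor.memory, is what Runtime.getRuntime.maxMemory reflects.
    fork in Test := true
    javaOptions in Test += "-Xmx5g"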

SPARK-1199 has been reverted in branch-1.0

2014-07-21 Thread Patrick Wendell
Just a note - there was a fix in branch-1.0 (and Spark 1.0.1) that introduced a new bug worse than the original one. https://issues.apache.org/jira/browse/SPARK-1199 The original bug was an issue with case classes defined in the repl. The fix caused a separate bug which broke most compound

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Erik Erlandson
- Original Message - Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate

Re: -1s on pull requests?

2014-07-21 Thread Shivaram Venkataraman
One way to do this would be to have a GitHub hook that parses -1s or +1s and posts a commit status [1] (like, say, Travis [2]) right next to the PR. Does anybody know of an existing tool that does this? Shivaram [1] https://github.com/blog/1227-commit-status-api [2]

Re: -1s on pull requests?

2014-07-21 Thread Patrick Wendell
I've always operated under the assumption that if a committer makes a comment on a PR, and that's not addressed, that should block the PR from being merged (even without a specific -1). I don't know of any cases where this has intentionally been violated, but I do think this happens accidentally

Re: -1s on pull requests?

2014-07-21 Thread Henry Saputra
There are ASF guidelines about voting, including code review for patches: http://www.apache.org/foundation/voting.html Some ASF projects require three +1 votes (on the issue, i.e. JIRA or a GitHub PR in this case) for a patch, unless it is tagged with lazy consensus [1] of, say, 48 hours. For

Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
Contributing to Spark https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark needs a line or two about building and testing PySpark. A call-out of run-tests, for example, would be helpful for new contributors to PySpark. Nick

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
For the record, the triggering discussion is here: https://github.com/apache/spark/pull/1505#issuecomment-49671550. I assumed that sbt/sbt test covers all the tests required before submitting a patch, and it appears that it doesn’t. On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas

Re: Dynamic variables in Spark

2014-07-21 Thread Christopher Nguyen
Hi Neil, first off, I'm generally a sympathetic advocate for making changes to Spark internals to make it easier/better/faster/more awesome. In this case, I'm (a) not clear about what you're trying to accomplish, and (b) a bit worried about the proposed solution. On (a): it is stated that you

Suggestion for SPARK-1825

2014-07-21 Thread innowireless TaeYun Kim
Hi, A couple of months ago, I made a pull request to fix https://issues.apache.org/jira/browse/SPARK-1825. My pull request is here: https://github.com/apache/spark/pull/899 But that pull request has problems: - It is Hadoop 2.4.0+ only; it won't compile on versions below it. - The

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
I added an automated testing section: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting Can you take a look to see if it is what you had in mind? On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Nicholas Chammas
Looks good. Does sbt/sbt test cover the same tests as /dev/run-tests? I’m looking at step 5 under “Contributing Code”. Someone contributing to PySpark will want to be directed to run something in addition to (or instead of) sbt/sbt test, I believe. Nick On Mon, Jul 21, 2014 at 11:43 PM,

Re: Contributing to Spark needs PySpark build/test instructions

2014-07-21 Thread Reynold Xin
I missed that bullet point. I removed that and just pointed it towards the instruction. On Mon, Jul 21, 2014 at 9:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looks good. Does sbt/sbt test cover the same tests as /dev/run-tests? I’m looking at step 5 under “Contributing Code”.

Re: Dynamic variables in Spark

2014-07-21 Thread Reynold Xin
Thanks for the thoughtful email, Neil and Christopher. If I understand this correctly, it seems like the dynamic variable is just a variant of the accumulator (a static one since it is a global object). Accumulators are already implemented using thread-local variables under the hood. Am I

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd really try hard to avoid a common drop/dropWhile because they can be expensive to do. Note that I think we will be adding this functionality (ignoring

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Sandy Ryza
It could make sense to add a skipHeader argument to SparkContext.textFile? On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote: If the purpose is for dropping csv headers, perhaps we don't really need a common drop and only one that drops the first line in a file? I'd

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Reynold Xin
Yes, that could work. But it is not as simple as just a binary flag. We might want to skip the first row of every file, or the header only for the first file. The former is not really supported out of the box by the input format, I think? On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza
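For the "header only for the first file" case, a hedged sketch of the usual idiom: drop the first line of partition 0, which holds the start of the first input file as long as that file is non-empty and partition order follows input order (the glob path is hypothetical). Skipping the first row of every file would instead need a per-file view, e.g. a custom input format.

    val noHeader = sc.textFile("data/*.csv")   // hypothetical glob
      .mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0) iter.drop(1) else iter   // drop header from the first partition only
      }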