Re: Foundation policy on releases and Spark nightly builds

2015-07-19 Thread Sean Owen
I am going to make an edit to the download page on the web site to start, as that much seems uncontroversial. Proposed change: reorder sections to put developer-oriented sections at the bottom, including the info on nightly builds: Download Spark / Link with Spark / All Releases / Spark Source

KinesisStreamSuite failing in master branch

2015-07-19 Thread Ted Yu
Hi, I noticed that KinesisStreamSuite fails for both hadoop profiles in master Jenkins builds. From https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/3011/console : KinesisStreamSuite:*** RUN ABORTED *** java.lang.AssertionError:

Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Tathagata Das
The PR to fix this is out. https://github.com/apache/spark/pull/7519 On Sun, Jul 19, 2015 at 6:41 PM, Tathagata Das t...@databricks.com wrote: I am taking care of this right now. On Sun, Jul 19, 2015 at 6:08 PM, Patrick Wendell pwend...@gmail.com wrote: I think we should just revert this

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time? Regards, Madhu

Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Patrick Wendell
I think we should just revert this patch on all affected branches. No reason to leave the builds broken until a fix is in place. - Patrick On Sun, Jul 19, 2015 at 6:03 PM, Josh Rosen rosenvi...@gmail.com wrote: Yep, I emailed TD about it; I think that we may need to make a change to the pull

Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Josh Rosen
Yep, I emailed TD about it; I think that we may need to make a change to the pull request builder to fix this. Pending that, we could just revert the commit that added this. On Sun, Jul 19, 2015 at 5:32 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I noticed that KinesisStreamSuite fails for both

What is the reason there is no out of the box sortByValue API?

2015-07-19 Thread suyog choudhari
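The body of this thread was not captured in the digest, but the usual answer is that sortByValue is expressible by composing existing operators: `rdd.sortBy(_._2)`, or `rdd.map(_.swap).sortByKey()` on a pair RDD. A local-collection sketch of the same composition (hypothetical data, not from the thread):

```scala
// Sort (key, count) pairs by value, two equivalent ways. On a real pair RDD
// the analogous calls are rdd.sortBy(_._2) and rdd.map(_.swap).sortByKey().
val counts = Seq(("spark", 3), ("mesos", 1), ("kinesis", 2))

val byValue = counts.sortBy(_._2)             // sort directly on the value
val swapped = counts.map(_.swap).sortBy(_._1) // swap-then-sort-by-key analogue

assert(byValue.map(_._1) == Seq("mesos", "kinesis", "spark"))
assert(swapped.map(_._2) == byValue.map(_._1)) // both orderings agree
```

The swap-then-sortByKey form predates `sortBy` and is why a dedicated sortByValue was arguably never needed.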

Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Tathagata Das
I am taking care of this right now. On Sun, Jul 19, 2015 at 6:08 PM, Patrick Wendell pwend...@gmail.com wrote: I think we should just revert this patch on all affected branches. No reason to leave the builds broken until a fix is in place. - Patrick On Sun, Jul 19, 2015 at 6:03 PM, Josh

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd =
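Sandy's correction matters because reduceByKey operates on (key, value) pairs, so plain values must first be paired with a count of 1. A local-collection sketch (groupBy plus a sum stands in for reduceByKey, which plain Scala collections lack):

```scala
// Local analogue of the corrected line:
//   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
val rdd = Seq("Spark", "Spark", "Spark", "Mesos")
val groupedRdd = rdd
  .map((_, 1))                                          // pair each value with 1
  .groupBy(_._1)                                        // bucket by value
  .map { case (k, ones) => (k, ones.map(_._2).sum) }    // sum the 1s per key

assert(groupedRdd == Map("Spark" -> 3, "Mesos" -> 1))
```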

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
The user gets to choose what they want to reside in memory. If they call rdd.cache() on the original RDD, it will be in memory. If they call rdd.cache() on the compact RDD, it will be in memory. If cache() is called on both, they'll both be in memory. -Sandy On Sun, Jul 19, 2015 at 11:09 AM,

Re: Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Sorry, maybe I am saying something completely wrong... we have a stream, and we digitize it to create an RDD. The RDD in this case will be just an array of Any. Then we apply a transformation to create a new grouped RDD, and GC should remove the original RDD from memory (if we don't persist it). Will we have a GC step in

Re: Foundation policy on releases and Spark nightly builds

2015-07-19 Thread Patrick Wendell
Sean B., Thank you for giving a thorough reply. I will work with Sean O. and see what we can change to make us more in line with the stated policy. I did some research and it appears that some time between October [1] and December [2] 2006, this page was modified to include stricter policy

Re: Foundation policy on releases and Spark nightly builds

2015-07-19 Thread Patrick Wendell
Hey Sean, One other thing I'd be okay doing is moving the main text about nightly builds to the wiki and just having a header called Nightly builds at the end of the downloads page that says: For developers, Spark maintains nightly builds. More information is available on the [Spark developer

Re: Compact RDD representation

2015-07-19 Thread Juan Rodríguez Hortalá
Hi, My two cents is that this could be interesting if all RDD and pair RDD operations were lifted to work on grouped RDDs. For example, as suggested, a map on grouped RDDs would be more efficient if the original RDD had lots of duplicate entries, but for RDDs with few repetitions I guess you

Re: Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Thanks for the answer! Could you please answer one more question: will we have the original rdd and the grouped rdd in memory at the same time? 2015-07-19 21:04 GMT+03:00 Sandy Ryza sandy.r...@cloudera.com: Edit: the first line should read: val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _) On

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
This functionality already basically exists in Spark. To create the grouped RDD, one can run: val groupedRdd = rdd.reduceByKey(_ + _) To get it back into the original form: groupedRdd.flatMap(x => List.fill(x._1)(x._2)) -Sandy On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман
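A local-collection sketch of the round trip Sandy describes, assuming the compact form is (value, count) pairs as in his later correction. Note the argument order to `List.fill`: the count goes in the first parameter list, the value in the second:

```scala
// Compact a collection of repeated values into (value, count) pairs,
// then expand back to the original multiset with List.fill(count)(value).
val data = Seq("Spark", "Spark", "Spark", "Mesos")

val grouped  = data.map((_, 1)).groupBy(_._1).map { case (v, ones) => (v, ones.size) }
val restored = grouped.toList.flatMap { case (v, n) => List.fill(n)(v) }

// Expansion recovers the same elements (order within groups is not preserved).
assert(restored.sorted == data.sorted)
```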

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
I only used client mode both 1.3 and 1.4 versions on mesos. I skimmed through https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcher.scala. I would actually backport the Cluster Mode feature. Sorry, I don't have an answer for this. On

Compact RDD representation

2015-07-19 Thread Сергей Лихоман
Hi, I am looking for a suitable issue for a Master's Degree project (something around scalability problems and improvements for Spark Streaming), and it seems like introducing a grouped RDD (for example: don't store Spark, Spark, Spark, instead store (Spark, 3)) can: 1. Reduce memory needed for RDD

Re: Compact RDD representation

2015-07-19 Thread Sandy Ryza
In the Spark model, constructing an RDD does not mean storing all its contents in memory. Rather, an RDD is a description of a dataset that enables iterating over its contents, record by record (in parallel). The only time the full contents of an RDD are stored in memory is when a user
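Sandy's point about RDDs being descriptions rather than materialized data can be sketched with a Scala LazyList, which behaves analogously on a single machine: declaring a transformation evaluates nothing until something forces it (this is an analogy, not Spark code):

```scala
// An RDD-style "recipe": nothing is computed when the transformation is
// declared; records are produced only when an action-like call forces them.
var evaluated = 0
val recipe = LazyList.from(1).map { x => evaluated += 1; x * 2 }

assert(evaluated == 0)                 // transformation declared, nothing run
val firstThree = recipe.take(3).toList // forcing three records, like an action
assert(firstThree == List(2, 4, 6))
assert(evaluated == 3)                 // only the forced records were computed
```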