Matrix Multiplication of two RDD[Array[Double]]'s
Hi, I am currently implementing an algorithm involving matrix multiplication. Basically, I have matrices represented as RDD[Array[Double]]. For example, if I have A: RDD[Array[Double]] and B: RDD[Array[Double]], what would be the most efficient way to compute C = A * B? Both A and B are large, so it would not be possible to hold either of them in memory. Thanks a lot for your help! Liquan
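Since neither matrix fits in memory, one common approach is to shuffle on the inner dimension: emit each entry of A keyed by its column index, each entry of B keyed by its row index, join on that shared key, multiply, and sum partial products per output coordinate. In Spark this maps to flatMap/join/reduceByKey on pair RDDs; below is a minimal plain-Python sketch of the same dataflow (the function name and triple representation are invented for illustration, not Spark API):

```python
from collections import defaultdict

def multiply_coo(a_entries, b_entries):
    """Multiply two matrices given as (row, col, value) triples, using
    the join-on-inner-dimension pattern a distributed engine would use."""
    # "Map" phase: key A by its column index k, B by its row index k.
    a_by_k = defaultdict(list)   # k -> [(i, value)]
    b_by_k = defaultdict(list)   # k -> [(j, value)]
    for i, k, v in a_entries:
        a_by_k[k].append((i, v))
    for k, j, v in b_entries:
        b_by_k[k].append((j, v))

    # "Join" phase: for each shared inner index k, emit partial products.
    # "Reduce" phase: sum the partial products per output coordinate (i, j).
    c = defaultdict(float)
    for k in a_by_k.keys() & b_by_k.keys():
        for i, av in a_by_k[k]:
            for j, bv in b_by_k[k]:
                c[(i, j)] += av * bv
    return dict(c)

# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]  =>  C = [[19, 22], [43, 50]]
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
B = [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)]
C = multiply_coo(A, B)
```

In practice a block-partitioned variant (multiplying sub-matrix blocks instead of single entries) greatly reduces shuffle volume and is usually preferred for dense data.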
Re: can RDD be shared across multi spark applications?
Qing Yang, Andy is correct in answering your direct question. At the same time, depending on your context, you may be able to apply a pattern where you turn the single Spark application into a service, and multiple clients of that service can indeed share access to the same RDDs. Several groups have built apps based on this pattern, and we will also show something with this behavior at the upcoming Spark Summit (multiple users collaborating on named DDFs with the same underlying RDDs). Sent while mobile. Pls excuse typos etc. On May 18, 2014 9:40 AM, "Andy Konwinski" wrote: > RDDs cannot currently be shared across multiple SparkContexts without using > something like the Tachyon project (which is a separate project/codebase). > > Andy > On May 16, 2014 2:14 PM, "qingyang li" wrote:
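The "application as a service" pattern described above can be sketched independently of Spark: one long-lived process owns the materialized datasets, and multiple clients reference them by name rather than creating their own contexts. A toy plain-Python illustration of the pattern (the class and method names are invented for this sketch; this is not Spark API):

```python
import threading

class DataService:
    """One long-lived process owns the cached datasets; clients reference
    them by name, analogous to sharing RDDs under a single SparkContext."""

    def __init__(self):
        self._lock = threading.Lock()
        self._named = {}   # name -> materialized dataset

    def register(self, name, compute):
        # Compute and cache once; later callers reuse the same result.
        with self._lock:
            if name not in self._named:
                self._named[name] = compute()
            return self._named[name]

    def query(self, name, fn):
        # Any client can run a function over the shared, already-built data.
        with self._lock:
            data = self._named[name]
        return fn(data)

svc = DataService()
svc.register("events", lambda: list(range(10)))
total_a = svc.query("events", sum)   # client A
count_b = svc.query("events", len)   # client B, same cached data
```

Real deployments put an RPC or REST layer in front of this, but the core idea is the same: the expensive dataset is built once inside one application.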
Re: can RDD be shared across multi spark applications?
RDDs cannot currently be shared across multiple SparkContexts without using something like the Tachyon project (which is a separate project/codebase). Andy On May 16, 2014 2:14 PM, "qingyang li" wrote:
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
Yup, this is a good point: the interface includes stuff like launch scripts and environment variables. However, I do think that the current features of spark-submit can all be supported in future releases. We’ll definitely have a very strict standard for modifying these later on.

Matei

On May 17, 2014, at 2:05 PM, Mridul Muralidharan wrote:
> I would make the case for interface stability, not just API stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.
> [remainder of quoted thread trimmed]
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I don't understand. We never said that interfaces wouldn't change from 0.9 to 1.0. What we are committing to is stability going forward from the 1.0.0 baseline. Nobody is disputing that backward-incompatible behavior or interface changes would be an issue post-1.0.0. The question is whether there is anything apparent now that is expected to require such disruptive changes if we were to commit to the current release candidate as our guaranteed 1.0.0 baseline.

On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan wrote:
> I would make the case for interface stability, not just API stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.
> [remainder of quoted thread trimmed]
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
While developers may appreciate "1.0 == API stability," I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort. I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the momentum of an existing large installed base and with a number of "competitors" (GridGain, H2O, Stratosphere). Despite the stated intention of API stability, the perception (which becomes the reality) of "1.0" is that it's ready for production use: not bullet-proof, but also not known to silently generate erroneous results. Exceptions and crashes are much more tolerated than silent corruption of data. The result may be a reputation that the Spark team is unconcerned about data integrity.

I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due to the lack of zipWithIndex(). zip() with a self-created partitioned range was how I was trying to assign IDs to a collection of nodes in preparation for the GraphX constructor. For the record, it was a frequent Spark committer who escalated it to "blocker"; I did not submit it as such. Partitioning a Scala range isn't just a toy example; it has a real-life use.

I also wonder about the REPL. Cloudera, for example, touts it as key to making Spark a "crossover tool" that data scientists can also use. The REPL can be considered an API of sorts: not a traditional Scala or Java API, of course, but the "API" that a human data analyst would use. With the Scala REPL exhibiting some of the same bad behaviors as the Spark REPL, there is a question of whether the Spark REPL can even be fixed. If the Spark REPL has to be eliminated after 1.0 due to an inability to repair it, that would constitute API instability.

On Saturday, May 17, 2014 2:49 PM, Matei Zaharia wrote:
> As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs".
> [remainder of quoted message trimmed]
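The zipWithIndex() discussed in SPARK-1817 can be implemented without collecting data to one machine: first count each partition, prefix-sum the counts into per-partition start offsets, then index each partition locally in a second pass. A minimal plain-Python sketch of that two-pass idea (an illustration of the approach, not Spark's actual implementation):

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Pair each element with a globally contiguous index: one pass to
    count each partition, an exclusive prefix sum for the per-partition
    start offsets, then a local pass adding offset + local position."""
    sizes = [len(p) for p in partitions]            # "first job": count
    offsets = [0] + list(accumulate(sizes))[:-1]    # exclusive prefix sum
    return [
        [(x, offsets[pid] + i) for i, x in enumerate(part)]
        for pid, part in enumerate(partitions)
    ]

parts = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = zip_with_index(parts)
# Indices are stable and contiguous regardless of uneven partition sizes.
```

This is exactly the kind of ID assignment the GraphX use case above needs, without relying on how parallelize() happens to partition a range.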
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I would make the case for interface stability, not just API stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.

Bugs and code stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users.

Regards
Mridul

On 18-May-2014 2:19 am, "Matei Zaharia" wrote:
> As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs".
> [remainder of quoted thread trimmed]
Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor.

BTW you guys should probably have this discussion on the JIRA rather than the dev list; I think the replies somehow ended up on the dev list.

Matei

On May 17, 2014, at 1:36 AM, Mridul Muralidharan wrote:
> We don't have 3x replication in spark :-)
> And if we use a replicated StorageLevel, while it decreases the odds of failure, it does not eliminate it (since we are not doing a great job with replication anyway from a fault tolerance point of view).
> Also it does take a nontrivial performance hit with replicated levels.
>
> Regards,
> Mridul
> On 17-May-2014 8:16 am, "Xiangrui Meng" wrote:
>
>> With 3x replication, we should be able to achieve fault tolerance. This checkpointed RDD can be cleared if we have another in-memory checkpointed RDD down the line. It can avoid hitting disk if we have enough memory to use. We need to investigate more to find a good solution. -Xiangrui
>>
>> On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan wrote:
>>> Effectively this is persist without fault tolerance. Failure of any node means complete lack of fault tolerance. I would be very skeptical of truncating lineage if it is not reliable.
>>> On 17-May-2014 3:49 am, "Xiangrui Meng (JIRA)" wrote:
>>>> Xiangrui Meng created SPARK-1855:
>>>> Summary: Provide memory-and-local-disk RDD checkpointing
>>>> Key: SPARK-1855
>>>> URL: https://issues.apache.org/jira/browse/SPARK-1855
>>>> Project: Spark
>>>> Issue Type: New Feature
>>>> Components: MLlib, Spark Core
>>>> Affects Versions: 1.0.0
>>>> Reporter: Xiangrui Meng
>>>>
>>>> Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution, but faster. It can help applications that require many iterations.
>>>>
>>>> --
>>>> This message was sent by Atlassian JIRA (v6.2#6252)
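Mridul's point that replication lowers but does not eliminate the odds of data loss is easy to quantify. Assuming independent node failures with probability p, a block is lost only if every one of its replicas fails; a back-of-the-envelope sketch of that arithmetic (illustrative only):

```python
def block_loss_probability(p_node_failure, replication):
    """Probability that a given block is lost, assuming independent node
    failures: all `replication` replicas must fail."""
    return p_node_failure ** replication

def any_block_lost(p_block, n_blocks):
    """Probability that at least one of n_blocks independent blocks is lost."""
    return 1 - (1 - p_block) ** n_blocks

# With 1% independent node failure: 2x replication leaves 1e-4 loss
# probability per block, 3x leaves 1e-6 -- small, but not zero, and with
# a large number of blocks the chance that *some* block is lost can
# still be material, which is Mridul's skepticism in a formula.
p2 = block_loss_probability(0.01, 2)
p3 = block_loss_probability(0.01, 3)
```

This also shows why HDFS-based checkpoints (with their own replication plus re-replication on failure) remain the more reliable, if slower, option.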
Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
BTW, for what it’s worth, I agree this is a good option to add; the only tricky thing will be making sure the checkpoint blocks are not garbage-collected by the block store. I don’t think they will be, though.

Matei

On May 17, 2014, at 2:20 PM, Matei Zaharia wrote:
> We do actually have replicated StorageLevels in Spark. You can use MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom replication factor.
> [remainder of quoted thread trimmed]
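The value of the checkpointing proposed in SPARK-1855 comes from truncating lineage: recovery cost is proportional to how many transformation steps must be replayed from the most recent reliable copy. A toy cost model of that idea (illustrative only, not Spark's actual scheduler logic):

```python
def recompute_cost(num_steps, checkpoint_every=None):
    """Toy model of lineage truncation: the cost to recover lost data is
    the number of transformation steps to replay, counted from the most
    recent checkpoint (or from the original source if there is none)."""
    if checkpoint_every is None:
        return num_steps                    # no checkpoint: full replay
    return num_steps % checkpoint_every     # replay since last checkpoint

# An iterative job at step 1037:
full = recompute_cost(1037)                          # 1037 steps to replay
truncated = recompute_cost(1037, checkpoint_every=100)  # only 37 steps
```

Whether the checkpoint lives on HDFS or in replicated local blocks only changes how likely that "reliable copy" is to survive, which is the trade-off debated above.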
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
On 18-May-2014 1:45 am, "Mark Hamstra" wrote:
>
> I'm not trying to muzzle the discussion. All I am saying is that we don't need to have the same discussion about 0.10 vs. 1.0 that we already had.

Agreed, no point in repeating the same discussion ... I am also trying to understand what the concerns are.

Specifically though, the scope of 1.0 (in terms of changes) went up quite a bit; a lot of these are new changes and features, not just the initially envisioned API changes and stability fixes. If this is raising concerns, particularly since a lot of users are depending on the stability of Spark's interfaces (API, env, scripts, behavior), I want to understand better what they are. If they are legitimately serious enough, we will need to revisit the decision to go to 1.0 instead of 0.10 ... I hope we don't need to, though, given how late we are in the dev cycle.

Regards
Mridul

> If you can tell me about specific changes in the current release candidate that occasion new arguments for why a 1.0 release is an unacceptable idea, then I'm listening.
>
> On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan wrote:
> > On 17-May-2014 11:40 pm, "Mark Hamstra" wrote:
> > >
> > > That is a past issue that we don't need to be re-opening now. The present
> >
> > Huh? If we need to revisit based on changed circumstances, we must - the scope of changes introduced in this release was definitely not anticipated when the 1.0 vs 0.10 discussion happened.
> >
> > If folks are worried about stability of core, it is a valid concern IMO.
> >
> > Having said that, I am still ok with going to 1.0; but if a conversation starts about the need for 1.0 vs going to 0.10, I want to hear more and possibly allay the concerns, and not try to muzzle the discussion.
> >
> > Regards
> > Mridul
> >
> > > issue, and what I am asking, is which pending bug fixes does anyone anticipate will require breaking the public API guaranteed in rc9
> > >
> > > On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan wrote:
> > > > We made incompatible API changes whose impact we don't know yet completely: both from an implementation and a usage point of view.
> > > >
> > > > We had the option of getting real-world feedback from the user community if we had gone to 0.10, but the Spark developers seemed to be in a hurry to get to 1.0 - so I made my opinion known but left it to the wisdom of the larger group of committers to decide ... I did not think it was critical enough to do a binding -1 on.
> > > >
> > > > Regards
> > > > Mridul
> > > > On 17-May-2014 9:43 pm, "Mark Hamstra" wrote:
> > > > > Which of the unresolved bugs in spark-core do you think will require an API-breaking change to fix? If there are none of those, then we are still essentially on track for a 1.0.0 release.
> > > > >
> > > > > The number of contributions and pace of change now is quite high, but I don't think that waiting for the pace to slow before releasing 1.0 is viable. If Spark's short history is any guide to its near future, the pace will not slow by any significant amount for any noteworthy length of time, but rather will continue to increase. What we need to be aiming for, I think, is to have the great majority of those new contributions being made to MLlib, GraphX, SparkSQL and other areas of the code that we have clearly marked as not frozen in 1.x. I think we are already seeing that, but if I am just not recognizing breakage of our semantic versioning guarantee that will be forced on us by some pending changes, now would be a good time to set me straight.
> > > > >
> > > > > [remainder of quoted thread trimmed]
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
As others have said, the 1.0 milestone is about API stability, not about saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I’ve seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development. With a project as fast-growing as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors in time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API. I’ll also note that the issues marked “blocker” were marked so by their reporters, since the reporter can set the priority. I don’t consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker — it’s a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that’s just completely broken), we will delay the release for that, but at some point you have to say “okay, this fix will go into the next maintenance release”. Maybe we need to write a clear policy for what the issue priorities mean. 
Finally, I believe it’s much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much larger projects than us, like Linux, do, and it’s the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you’d like to improve, just add a new one and maybe deprecate the old one — at some point we have to respect our users and let them know that code they write today will still run tomorrow. Matei On May 17, 2014, at 10:32 AM, Kan Zhang wrote: > +1 on the running commentary here, non-binding of course :-) > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash wrote: > >> +1 on the next release feeling more like a 0.10 than a 1.0 >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" wrote: >> >>> I had echoed similar sentiments a while back when there was a discussion >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api >>> changes, add missing functionality, go through a hardening release before >>> 1.0 >>> >>> But the community preferred a 1.0 :-) >>> >>> Regards, >>> Mridul >>> >>> On 17-May-2014 3:19 pm, "Sean Owen" wrote: On this note, non-binding commentary: Releases happen in local minima of change, usually created by internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's getting like the first year of mainstream battle-testing in a month. It's been very hard to freeze anything! I see a number of non-trivial issues being reported, and I don't think it has been possible to triage all of them, even. 
Given the high rate of change, my instinct would have been to release 0.10.0 now. But won't it always be very busy? I do think the rate of significant issues will slow down. Version ain't nothing but a number, but if it has any meaning it's the semantic versioning meaning. 1.0 imposes extra handicaps around striving to maintain backwards-compatibility. That may end up being bent to fit in important changes that are going to be required in this continuing period of change. Hadoop does this all the time unfortunately and gets away with it, I suppose -- minor version releases are really major. (On the other extreme, HBase is at 0.98 and quite production-ready.) Just consider this a second vote for focus on fixes and 1.0.x rather than new features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread. On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra >> >>> wrote: > +1,
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I'm not trying to muzzle the discussion. All I am saying is that we don't need to have the same discussion about 0.10 vs. 1.0 that we already had. If you can tell me about specific changes in the current release candidate that occasion new arguments for why a 1.0 release is an unacceptable idea, then I'm listening. On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan wrote: > On 17-May-2014 11:40 pm, "Mark Hamstra" wrote: > > > > That is a past issue that we don't need to be re-opening now. The > present > > Huh ? If we need to revisit based on changed circumstances, we must - the > scope of changes introduced in this release was definitely not anticipated > when 1.0 vs 0.10 discussion happened. > > If folks are worried about stability of core; it is a valid concern IMO. > > Having said that, I am still ok with going to 1.0; but if a conversation > starts about need for 1.0 vs going to 0.10 I want to hear more and possibly > allay the concerns and not try to muzzle the discussion. > > > Regards > Mridul > > > issue, and what I am asking, is which pending bug fixes does anyone > > anticipate will require breaking the public API guaranteed in rc9 > > > > > > On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan >wrote: > > > > > We made incompatible api changes whose impact we don't know yet > completely > > > : both from implementation and usage point of view. > > > > > > We had the option of getting real-world feedback from the user > community if > > > we had gone to 0.10 but the spark developers seemed to be in a hurry to > get > > > to 1.0 - so I made my opinion known but left it to the wisdom of larger > > > group of committers to decide ... I did not think it was critical > enough to > > > do a binding -1 on. > > > > > > Regards > > > Mridul > > > On 17-May-2014 9:43 pm, "Mark Hamstra" > wrote: > > > > > > > Which of the unresolved bugs in spark-core do you think will require > an > > > > API-breaking change to fix? 
If there are none of those, then we are > > > still > > > > essentially on track for a 1.0.0 release. > > > > > > > > The number of contributions and pace of change now is quite high, but > I > > > > don't think that waiting for the pace to slow before releasing 1.0 is > > > > viable. If Spark's short history is any guide to its near future, > the > > > pace > > > > will not slow by any significant amount for any noteworthy length of > > > time, > > > > but rather will continue to increase. What we need to be aiming for, > I > > > > think, is to have the great majority of those new contributions being > > > made > > > > to MLlib, GraphX, SparkSQL and other areas of the code that we have > > > > clearly marked as not frozen in 1.x. I think we are already seeing > that, > > > > but if I am just not recognizing breakage of our semantic versioning > > > > guarantee that will be forced on us by some pending changes, now > would > > > be a > > > > good time to set me straight. > > > > > > > > > > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan < > mri...@gmail.com > > > > >wrote: > > > > > > > > > I had echoed similar sentiments a while back when there was a > > > discussion > > > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the > api > > > > > changes, add missing functionality, go through a hardening release > > > before > > > > > 1.0 > > > > > > > > > > But the community preferred a 1.0 :-) > > > > > > > > > > Regards, > > > > > Mridul > > > > > > > > > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > > > > > > > > > On this note, non-binding commentary: > > > > > > > > > > > > Releases happen in local minima of change, usually created by > > > > > > internally enforced code freeze. Spark is incredibly busy now due > to > > > > > > external factors -- recently a TLP, recently discovered by a > large > > > new > > > > > > audience, ease of contribution enabled by Github. 
It's getting > like > > > > > > the first year of mainstream battle-testing in a month. It's been > > > very > > > > > > hard to freeze anything! I see a number of non-trivial issues > being > > > > > > reported, and I don't think it has been possible to triage all of > > > > > > them, even. > > > > > > > > > > > > Given the high rate of change, my instinct would have been to > release > > > > > > 0.10.0 now. But won't it always be very busy? I do think the rate > of > > > > > > significant issues will slow down. > > > > > > > > > > > > Version ain't nothing but a number, but if it has any meaning > it's > > > the > > > > > > semantic versioning meaning. 1.0 imposes extra handicaps around > > > > > > striving to maintain backwards-compatibility. That may end up > being > > > > > > bent to fit in important changes that are going to be required in > > > this > > > > > > continuing period of change. Hadoop does this all the time > > > > > > unfortunately and gets away with it, I suppose -- minor version > > > > > > releases are really major. (On the other extreme, HBase is at > 0.
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
+1 Reran my tests from rc5: * Built the release from source. * Compiled Java and Scala apps that interact with HDFS against it. * Ran them in local mode. * Ran them against a pseudo-distributed YARN cluster in both yarn-client mode and yarn-cluster mode. On Sat, May 17, 2014 at 10:08 AM, Andrew Or wrote: > +1 > > > 2014-05-17 8:53 GMT-07:00 Mark Hamstra : > > +1 > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell > >wrote: > > > I'll start the voting with a +1. > > > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell > > > wrote: > > > > Please vote on releasing the following candidate as Apache Spark > > version > > > 1.0.0! > > > > This has one bug fix and one minor feature on top of rc8: > > > > SPARK-1864: https://github.com/apache/spark/pull/808 > > > > SPARK-1808: https://github.com/apache/spark/pull/799 > > > > > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947): > > > > > > > > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 > > > > > > > > The release files, including signatures, digests, etc. can be found > at: > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/ > > > > > > > > Release artifacts are signed with the following key: > > > > https://people.apache.org/keys/committer/pwendell.asc > > > > > > > > The staging repository for this release can be found at: > > > > > > https://repository.apache.org/content/repositories/orgapachespark-1017/ > > > > > > > > The documentation corresponding to this release can be found at: > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ > > > > > > > > Please vote on releasing this package as Apache Spark 1.0.0! > > > > > > > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes if > > > > a majority of at least 3 +1 PMC votes are cast. > > > > > > > > [ ] +1 Release this package as Apache Spark 1.0.0 > > > > [ ] -1 Do not release this package because ... 
> > > > > > > > To learn more about Apache Spark, please see > > > > http://spark.apache.org/ > > > > > > > > == API Changes == > > > > We welcome users to compile Spark applications against 1.0. There are > > > > a few API changes in this release. Here are links to the associated > > > > upgrade guides - user facing changes have been kept as small as > > > > possible. > > > > > > > > changes to ML vector specification: > > > > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 > > > > > > > > changes to the Java API: > > > > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > > > > > > > changes to the streaming API: > > > > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x > > > > > > > > changes to the GraphX API: > > > > > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 > > > > > > > > coGroup and related functions now return Iterable[T] instead of > Seq[T] > > > > ==> Call toSeq on the result to restore the old behavior > > > > > > > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > > > > ==> Call toSeq on the result to restore old behavior > > > > > >
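For anyone updating an application for the coGroup change quoted above, the suggested toSeq workaround is a one-line adjustment. A minimal Scala sketch (the local-mode setup, object name, and sample data here are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoGroupMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cogroup-migration-demo").setMaster("local[2]"))

    val a = sc.parallelize(Seq((1, "x"), (1, "y"), (2, "z")))
    val b = sc.parallelize(Seq((1, "p")))

    // As of 1.0, cogroup yields (K, (Iterable[V], Iterable[W])) instead of Seqs.
    val grouped = a.cogroup(b)

    // Per the release notes, calling toSeq restores the pre-1.0 Seq-based shape.
    val asSeqs = grouped.mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }

    asSeqs.collect().foreach(println)
    sc.stop()
  }
}
```

The same pattern applies to SparkContext.jarOfClass: it now returns Option[String], so appending .toSeq recovers the old Seq[String] result.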
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
On 17-May-2014 11:40 pm, "Mark Hamstra" wrote: > > That is a past issue that we don't need to be re-opening now. The present Huh ? If we need to revisit based on changed circumstances, we must - the scope of changes introduced in this release was definitely not anticipated when 1.0 vs 0.10 discussion happened. If folks are worried about stability of core; it is a valid concern IMO. Having said that, I am still ok with going to 1.0; but if a conversation starts about need for 1.0 vs going to 0.10 I want to hear more and possibly allay the concerns and not try to muzzle the discussion. Regards Mridul > issue, and what I am asking, is which pending bug fixes does anyone > anticipate will require breaking the public API guaranteed in rc9 > > > On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan wrote: > > > We made incompatible api changes whose impact we don't know yet completely > > : both from implementation and usage point of view. > > > > We had the option of getting real-world feedback from the user community if > > we had gone to 0.10 but the spark developers seemed to be in a hurry to get > > to 1.0 - so I made my opinion known but left it to the wisdom of larger > > group of committers to decide ... I did not think it was critical enough to > > do a binding -1 on. > > > > Regards > > Mridul > > On 17-May-2014 9:43 pm, "Mark Hamstra" wrote: > > > > > Which of the unresolved bugs in spark-core do you think will require an > > > API-breaking change to fix? If there are none of those, then we are > > still > > > essentially on track for a 1.0.0 release. > > > > > > The number of contributions and pace of change now is quite high, but I > > > don't think that waiting for the pace to slow before releasing 1.0 is > > > viable. If Spark's short history is any guide to its near future, the > > pace > > > will not slow by any significant amount for any noteworthy length of > > time, > > > but rather will continue to increase. 
What we need to be aiming for, I > > > think, is to have the great majority of those new contributions being > > made > > > to MLlib, GraphX, SparkSQL and other areas of the code that we have > > > clearly marked as not frozen in 1.x. I think we are already seeing that, > > > but if I am just not recognizing breakage of our semantic versioning > > > guarantee that will be forced on us by some pending changes, now would > > be a > > > good time to set me straight. > > > > > > > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan > > >wrote: > > > > > > > I had echoed similar sentiments a while back when there was a > > discussion > > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > > > > changes, add missing functionality, go through a hardening release > > before > > > > 1.0 > > > > > > > > But the community preferred a 1.0 :-) > > > > > > > > Regards, > > > > Mridul > > > > > > > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > > > > > > > On this note, non-binding commentary: > > > > > > > > > > Releases happen in local minima of change, usually created by > > > > > internally enforced code freeze. Spark is incredibly busy now due to > > > > > external factors -- recently a TLP, recently discovered by a large > > new > > > > > audience, ease of contribution enabled by Github. It's getting like > > > > > the first year of mainstream battle-testing in a month. It's been > > very > > > > > hard to freeze anything! I see a number of non-trivial issues being > > > > > reported, and I don't think it has been possible to triage all of > > > > > them, even. > > > > > > > > > > Given the high rate of change, my instinct would have been to release > > > > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > > > > significant issues will slow down. > > > > > > > > > > Version ain't nothing but a number, but if it has any meaning it's > > the > > > > > semantic versioning meaning. 
1.0 imposes extra handicaps around > > > > > striving to maintain backwards-compatibility. That may end up being > > > > > bent to fit in important changes that are going to be required in > > this > > > > > continuing period of change. Hadoop does this all the time > > > > > unfortunately and gets away with it, I suppose -- minor version > > > > > releases are really major. (On the other extreme, HBase is at 0.98 > > and > > > > > quite production-ready.) > > > > > > > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > > > > than new features and 1.x. I think there are a few steps that could > > > > > streamline triage of this flood of contributions, and make all of > > this > > > > > easier, but that's for another thread. > > > > > > > > > > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra < > > m...@clearstorydata.com > > > > > > > > wrote: > > > > > > +1, but just barely. We've got quite a number of outstanding bugs > > > > > > identified, and many of them have fixes in progress. I'd hate to > > see > > > >
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
+1 on the running commentary here, non-binding of course :-) On Sat, May 17, 2014 at 8:44 AM, Andrew Ash wrote: > +1 on the next release feeling more like a 0.10 than a 1.0 > On May 17, 2014 4:38 AM, "Mridul Muralidharan" wrote: > > > I had echoed similar sentiments a while back when there was a discussion > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > > changes, add missing functionality, go through a hardening release before > > 1.0 > > > > But the community preferred a 1.0 :-) > > > > Regards, > > Mridul > > > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > > > On this note, non-binding commentary: > > > > > > Releases happen in local minima of change, usually created by > > > internally enforced code freeze. Spark is incredibly busy now due to > > > external factors -- recently a TLP, recently discovered by a large new > > > audience, ease of contribution enabled by Github. It's getting like > > > the first year of mainstream battle-testing in a month. It's been very > > > hard to freeze anything! I see a number of non-trivial issues being > > > reported, and I don't think it has been possible to triage all of > > > them, even. > > > > > > Given the high rate of change, my instinct would have been to release > > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > > significant issues will slow down. > > > > > > Version ain't nothing but a number, but if it has any meaning it's the > > > semantic versioning meaning. 1.0 imposes extra handicaps around > > > striving to maintain backwards-compatibility. That may end up being > > > bent to fit in important changes that are going to be required in this > > > continuing period of change. Hadoop does this all the time > > > unfortunately and gets away with it, I suppose -- minor version > > > releases are really major. (On the other extreme, HBase is at 0.98 and > > > quite production-ready.) 
> > > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > > than new features and 1.x. I think there are a few steps that could > > > streamline triage of this flood of contributions, and make all of this > > > easier, but that's for another thread. > > > > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra > > > wrote: > > > > +1, but just barely. We've got quite a number of outstanding bugs > > > > identified, and many of them have fixes in progress. I'd hate to see > > those > > > > efforts get lost in a post-1.0.0 flood of new features targeted at > > 1.1.0 -- > > > > in other words, I'd like to see 1.0.1 retain a high priority relative > > to > > > > 1.1.0. > > > > > > > > Looking through the unresolved JIRAs, it doesn't look like any of the > > > > identified bugs are show-stoppers or strictly regressions (although I > > will > > > > note that one that I have in progress, SPARK-1749, is a bug that we > > > > introduced with recent work -- it's not strictly a regression because > > we > > > > had equally bad but different behavior when the DAGScheduler > exceptions > > > > weren't previously being handled at all vs. being slightly > mis-handled > > > > now), so I'm not currently seeing a reason not to release. > > >
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
That is a past issue that we don't need to be re-opening now. The present issue, and what I am asking, is which pending bug fixes does anyone anticipate will require breaking the public API guaranteed in rc9? On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan wrote: > We made incompatible api changes whose impact we don't know yet completely > : both from implementation and usage point of view. > > We had the option of getting real-world feedback from the user community if > we had gone to 0.10 but the spark developers seemed to be in a hurry to get > to 1.0 - so I made my opinion known but left it to the wisdom of larger > group of committers to decide ... I did not think it was critical enough to > do a binding -1 on. > > Regards > Mridul > On 17-May-2014 9:43 pm, "Mark Hamstra" wrote: > > > Which of the unresolved bugs in spark-core do you think will require an > > API-breaking change to fix? If there are none of those, then we are > still > > essentially on track for a 1.0.0 release. > > > > The number of contributions and pace of change now is quite high, but I > > don't think that waiting for the pace to slow before releasing 1.0 is > > viable. If Spark's short history is any guide to its near future, the > pace > > will not slow by any significant amount for any noteworthy length of > time, > > but rather will continue to increase. What we need to be aiming for, I > > think, is to have the great majority of those new contributions being > made > > to MLlib, GraphX, SparkSQL and other areas of the code that we have > > clearly marked as not frozen in 1.x. I think we are already seeing that, > > but if I am just not recognizing breakage of our semantic versioning > > guarantee that will be forced on us by some pending changes, now would > be a > > good time to set me straight. 
> > > > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan > >wrote: > > > > > I had echoed similar sentiments a while back when there was a > discussion > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > > > changes, add missing functionality, go through a hardening release > before > > > 1.0 > > > > > > But the community preferred a 1.0 :-) > > > > > > Regards, > > > Mridul > > > > > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > > > > > On this note, non-binding commentary: > > > > > > > > Releases happen in local minima of change, usually created by > > > > internally enforced code freeze. Spark is incredibly busy now due to > > > > external factors -- recently a TLP, recently discovered by a large > new > > > > audience, ease of contribution enabled by Github. It's getting like > > > > the first year of mainstream battle-testing in a month. It's been > very > > > > hard to freeze anything! I see a number of non-trivial issues being > > > > reported, and I don't think it has been possible to triage all of > > > > them, even. > > > > > > > > Given the high rate of change, my instinct would have been to release > > > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > > > significant issues will slow down. > > > > > > > > Version ain't nothing but a number, but if it has any meaning it's > the > > > > semantic versioning meaning. 1.0 imposes extra handicaps around > > > > striving to maintain backwards-compatibility. That may end up being > > > > bent to fit in important changes that are going to be required in > this > > > > continuing period of change. Hadoop does this all the time > > > > unfortunately and gets away with it, I suppose -- minor version > > > > releases are really major. (On the other extreme, HBase is at 0.98 > and > > > > quite production-ready.) > > > > > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > > > than new features and 1.x. 
I think there are a few steps that could > > > > streamline triage of this flood of contributions, and make all of > this > > > > easier, but that's for another thread. > > > > > > > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra < > m...@clearstorydata.com > > > > > > wrote: > > > > > +1, but just barely. We've got quite a number of outstanding bugs > > > > > identified, and many of them have fixes in progress. I'd hate to > see > > > those > > > > > efforts get lost in a post-1.0.0 flood of new features targeted at > > > 1.1.0 -- > > > > > in other words, I'd like to see 1.0.1 retain a high priority > relative > > > to > > > > > 1.1.0. > > > > > > > > > > Looking through the unresolved JIRAs, it doesn't look like any of > the > > > > > identified bugs are show-stoppers or strictly regressions > (although I > > > will > > > > > note that one that I have in progress, SPARK-1749, is a bug that we > > > > > introduced with recent work -- it's not strictly a regression > because > > > we > > > > > had equally bad but different behavior when the DAGScheduler > > exceptions > > > > > weren't previously being handled at all vs. being slightly > > mis-handled > > > >
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
+1 2014-05-17 8:53 GMT-07:00 Mark Hamstra : > +1 > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell >wrote: > > > I'll start the voting with a +1. > > > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell > > wrote: > > > Please vote on releasing the following candidate as Apache Spark > version > > 1.0.0! > > > This has one bug fix and one minor feature on top of rc8: > > > SPARK-1864: https://github.com/apache/spark/pull/808 > > > SPARK-1808: https://github.com/apache/spark/pull/799 > > > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947): > > > > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 > > > > > > The release files, including signatures, digests, etc. can be found at: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/ > > > > > > Release artifacts are signed with the following key: > > > https://people.apache.org/keys/committer/pwendell.asc > > > > > > The staging repository for this release can be found at: > > > > https://repository.apache.org/content/repositories/orgapachespark-1017/ > > > > > > The documentation corresponding to this release can be found at: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ > > > > > > Please vote on releasing this package as Apache Spark 1.0.0! > > > > > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes if > > > a majority of at least 3 +1 PMC votes are cast. > > > > > > [ ] +1 Release this package as Apache Spark 1.0.0 > > > [ ] -1 Do not release this package because ... > > > > > > To learn more about Apache Spark, please see > > > http://spark.apache.org/ > > > > > > == API Changes == > > > We welcome users to compile Spark applications against 1.0. There are > > > a few API changes in this release. Here are links to the associated > > > upgrade guides - user facing changes have been kept as small as > > > possible. 
> > > > > > changes to ML vector specification: > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 > > > > > > changes to the Java API: > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > > > > > changes to the streaming API: > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x > > > > > > changes to the GraphX API: > > > > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 > > > > > > coGroup and related functions now return Iterable[T] instead of Seq[T] > > > ==> Call toSeq on the result to restore the old behavior > > > > > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > > > ==> Call toSeq on the result to restore old behavior > > >
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
We made incompatible api changes whose impact we don't know yet completely : both from implementation and usage point of view. We had the option of getting real-world feedback from the user community if we had gone to 0.10 but the spark developers seemed to be in a hurry to get to 1.0 - so I made my opinion known but left it to the wisdom of larger group of committers to decide ... I did not think it was critical enough to do a binding -1 on. Regards Mridul On 17-May-2014 9:43 pm, "Mark Hamstra" wrote: > Which of the unresolved bugs in spark-core do you think will require an > API-breaking change to fix? If there are none of those, then we are still > essentially on track for a 1.0.0 release. > > The number of contributions and pace of change now is quite high, but I > don't think that waiting for the pace to slow before releasing 1.0 is > viable. If Spark's short history is any guide to its near future, the pace > will not slow by any significant amount for any noteworthy length of time, > but rather will continue to increase. What we need to be aiming for, I > think, is to have the great majority of those new contributions being made > to MLlib, GraphX, SparkSQL and other areas of the code that we have > clearly marked as not frozen in 1.x. I think we are already seeing that, > but if I am just not recognizing breakage of our semantic versioning > guarantee that will be forced on us by some pending changes, now would be a > good time to set me straight. > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan >wrote: > > > I had echoed similar sentiments a while back when there was a discussion > > around 0.10 vs 1.0 ... 
I would have preferred 0.10 to stabilize the api > > changes, add missing functionality, go through a hardening release before > > 1.0 > > > > But the community preferred a 1.0 :-) > > > > Regards, > > Mridul > > > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > > > On this note, non-binding commentary: > > > > > > Releases happen in local minima of change, usually created by > > > internally enforced code freeze. Spark is incredibly busy now due to > > > external factors -- recently a TLP, recently discovered by a large new > > > audience, ease of contribution enabled by Github. It's getting like > > > the first year of mainstream battle-testing in a month. It's been very > > > hard to freeze anything! I see a number of non-trivial issues being > > > reported, and I don't think it has been possible to triage all of > > > them, even. > > > > > > Given the high rate of change, my instinct would have been to release > > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > > significant issues will slow down. > > > > > > Version ain't nothing but a number, but if it has any meaning it's the > > > semantic versioning meaning. 1.0 imposes extra handicaps around > > > striving to maintain backwards-compatibility. That may end up being > > > bent to fit in important changes that are going to be required in this > > > continuing period of change. Hadoop does this all the time > > > unfortunately and gets away with it, I suppose -- minor version > > > releases are really major. (On the other extreme, HBase is at 0.98 and > > > quite production-ready.) > > > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > > than new features and 1.x. I think there are a few steps that could > > > streamline triage of this flood of contributions, and make all of this > > > easier, but that's for another thread. > > > > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra > > > wrote: > > > > +1, but just barely. 
We've got quite a number of outstanding bugs > > > > identified, and many of them have fixes in progress. I'd hate to see > > those > > > > efforts get lost in a post-1.0.0 flood of new features targeted at > > 1.1.0 -- > > > > in other words, I'd like to see 1.0.1 retain a high priority relative > > to > > > > 1.1.0. > > > > > > > > Looking through the unresolved JIRAs, it doesn't look like any of the > > > > identified bugs are show-stoppers or strictly regressions (although I > > will > > > > note that one that I have in progress, SPARK-1749, is a bug that we > > > > introduced with recent work -- it's not strictly a regression because > > we > > > > had equally bad but different behavior when the DAGScheduler > exceptions > > > > weren't previously being handled at all vs. being slightly > mis-handled > > > > now), so I'm not currently seeing a reason not to release. > > >
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
On Sat, May 17, 2014 at 4:52 PM, Mark Hamstra wrote: > Which of the unresolved bugs in spark-core do you think will require an > API-breaking change to fix? If there are none of those, then we are still > essentially on track for a 1.0.0 release. I don't have a particular one in mind, but look at https://issues.apache.org/jira/browse/SPARK-1817?filter=12327229 for example. There are 10 issues marked blocker or critical, that are targeted at Core / 1.0.0 (or unset). Many are probably not critical, not for 1.0, or wouldn't require a big change to fix. But has this been reviewed then -- can you tell? I'd be happy for someone to tell me to stop worrying, yeah, there's nothing too big here. > The number of contributions and pace of change now is quite high, but I > don't think that waiting for the pace to slow before releasing 1.0 is > viable. If Spark's short history is any guide to its near future, the pace > will not slow by any significant amount for any noteworthy length of time, I think we'd agree core is the most important part. I'd humbly suggest fixes and improvements to core remain exceptionally important after 1.0 and there is a long line of proposed changes, most good. Would be great to really burn that down. Maybe that is the kind of thing I personally would have preferred to see before a 1.0, but it's not up to me and there are other factors at work here. I don't object strongly or anything.
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
+1 on the next release feeling more like a 0.10 than a 1.0 On May 17, 2014 4:38 AM, "Mridul Muralidharan" wrote: > I had echoed similar sentiments a while back when there was a discussion > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > changes, add missing functionality, go through a hardening release before > 1.0 > > But the community preferred a 1.0 :-) > > Regards, > Mridul > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > On this note, non-binding commentary: > > > > Releases happen in local minima of change, usually created by > > internally enforced code freeze. Spark is incredibly busy now due to > > external factors -- recently a TLP, recently discovered by a large new > > audience, ease of contribution enabled by Github. It's getting like > > the first year of mainstream battle-testing in a month. It's been very > > hard to freeze anything! I see a number of non-trivial issues being > > reported, and I don't think it has been possible to triage all of > > them, even. > > > > Given the high rate of change, my instinct would have been to release > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > significant issues will slow down. > > > > Version ain't nothing but a number, but if it has any meaning it's the > > semantic versioning meaning. 1.0 imposes extra handicaps around > > striving to maintain backwards-compatibility. That may end up being > > bent to fit in important changes that are going to be required in this > > continuing period of change. Hadoop does this all the time > > unfortunately and gets away with it, I suppose -- minor version > > releases are really major. (On the other extreme, HBase is at 0.98 and > > quite production-ready.) > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > than new features and 1.x. 
I think there are a few steps that could > > streamline triage of this flood of contributions, and make all of this > > easier, but that's for another thread. > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra > wrote: > > > +1, but just barely. We've got quite a number of outstanding bugs > > > identified, and many of them have fixes in progress. I'd hate to see > those > > > efforts get lost in a post-1.0.0 flood of new features targeted at > 1.1.0 -- > > > in other words, I'd like to see 1.0.1 retain a high priority relative > to > > > 1.1.0. > > > > > > Looking through the unresolved JIRAs, it doesn't look like any of the > > > identified bugs are show-stoppers or strictly regressions (although I > will > > > note that one that I have in progress, SPARK-1749, is a bug that we > > > introduced with recent work -- it's not strictly a regression because > we > > > had equally bad but different behavior when the DAGScheduler exceptions > > > weren't previously being handled at all vs. being slightly mis-handled > > > now), so I'm not currently seeing a reason not to release. >
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
+1 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell wrote: > I'll start the voting with a +1. > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell > wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > > This has one bug fix and one minor feature on top of rc8: > > SPARK-1864: https://github.com/apache/spark/pull/808 > > SPARK-1808: https://github.com/apache/spark/pull/799 > > > > The tag to be voted on is v1.0.0-rc9 (commit 920f947): > > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 > > > > The release files, including signatures, digests, etc. can be found at: > > http://people.apache.org/~pwendell/spark-1.0.0-rc9/ > > > > Release artifacts are signed with the following key: > > https://people.apache.org/keys/committer/pwendell.asc > > > > The staging repository for this release can be found at: > > https://repository.apache.org/content/repositories/orgapachespark-1017/ > > > > The documentation corresponding to this release can be found at: > > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ > > > > Please vote on releasing this package as Apache Spark 1.0.0! > > > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes if > > amajority of at least 3 +1 PMC votes are cast. > > > > [ ] +1 Release this package as Apache Spark 1.0.0 > > [ ] -1 Do not release this package because ... > > > > To learn more about Apache Spark, please see > > http://spark.apache.org/ > > > > == API Changes == > > We welcome users to compile Spark applications against 1.0. There are > > a few API changes in this release. Here are links to the associated > > upgrade guides - user facing changes have been kept as small as > > possible. 
> > > > changes to ML vector specification: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 > > > > changes to the Java API: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > > > changes to the streaming API: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x > > > > changes to the GraphX API: > > > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 > > > > coGroup and related functions now return Iterable[T] instead of Seq[T] > > ==> Call toSeq on the result to restore the old behavior > > > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > > ==> Call toSeq on the result to restore old behavior >
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
Which of the unresolved bugs in spark-core do you think will require an API-breaking change to fix? If there are none of those, then we are still essentially on track for a 1.0.0 release. The number of contributions and pace of change now is quite high, but I don't think that waiting for the pace to slow before releasing 1.0 is viable. If Spark's short history is any guide to its near future, the pace will not slow by any significant amount for any noteworthy length of time, but rather will continue to increase. What we need to be aiming for, I think, is to have the great majority of those new contributions being made to MLlib, GraphX, SparkSQL and other areas of the code that we have clearly marked as not frozen in 1.x. I think we are already seeing that, but if I am just not recognizing breakage of our semantic versioning guarantee that will be forced on us by some pending changes, now would be a good time to set me straight. On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan wrote: > I had echoed similar sentiments a while back when there was a discussion > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api > changes, add missing functionality, go through a hardening release before > 1.0 > > But the community preferred a 1.0 :-) > > Regards, > Mridul > > On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > > > On this note, non-binding commentary: > > > > Releases happen in local minima of change, usually created by > > internally enforced code freeze. Spark is incredibly busy now due to > > external factors -- recently a TLP, recently discovered by a large new > > audience, ease of contribution enabled by Github. It's getting like > > the first year of mainstream battle-testing in a month. It's been very > > hard to freeze anything! I see a number of non-trivial issues being > > reported, and I don't think it has been possible to triage all of > > them, even. 
> > > > Given the high rate of change, my instinct would have been to release > > 0.10.0 now. But won't it always be very busy? I do think the rate of > > significant issues will slow down. > > > > Version ain't nothing but a number, but if it has any meaning it's the > > semantic versioning meaning. 1.0 imposes extra handicaps around > > striving to maintain backwards-compatibility. That may end up being > > bent to fit in important changes that are going to be required in this > > continuing period of change. Hadoop does this all the time > > unfortunately and gets away with it, I suppose -- minor version > > releases are really major. (On the other extreme, HBase is at 0.98 and > > quite production-ready.) > > > > Just consider this a second vote for focus on fixes and 1.0.x rather > > than new features and 1.x. I think there are a few steps that could > > streamline triage of this flood of contributions, and make all of this > > easier, but that's for another thread. > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra > wrote: > > > +1, but just barely. We've got quite a number of outstanding bugs > > > identified, and many of them have fixes in progress. I'd hate to see > those > > > efforts get lost in a post-1.0.0 flood of new features targeted at > 1.1.0 -- > > > in other words, I'd like to see 1.0.1 retain a high priority relative > to > > > 1.1.0. > > > > > > Looking through the unresolved JIRAs, it doesn't look like any of the > > > identified bugs are show-stoppers or strictly regressions (although I > will > > > note that one that I have in progress, SPARK-1749, is a bug that we > > > introduced with recent work -- it's not strictly a regression because > we > > > had equally bad but different behavior when the DAGScheduler exceptions > > > weren't previously being handled at all vs. being slightly mis-handled > > > now), so I'm not currently seeing a reason not to release. >
Re: [jira] [Created] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
I suspect this is an issue we have fixed internally here as part of a larger change - the issue we fixed was not a config issue but bugs in spark. Unfortunately, we plan to contribute this as part of 1.1. Regards, Mridul On 17-May-2014 4:09 pm, "sam (JIRA)" wrote: > sam created SPARK-1867: > -- > > Summary: Spark Documentation Error causes > java.lang.IllegalStateException: unread block data > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug > Reporter: sam > > > I've employed two System Administrators on a contract basis (for quite a > bit of money), and both contractors have independently hit the following > exception. What we are doing is: > > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on > the cluster > > I've also included code snippets and sbt deps at the bottom. > > When I've Googled this, there seem to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > > Now I know that (b) is not the problem, having successfully run the same > code on other clusters while only including one jar (it's a fat jar). > > But I have no idea how to check for (a) - it appears Spark doesn't have > any version checks or anything - it would be nice if it checked versions > and threw a "mismatching version exception: you have user code using > version X and node Y has version Z". > > I would be very grateful for advice on this. 
> > The exception: > > Exception in thread "main" org.apache.spark.SparkException: Job aborted: > Task 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at org.apache.spark.scheduler.DAGScheduler.org > $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > 
java.lang.IllegalStateException: unread block data [duplicate 59] > > My code snippet: > > val conf = new SparkConf() >.setMaster(clusterMaster) >.setAppName(appName) >.setSparkHome(sparkHome) >.setJars(SparkContext.jarOfClass(this.getClass)) > > println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) > > My SBT dependencies: > > // relevant > "org.apache.spark" % "spark-core_2.10" % "0.9.1", > "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", > > // standard, probably unrelated > "com.github.seratch" %% "awscala" % "[0.2,)", > "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", > "org.specs2" %% "specs2" % "1.14" % "test", > "org.scala-lang" % "scala-reflect" % "2.10.3", > "org.scalaz" %% "scalaz-core" % "7.0.5", > "net.minidev" % "json-smart" % "1.2" > > > > -- > This message was sent by Atlassian JIRA > (v6.2#6252) >
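The "mismatching version exception" the reporter wishes for did not exist in Spark 0.9.1, so the sketch below is purely illustrative: a minimal, hypothetical handshake check of the kind the report suggests. All class and method names here are invented for illustration; this is not Spark code.

```java
// Hypothetical sketch of a driver/executor version check.
// Not Spark code; all names are invented for illustration.
public class VersionCheck {
    // Compare major.minor only; patch-level differences are assumed compatible.
    public static String majorMinor(String version) {
        String[] parts = version.split("\\.");
        return parts[0] + "." + (parts.length > 1 ? parts[1] : "0");
    }

    public static void assertCompatible(String driverVersion, String executorVersion) {
        if (!majorMinor(driverVersion).equals(majorMinor(executorVersion))) {
            throw new IllegalStateException("Mismatching version: user code uses "
                    + driverVersion + " but node has " + executorVersion);
        }
    }

    public static void main(String[] args) {
        assertCompatible("0.9.1", "0.9.0");  // same major.minor: accepted
        try {
            assertCompatible("0.9.1", "1.0.0");  // rejected with a self-describing error
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

An explicit check along these lines would turn the opaque "unread block data" failure into an immediate, self-describing error, which is exactly what the report asks for.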
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api changes, add missing functionality, go through a hardening release before 1.0 But the community preferred a 1.0 :-) Regards, Mridul On 17-May-2014 3:19 pm, "Sean Owen" wrote: > > On this note, non-binding commentary: > > Releases happen in local minima of change, usually created by > internally enforced code freeze. Spark is incredibly busy now due to > external factors -- recently a TLP, recently discovered by a large new > audience, ease of contribution enabled by Github. It's getting like > the first year of mainstream battle-testing in a month. It's been very > hard to freeze anything! I see a number of non-trivial issues being > reported, and I don't think it has been possible to triage all of > them, even. > > Given the high rate of change, my instinct would have been to release > 0.10.0 now. But won't it always be very busy? I do think the rate of > significant issues will slow down. > > Version ain't nothing but a number, but if it has any meaning it's the > semantic versioning meaning. 1.0 imposes extra handicaps around > striving to maintain backwards-compatibility. That may end up being > bent to fit in important changes that are going to be required in this > continuing period of change. Hadoop does this all the time > unfortunately and gets away with it, I suppose -- minor version > releases are really major. (On the other extreme, HBase is at 0.98 and > quite production-ready.) > > Just consider this a second vote for focus on fixes and 1.0.x rather > than new features and 1.x. I think there are a few steps that could > streamline triage of this flood of contributions, and make all of this > easier, but that's for another thread. > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra wrote: > > +1, but just barely. 
We've got quite a number of outstanding bugs > > identified, and many of them have fixes in progress. I'd hate to see those > > efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- > > in other words, I'd like to see 1.0.1 retain a high priority relative to > > 1.1.0. > > > > Looking through the unresolved JIRAs, it doesn't look like any of the > > identified bugs are show-stoppers or strictly regressions (although I will > > note that one that I have in progress, SPARK-1749, is a bug that we > > introduced with recent work -- it's not strictly a regression because we > > had equally bad but different behavior when the DAGScheduler exceptions > > weren't previously being handled at all vs. being slightly mis-handled > > now), so I'm not currently seeing a reason not to release.
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
On this note, non-binding commentary: Releases happen in local minima of change, usually created by internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's getting like the first year of mainstream battle-testing in a month. It's been very hard to freeze anything! I see a number of non-trivial issues being reported, and I don't think it has been possible to triage all of them, even. Given the high rate of change, my instinct would have been to release 0.10.0 now. But won't it always be very busy? I do think the rate of significant issues will slow down. Version ain't nothing but a number, but if it has any meaning it's the semantic versioning meaning. 1.0 imposes extra handicaps around striving to maintain backwards-compatibility. That may end up being bent to fit in important changes that are going to be required in this continuing period of change. Hadoop does this all the time unfortunately and gets away with it, I suppose -- minor version releases are really major. (On the other extreme, HBase is at 0.98 and quite production-ready.) Just consider this a second vote for focus on fixes and 1.0.x rather than new features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread. On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra wrote: > +1, but just barely. We've got quite a number of outstanding bugs > identified, and many of them have fixes in progress. I'd hate to see those > efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- > in other words, I'd like to see 1.0.1 retain a high priority relative to > 1.1.0. 
> > Looking through the unresolved JIRAs, it doesn't look like any of the > identified bugs are show-stoppers or strictly regressions (although I will > note that one that I have in progress, SPARK-1749, is a bug that we > introduced with recent work -- it's not strictly a regression because we > had equally bad but different behavior when the DAGScheduler exceptions > weren't previously being handled at all vs. being slightly mis-handled > now), so I'm not currently seeing a reason not to release.
Re: [jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
We don't have 3x replication in spark :-) And if we use replicated storagelevel, while decreasing odds of failure, it does not eliminate it (since we are not doing a great job with replication anyway from fault tolerance point of view). Also it does take a nontrivial performance hit with replicated levels. Regards, Mridul On 17-May-2014 8:16 am, "Xiangrui Meng" wrote: > With 3x replication, we should be able to achieve fault tolerance. > This checkPointed RDD can be cleared if we have another in-memory > checkPointed RDD down the line. It can avoid hitting disk if we have > enough memory to use. We need to investigate more to find a good > solution. -Xiangrui > > On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan > wrote: > > Effectively this is persist without fault tolerance. > > Failure of any node means complete lack of fault tolerance. > > I would be very skeptical of truncating lineage if it is not reliable. > > On 17-May-2014 3:49 am, "Xiangrui Meng (JIRA)" wrote: > > > >> Xiangrui Meng created SPARK-1855: > >> > >> > >> Summary: Provide memory-and-local-disk RDD checkpointing > >> Key: SPARK-1855 > >> URL: https://issues.apache.org/jira/browse/SPARK-1855 > >> Project: Spark > >> Issue Type: New Feature > >> Components: MLlib, Spark Core > >> Affects Versions: 1.0.0 > >> Reporter: Xiangrui Meng > >> > >> > >> Checkpointing is used to cut long lineage while maintaining fault > >> tolerance. The current implementation is HDFS-based. Using the BlockRDD > we > >> can create in-memory-and-local-disk (with replication) checkpoints that > are > >> not as reliable as HDFS-based solution but faster. > >> > >> It can help applications that require many iterations. > >> > >> > >> > >> -- > >> This message was sent by Atlassian JIRA > >> (v6.2#6252) > >> >
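Mridul's point that replication lowers but does not eliminate the odds of loss can be made concrete with a back-of-envelope calculation. This is not Spark code, and the model (independent node failures with a fixed probability p, no re-replication after a failure) is an assumption for illustration only.

```java
// Back-of-envelope sketch: probability of losing an in-memory block when each
// of its r replicas sits on a node that fails independently with probability p.
// Assumptions (not from the thread): independent failures, no re-replication.
public class ReplicationOdds {
    public static double blockLoss(double p, int r) {
        return Math.pow(p, r);  // the block is lost only if every replica's node fails
    }

    // With n blocks, the chance that at least one block is lost grows with n.
    public static double anyBlockLoss(double p, int r, int n) {
        return 1.0 - Math.pow(1.0 - blockLoss(p, r), n);
    }

    public static void main(String[] args) {
        double p = 0.01;  // assumed per-node failure probability during the job
        for (int r = 1; r <= 3; r++) {
            System.out.printf("r=%d blockLoss=%.2e anyOf10kBlocks=%.2e%n",
                    r, blockLoss(p, r), anyBlockLoss(p, r, 10_000));
        }
    }
}
```

Replication shrinks the per-block odds dramatically, yet across many blocks and long jobs the aggregate risk stays nonzero, which is the reservation expressed above about truncating lineage on top of it.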
Re: Calling external classes added by sc.addJar needs to be through reflection
Can you try moving your mapPartitions to another class/object which is referenced only after sc.addJar ? I would suspect CNFEx is coming while loading the class containing mapPartitions before addJars is executed. In general though, dynamic loading of classes means you use reflection to instantiate it since the expectation is you don't know which implementation provides the interface ... If you statically know it a priori, you bundle it in your classpath. Regards Mridul On 17-May-2014 7:28 am, "DB Tsai" wrote: > Finally found a way out of the ClassLoader maze! It took me some time to > understand how it works; I think it's worth documenting in a separate > thread. > > We're trying to add an external utility.jar which contains CSVRecordParser, > and we added the jar to executors through the sc.addJar API. > > If the instance of CSVRecordParser is created without reflection, it > raises a *ClassNotFoundException*. > > data.mapPartitions(lines => { > val csvParser = new CSVRecordParser(delimiter.charAt(0)) > lines.foreach(line => { > val lineElems = csvParser.parseLine(line) > }) > ... > ... > ) > > > If the instance of CSVRecordParser is created through reflection, it works. > > data.mapPartitions(lines => { > val loader = Thread.currentThread.getContextClassLoader > val CSVRecordParser = > loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser") > > val csvParser = CSVRecordParser.getConstructor(Character.TYPE) > .newInstance(delimiter.charAt(0).asInstanceOf[Character]) > > val parseLine = CSVRecordParser > .getDeclaredMethod("parseLine", classOf[String]) > > lines.foreach(line => { > val lineElems = parseLine.invoke(csvParser, > line).asInstanceOf[Array[String]] > }) > ... > ... 
) > > > This is identical to this question: > > http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection > > It's not intuitive for users to load external classes through reflection, > but the couple of available solutions, including 1) messing around with the > systemClassLoader by calling systemClassLoader.addURL through reflection or > 2) forking another JVM to add jars into the classpath before the bootstrap loader, > are very tricky. > > Any thoughts on fixing it properly? > > @Xiangrui, > the netlib-java jniloader is loaded from netlib-java through reflection, so > this problem will not be seen. > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai >
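The pattern DB Tsai ends up with (resolve the class by name through the thread context classloader, then drive it via reflection) can be demonstrated without Spark at all. A minimal Java sketch, with java.util.StringJoiner standing in for the dynamically added CSVRecordParser; the class name and method are the only moving parts:

```java
import java.lang.reflect.Method;

// Minimal demonstration of the reflection pattern from the thread: no static
// reference to the target class; it is resolved by name at runtime through the
// thread context classloader. StringJoiner stands in for CSVRecordParser.
public class ReflectiveLoad {
    public static String demo() {
        try {
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            Class<?> cls = loader.loadClass("java.util.StringJoiner");
            Object joiner = cls.getConstructor(CharSequence.class).newInstance(",");
            Method add = cls.getDeclaredMethod("add", CharSequence.class);
            add.invoke(joiner, "a");
            add.invoke(joiner, "b");
            return joiner.toString();
        } catch (ReflectiveOperationException e) {
            return "error: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints "a,b"
    }
}
```

Inside a Spark closure the same shape applies; as Mridul suggests above, the difference between the failing and working snippets comes down to which classloader resolves the name, and when.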
[VOTE] Release Apache Spark 1.0.0 (rc9)
Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has one bug fix and one minor feature on top of rc8: SPARK-1864: https://github.com/apache/spark/pull/808 SPARK-1808: https://github.com/apache/spark/pull/799 The tag to be voted on is v1.0.0-rc9 (commit 920f947): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1017/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Tuesday, May 20, at 08:56 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. 
changes to ML vector specification: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 changes to the Java API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark changes to the streaming API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x changes to the GraphX API: http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 coGroup and related functions now return Iterable[T] instead of Seq[T] ==> Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] ==> Call toSeq on the result to restore old behavior
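For callers affected by the coGroup change, the migration is mechanical: grouped values can no longer be indexed directly, so they must be materialized first (`.toSeq` in Scala, as the note says). A plain-Java analogue of that conversion, with no Spark involved, just to show what materializing an Iterable means:

```java
import java.util.ArrayList;
import java.util.List;

// Analogue of Scala's iterable.toSeq: copy an Iterable into a List so it can
// be indexed and traversed repeatedly. Illustrative only; not Spark code.
public class IterableToSeq {
    public static <T> List<T> toSeq(Iterable<T> values) {
        List<T> out = new ArrayList<>();
        for (T v : values) {
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        Iterable<Integer> grouped = java.util.Arrays.asList(1, 2, 3); // stand-in for a cogroup value
        List<Integer> seq = toSeq(grouped);
        System.out.println(seq.get(0) + " of " + seq.size());  // indexing and size now work
    }
}
```

The cost of the copy is the point of the API change: returning Iterable lets Spark avoid materializing large groups unless the caller explicitly asks for it.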
Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
I'll start the voting with a +1. On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > This has one bug fix and one minor feature on top of rc8: > SPARK-1864: https://github.com/apache/spark/pull/808 > SPARK-1808: https://github.com/apache/spark/pull/799 > > The tag to be voted on is v1.0.0-rc9 (commit 920f947): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75 > > The release files, including signatures, digests, etc. can be found at: > http://people.apache.org/~pwendell/spark-1.0.0-rc9/ > > Release artifacts are signed with the following key: > https://people.apache.org/keys/committer/pwendell.asc > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1017/ > > The documentation corresponding to this release can be found at: > http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/ > > Please vote on releasing this package as Apache Spark 1.0.0! > > The vote is open until Tuesday, May 20, at 08:56 UTC and passes if > a majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release this package as Apache Spark 1.0.0 > [ ] -1 Do not release this package because ... > > To learn more about Apache Spark, please see > http://spark.apache.org/ > > == API Changes == > We welcome users to compile Spark applications against 1.0. There are > a few API changes in this release. Here are links to the associated > upgrade guides - user facing changes have been kept as small as > possible. 
> > changes to ML vector specification: > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 > > changes to the Java API: > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark > > changes to the streaming API: > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x > > changes to the GraphX API: > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 > > coGroup and related functions now return Iterable[T] instead of Seq[T] > ==> Call toSeq on the result to restore the old behavior > > SparkContext.jarOfClass returns Option[String] instead of Seq[String] > ==> Call toSeq on the result to restore old behavior
[RESULT] [VOTE] Release Apache Spark 1.0.0 (rc8)
Cancelled in favor of rc9. On Sat, May 17, 2014 at 12:51 AM, Patrick Wendell wrote: > Due to the issue discovered by Michael, this vote is cancelled in favor of > rc9. > > On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust > wrote: >> -1 >> >> We found a regression in the way configuration is passed to executors. >> >> https://issues.apache.org/jira/browse/SPARK-1864 >> https://github.com/apache/spark/pull/808 >> >> Michael >> >> >> On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra >> wrote: >>> >>> +1 >>> >>> >>> On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell >>> wrote: >>> >>> > [Due to ASF e-mail outage, I'm not sure if anyone will actually receive >>> > this.] >>> > >>> > Please vote on releasing the following candidate as Apache Spark version >>> > 1.0.0! >>> > This has only minor changes on top of rc7. >>> > >>> > The tag to be voted on is v1.0.0-rc8 (commit 80eea0f): >>> > >>> > >>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc >>> > >>> > The release files, including signatures, digests, etc. can be found at: >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8/ >>> > >>> > Release artifacts are signed with the following key: >>> > https://people.apache.org/keys/committer/pwendell.asc >>> > >>> > The staging repository for this release can be found at: >>> > https://repository.apache.org/content/repositories/orgapachespark-1016/ >>> > >>> > The documentation corresponding to this release can be found at: >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/ >>> > >>> > Please vote on releasing this package as Apache Spark 1.0.0! >>> > >>> > The vote is open until Monday, May 19, at 10:15 UTC and passes if a >>> > majority of at least 3 +1 PMC votes are cast. >>> > >>> > [ ] +1 Release this package as Apache Spark 1.0.0 >>> > [ ] -1 Do not release this package because ... 
>>> > >>> > To learn more about Apache Spark, please see >>> > http://spark.apache.org/ >>> > >>> > == API Changes == >>> > We welcome users to compile Spark applications against 1.0. There are >>> > a few API changes in this release. Here are links to the associated >>> > upgrade guides - user facing changes have been kept as small as >>> > possible. >>> > >>> > changes to ML vector specification: >>> > >>> > >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10 >>> > >>> > changes to the Java API: >>> > >>> > >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark >>> > >>> > changes to the streaming API: >>> > >>> > >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x >>> > >>> > changes to the GraphX API: >>> > >>> > >>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 >>> > >>> > coGroup and related functions now return Iterable[T] instead of Seq[T] >>> > ==> Call toSeq on the result to restore the old behavior >>> > >>> > SparkContext.jarOfClass returns Option[String] instead of Seq[String] >>> > ==> Call toSeq on the result to restore old behavior >>> > >> >>
Re: [VOTE] Release Apache Spark 1.0.0 (rc8)
Due to the issue discovered by Michael, this vote is cancelled in favor of rc9.

On Fri, May 16, 2014 at 6:22 PM, Michael Armbrust wrote:
> -1
>
> We found a regression in the way configuration is passed to executors.
>
> https://issues.apache.org/jira/browse/SPARK-1864
> https://github.com/apache/spark/pull/808
>
> Michael
>
> On Fri, May 16, 2014 at 3:57 PM, Mark Hamstra wrote:
>> +1
>>
>> On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell wrote:
>> > [Due to the ASF e-mail outage, I'm not sure if anyone will actually
>> > receive this.]
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> > version 1.0.0!
>> > This has only minor changes on top of rc7.
>> >
>> > The tag to be voted on is v1.0.0-rc8 (commit 80eea0f):
>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=80eea0f111c06260ffaa780d2f3f7facd09c17bc
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1016/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.0.0!
>> >
>> > The vote is open until Monday, May 19, at 10:15 UTC and passes if a
>> > majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.0.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > == API Changes ==
>> > We welcome users to compile Spark applications against 1.0. There are
>> > a few API changes in this release. Here are links to the associated
>> > upgrade guides - user-facing changes have been kept as small as
>> > possible.
>> >
>> > Changes to the MLlib vector specification:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib-guide.html#from-09-to-10
>> >
>> > Changes to the Java API:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>> >
>> > Changes to the streaming API:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>> >
>> > Changes to the GraphX API:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>> >
>> > coGroup and related functions now return Iterable[T] instead of Seq[T]
>> > ==> Call toSeq on the result to restore the old behavior
>> >
>> > SparkContext.jarOfClass returns Option[String] instead of Seq[String]
>> > ==> Call toSeq on the result to restore the old behavior
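For readers following the two collection-level changes in the upgrade notes above, a minimal plain-Scala sketch of the migration (no Spark cluster needed; the values below are hypothetical stand-ins for what a cogroup result and SparkContext.jarOfClass would return):

```scala
object MigrationSketch {
  def main(args: Array[String]): Unit = {
    // cogroup-style results are Iterable[T] in 1.0 rather than Seq[T].
    // Code that relied on Seq-only operations (indexing, length, etc.)
    // can call toSeq to restore the pre-1.0 behavior.
    val groupedValues: Iterable[Int] = List(10, 20, 30) // stand-in for a cogroup value
    val asSeq: Seq[Int] = groupedValues.toSeq
    println(asSeq(1)) // Seq indexing works again

    // SparkContext.jarOfClass now returns Option[String] instead of
    // Seq[String]; toSeq turns Some(jar) into Seq(jar) and None into Seq().
    val jarFound: Option[String] = Some("/tmp/app.jar") // hypothetical result
    val jarSeq: Seq[String] = jarFound.toSeq
    println(jarSeq.length)
  }
}
```

The same one-line toSeq fix covers both cases because both Iterable and Option expose toSeq in the Scala standard library.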