Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 (non-binding, doc and packaging issues aside) Built from source, ran jobs and spark-shell against a pseudo-distributed YARN cluster. On Sun, Mar 8, 2015 at 2:42 PM, Krishna Sankar wrote: > Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop > Distributions X ... > > May be one option is to have a minimum basic set (which I know is what we > are discussing) and move the rest to spark-packages.org. There the vendors > can add the latest downloads - for example when 1.4 is released, HDP can > build a release of HDP Spark 1.4 bundle. > > Cheers > > > On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell > wrote: > > > We probably want to revisit the way we do binaries in general for > > 1.4+. IMO, something worth forking a separate thread for. > > > > I've been hesitating to add new binaries because people > > (understandably) complain if you ever stop packaging older ones, but > > on the other hand the ASF has complained that we have too many > > binaries already and that we need to pare it down because of the large > > volume of files. Doubling the number of binaries we produce for Scala > > 2.11 seemed like it would be too much. > > > > One solution potentially is to actually package "Hadoop provided" > > binaries and encourage users to use these by simply setting > > HADOOP_HOME, or have instructions for specific distros. I've heard > > that our existing packages don't work well on HDP for instance, since > > there are some configuration quirks that differ from the upstream > > Hadoop. > > > > If we cut down on the cross building for Hadoop versions, then it is > > more tenable to cross build for Scala versions without exploding the > > number of binaries. > > > > - Patrick > > > > On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: > > > Yeah, interesting question of what is the better default for the > > > single set of artifacts published to Maven. I think there's an > > > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros > > > and cons discussed more at > > > > > > https://issues.apache.org/jira/browse/SPARK-5134 > > > https://github.com/apache/spark/pull/3917 > > > > > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia > > > wrote: > > >> +1 > > >> > > >> Tested it on Mac OS X. > > >> > > >> One small issue I noticed is that the Scala 2.11 build is using Hadoop > > 1 without Hive, which is kind of weird because people will more likely > want > > Hadoop 2 with Hive. So it would be good to publish a build for that > > configuration instead. We can do it if we do a new RC, or it might be > that > > binary builds may not need to be voted on (I forgot the details there). > > >> > > >> Matei > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > > For additional commands, e-mail: dev-h...@spark.apache.org > > > > >
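For reference, a release smoke test of this sort can be as simple as a word count run from spark-shell against the cluster; the snippet below is only an illustrative sketch (the HDFS path is a placeholder, not something from the vote thread), in the spirit of "ran jobs and spark-shell":

// Illustrative smoke test for a release candidate, run from spark-shell
// (e.g. started with --master yarn-client on Spark 1.3). The input path is
// a placeholder; `sc` is the SparkContext the shell provides.
val counts = sc.textFile("hdfs:///tmp/words.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Pull a few results back to the driver to confirm the job ran end to end.
counts.take(10).foreach(println)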
Re: Block Transfer Service encryption support
Hey Patrick, Yes, I will open a Jira tomorrow... For now my implementation is a basic SSL implementation for the TransportServer and TransportClient.. I will type up the design and at the same time look at the Hadoop impl for possible improvements... Cheers! Jeff On Sun, Mar 8, 2015 at 5:51 PM, Patrick Wendell wrote: > I think that yes, longer term we want to have encryption of all > communicated data. However Jeff, can you open a JIRA to discuss the > design before opening a pull request (it's fine to link to a WIP > branch if you'd like)? I'd like to better understand the performance > and operational complexity of using SSL for this in comparison with > alternatives. It would also be good to look at how the Hadoop > encryption works for their shuffle service, in terms of the design > decisions made there. > > - Patrick > > On Sun, Mar 8, 2015 at 5:42 PM, Jeff Turpin wrote: > > I have already written most of the code, just finishing up the unit tests > > right now... > > > > Jeff > > > > > > On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash wrote: > > > >> I'm interested in seeing this data transfer occurring over encrypted > >> communication channels as well. Many customers require that all network > >> transfer occur encrypted to prevent the "soft underbelly" that's often > >> found inside a corporate network. > >> > >> On Fri, Mar 6, 2015 at 4:20 PM, turp1twin wrote: > >> > >>> Is there a plan to implement SSL support for the Block Transfer Service > >>> (specifically, the NettyBlockTransferService implementation)? I can > >>> volunteer if needed... > >>> > >>> Jeff > >>> > >>> > >>> > >>> > >>> -- > >>> View this message in context: > >>> > http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html > >>> Sent from the Apache Spark Developers List mailing list archive at > >>> Nabble.com. > >>> > >>> - > >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > >>> For additional commands, e-mail: dev-h...@spark.apache.org > >>> > >>> > >> >
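For context, the Netty-level approach under discussion amounts to installing an SslHandler at the front of the channel pipeline on both ends of the transport. The sketch below is only a rough illustration of that idea, with assumed names and keystore paths; it is not Jeff's patch and not Spark's actual TransportServer/TransportClient code:

import io.netty.channel.ChannelPipeline
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.SslHandler
import javax.net.ssl.{KeyManagerFactory, SSLContext}
import java.io.FileInputStream
import java.security.KeyStore

// Hypothetical sketch: how TLS can be layered under an existing Netty
// transport. Class and method names here are placeholders.
object TlsPipelineSketch {

  // Build a JSSE SSLContext from a JKS keystore (server side).
  def buildSslContext(keyStorePath: String, password: Array[Char]): SSLContext = {
    val ks = KeyStore.getInstance("JKS")
    val in = new FileInputStream(keyStorePath)
    try ks.load(in, password) finally in.close()

    val kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm)
    kmf.init(ks, password)

    val ctx = SSLContext.getInstance("TLS")
    ctx.init(kmf.getKeyManagers, null, null)
    ctx
  }

  // Put the SslHandler first so every byte on the wire is encrypted,
  // including block transfers handled further down the pipeline.
  def addTls(ch: SocketChannel, sslCtx: SSLContext, isClient: Boolean): Unit = {
    val engine = sslCtx.createSSLEngine()
    engine.setUseClientMode(isClient)
    val pipeline: ChannelPipeline = ch.pipeline()
    pipeline.addFirst("ssl", new SslHandler(engine))
    // ... existing transport handlers (framing, message codec) follow ...
  }
}

The operational questions Patrick raises (handshake cost, keystore management, comparison with Hadoop's shuffle encryption) sit on top of this basic wiring, which is why a design discussion on JIRA before a pull request makes sense.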
GSoC 2015
Hi Spark devs! I'm writing regarding your GSoC 2015 project idea. I'm a graduate student with experience in Python and discrete mathematics. I'm interested in machine learning and understand some of its basic concepts. Could someone elaborate on the goals for Spark's GSoC project? My understanding is that Manoj Kumar is the mentor, though I may be incorrect. I have been reading the Spark codebase on GitHub and think I could help develop Spark's Python API. What next steps should I take to get involved? Thanks! David J. Manglano, Masters Program in Computer Science, University of Chicago
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4. Matei > On Mar 8, 2015, at 8:02 PM, Sean Owen wrote: > > Yeah it's not much overhead, but here's an example of where it causes > a little issue. > > I like that reasoning. However, the released builds don't track the > later versions of Hadoop that vendors would be distributing -- there's > no Hadoop 2.6 build for example. CDH4 is here, but not the > far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't > actually work with many CDH4 versions. > > I agree with the goal of maximizing the reach of Spark, but I don't > know how much these builds advance that goal. > > Anyone can roll-their-own exactly-right build, and the docs and build > have been set up to make that as simple as can be expected. So these > aren't *required* to let me use latest Spark on distribution X. > > I had thought these existed to sorta support 'legacy' distributions, > like CDH4, and that build was justified as a > quasi-Hadoop-2.0.x-flavored build. But then I don't understand what > the MapR profiles are for. > > I think it's too much work to correctly, in parallel, maintain any > customizations necessary for any major distro, and it might be best to > do not at all than to do it incompletely. You could say it's also an > enabler for distros to vary in ways that require special > customization. > > Maybe there's a concern that, if lots of people consume Spark on > Hadoop, and most people consume Hadoop through distros, and distros > alone manage Spark distributions, then you de facto 'have to' go > through a distro instead of get bits from Spark? Different > conversation but I think this sort of effect does not end up being a > negative. > > Well anyway, I like the idea of seeing how far Hadoop-provided > releases can help. It might kill several birds with one stone. > > On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia > wrote: >> Our goal is to let people use the latest Apache release even if vendors fall >> behind or don't want to package everything, so that's why we put out >> releases for vendors' versions. It's fairly low overhead. >> >> Matei >> >>> On Mar 8, 2015, at 5:56 PM, Sean Owen wrote: >>> >>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball >>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the >>> Maven artifacts. >>> >>> Patrick I see you just commented on SPARK-5134 and will follow up >>> there. Sounds like this may accidentally not be a problem. >>> >>> On binary tarball releases, I wonder if anyone has an opinion on my >>> opinion that these shouldn't be distributed for specific Hadoop >>> *distributions* to begin with. (Won't repeat the argument here yet.) >>> That resolves this n x m explosion too. >>> >>> Vendors already provide their own distribution, yes, that's their job. >>> >>> >>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar wrote: Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop Distributions X ... May be one option is to have a minimum basic set (which I know is what we are discussing) and move the rest to spark-packages.org. 
There the vendors can add the latest downloads - for example when 1.4 is released, HDP can build a release of HDP Spark 1.4 bundle. Cheers On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: > > We probably want to revisit the way we do binaries in general for > 1.4+. IMO, something worth forking a separate thread for. > > I've been hesitating to add new binaries because people > (understandably) complain if you ever stop packaging older ones, but > on the other hand the ASF has complained that we have too many > binaries already and that we need to pare it down because of the large > volume of files. Doubling the number of binaries we produce for Scala > 2.11 seemed like it would be too much. > > One solution potentially is to actually package "Hadoop provided" > binaries and encourage users to use these by simply setting > HADOOP_HOME, or have instructions for specific distros. I've heard > that our existing packages don't work well on HDP for instance, since > there are some configuration quirks that differ from the upstream > Hadoop. > > If we cut down on the cross building for Hadoop versions, then it is > more tenable to cross build for Scala versions without exploding the > number of binaries. > > - Patrick > > On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: >> Yeah, interesting question of what is th
Re: Block Transfer Service encryption support
I think that yes, longer term we want to have encryption of all communicated data. However Jeff, can you open a JIRA to discuss the design before opening a pull request (it's fine to link to a WIP branch if you'd like)? I'd like to better understand the performance and operational complexity of using SSL for this in comparison with alternatives. It would also be good to look at how the Hadoop encryption works for their shuffle service, in terms of the design decisions made there. - Patrick On Sun, Mar 8, 2015 at 5:42 PM, Jeff Turpin wrote: > I have already written most of the code, just finishing up the unit tests > right now... > > Jeff > > > On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash wrote: > >> I'm interested in seeing this data transfer occurring over encrypted >> communication channels as well. Many customers require that all network >> transfer occur encrypted to prevent the "soft underbelly" that's often >> found inside a corporate network. >> >> On Fri, Mar 6, 2015 at 4:20 PM, turp1twin wrote: >> >>> Is there a plan to implement SSL support for the Block Transfer Service >>> (specifically, the NettyBlockTransferService implementation)? I can >>> volunteer if needed... >>> >>> Jeff >>> >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html >>> Sent from the Apache Spark Developers List mailing list archive at >>> Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >>> >> - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Block Transfer Service encryption support
I have already written most of the code, just finishing up the unit tests right now... Jeff On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash wrote: > I'm interested in seeing this data transfer occurring over encrypted > communication channels as well. Many customers require that all network > transfer occur encrypted to prevent the "soft underbelly" that's often > found inside a corporate network. > > On Fri, Mar 6, 2015 at 4:20 PM, turp1twin wrote: > >> Is there a plan to implement SSL support for the Block Transfer Service >> (specifically, the NettyBlockTransferService implementation)? I can >> volunteer if needed... >> >> Jeff >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html >> Sent from the Apache Spark Developers List mailing list archive at >> Nabble.com. >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >
Re: Block Transfer Service encryption support
I'm interested in seeing this data transfer occurring over encrypted communication channels as well. Many customers require that all network transfer occur encrypted to prevent the "soft underbelly" that's often found inside a corporate network. On Fri, Mar 6, 2015 at 4:20 PM, turp1twin wrote: > Is there a plan to implement SSL support for the Block Transfer Service > (specifically, the NettyBlockTransferService implementation)? I can > volunteer if needed... > > Jeff > > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Yeah it's not much overhead, but here's an example of where it causes a little issue. I like that reasoning. However, the released builds don't track the later versions of Hadoop that vendors would be distributing -- there's no Hadoop 2.6 build for example. CDH4 is here, but not the far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't actually work with many CDH4 versions. I agree with the goal of maximizing the reach of Spark, but I don't know how much these builds advance that goal. Anyone can roll-their-own exactly-right build, and the docs and build have been set up to make that as simple as can be expected. So these aren't *required* to let me use latest Spark on distribution X. I had thought these existed to sorta support 'legacy' distributions, like CDH4, and that build was justified as a quasi-Hadoop-2.0.x-flavored build. But then I don't understand what the MapR profiles are for. I think it's too much work to correctly, in parallel, maintain any customizations necessary for any major distro, and it might be best to do not at all than to do it incompletely. You could say it's also an enabler for distros to vary in ways that require special customization. Maybe there's a concern that, if lots of people consume Spark on Hadoop, and most people consume Hadoop through distros, and distros alone manage Spark distributions, then you de facto 'have to' go through a distro instead of get bits from Spark? Different conversation but I think this sort of effect does not end up being a negative. Well anyway, I like the idea of seeing how far Hadoop-provided releases can help. It might kill several birds with one stone. On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia wrote: > Our goal is to let people use the latest Apache release even if vendors fall > behind or don't want to package everything, so that's why we put out releases > for vendors' versions. It's fairly low overhead. > > Matei > >> On Mar 8, 2015, at 5:56 PM, Sean Owen wrote: >> >> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball >> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the >> Maven artifacts. >> >> Patrick I see you just commented on SPARK-5134 and will follow up >> there. Sounds like this may accidentally not be a problem. >> >> On binary tarball releases, I wonder if anyone has an opinion on my >> opinion that these shouldn't be distributed for specific Hadoop >> *distributions* to begin with. (Won't repeat the argument here yet.) >> That resolves this n x m explosion too. >> >> Vendors already provide their own distribution, yes, that's their job. >> >> >> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar wrote: >>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop >>> Distributions X ... >>> >>> May be one option is to have a minimum basic set (which I know is what we >>> are discussing) and move the rest to spark-packages.org. There the vendors >>> can add the latest downloads - for example when 1.4 is released, HDP can >>> build a release of HDP Spark 1.4 bundle. >>> >>> Cheers >>> >>> >>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: We probably want to revisit the way we do binaries in general for 1.4+. IMO, something worth forking a separate thread for. I've been hesitating to add new binaries because people (understandably) complain if you ever stop packaging older ones, but on the other hand the ASF has complained that we have too many binaries already and that we need to pare it down because of the large volume of files. 
Doubling the number of binaries we produce for Scala 2.11 seemed like it would be too much. One solution potentially is to actually package "Hadoop provided" binaries and encourage users to use these by simply setting HADOOP_HOME, or have instructions for specific distros. I've heard that our existing packages don't work well on HDP for instance, since there are some configuration quirks that differ from the upstream Hadoop. If we cut down on the cross building for Hadoop versions, then it is more tenable to cross build for Scala versions without exploding the number of binaries. - Patrick On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: > Yeah, interesting question of what is the better default for the > single set of artifacts published to Maven. I think there's an > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros > and cons discussed more at > > https://issues.apache.org/jira/browse/SPARK-5134 > https://github.com/apache/spark/pull/3917 > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia > wrote: >> +1 >> >> Tested it on Mac OS X. >> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop >> 1 without Hive, which is kind of weird because people will more likely >> want >> Hadoop
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
I think it's important to separate the goals from the implementation. I agree with Matei on the goal - I think the goal needs to be to allow people to download Apache Spark and use it with CDH, HDP, MapR, whatever... This is the whole reason why HDFS and YARN have stable API's, so that other projects can build on them in a way that works across multiple versions. I wouldn't want to force users to upgrade according only to some vendor timetable, that doesn't seem from the ASF perspective like a good thing for the project. If users want to get packages from Bigtop, or the vendors, that's totally fine too. My point earlier was - I am not sure we are actually accomplishing that goal now, because I've heard in some cases our "Hadoop 2.X" packages actually don't work on certain distributions, even those that are based on that Hadoop version. So one solution is to move towards "bring your own Hadoop" binaries and have users just set HADOOP_HOME and maybe document any vendor-specific configs that need to be set. That also happens to solve the "too many binaries" problem, but only incidentally. - Patrick On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia wrote: > Our goal is to let people use the latest Apache release even if vendors fall > behind or don't want to package everything, so that's why we put out releases > for vendors' versions. It's fairly low overhead. > > Matei > >> On Mar 8, 2015, at 5:56 PM, Sean Owen wrote: >> >> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball >> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the >> Maven artifacts. >> >> Patrick I see you just commented on SPARK-5134 and will follow up >> there. Sounds like this may accidentally not be a problem. >> >> On binary tarball releases, I wonder if anyone has an opinion on my >> opinion that these shouldn't be distributed for specific Hadoop >> *distributions* to begin with. (Won't repeat the argument here yet.) >> That resolves this n x m explosion too. >> >> Vendors already provide their own distribution, yes, that's their job. >> >> >> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar wrote: >>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop >>> Distributions X ... >>> >>> May be one option is to have a minimum basic set (which I know is what we >>> are discussing) and move the rest to spark-packages.org. There the vendors >>> can add the latest downloads - for example when 1.4 is released, HDP can >>> build a release of HDP Spark 1.4 bundle. >>> >>> Cheers >>> >>> >>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: We probably want to revisit the way we do binaries in general for 1.4+. IMO, something worth forking a separate thread for. I've been hesitating to add new binaries because people (understandably) complain if you ever stop packaging older ones, but on the other hand the ASF has complained that we have too many binaries already and that we need to pare it down because of the large volume of files. Doubling the number of binaries we produce for Scala 2.11 seemed like it would be too much. One solution potentially is to actually package "Hadoop provided" binaries and encourage users to use these by simply setting HADOOP_HOME, or have instructions for specific distros. I've heard that our existing packages don't work well on HDP for instance, since there are some configuration quirks that differ from the upstream Hadoop. 
If we cut down on the cross building for Hadoop versions, then it is more tenable to cross build for Scala versions without exploding the number of binaries. - Patrick On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: > Yeah, interesting question of what is the better default for the > single set of artifacts published to Maven. I think there's an > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros > and cons discussed more at > > https://issues.apache.org/jira/browse/SPARK-5134 > https://github.com/apache/spark/pull/3917 > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia > wrote: >> +1 >> >> Tested it on Mac OS X. >> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop >> 1 without Hive, which is kind of weird because people will more likely >> want >> Hadoop 2 with Hive. So it would be good to publish a build for that >> configuration instead. We can do it if we do a new RC, or it might be >> that >> binary builds may not need to be voted on (I forgot the details there). >> >> Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org >>> > - To unsubscribe
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead. Matei > On Mar 8, 2015, at 5:56 PM, Sean Owen wrote: > > Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball > at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the > Maven artifacts. > > Patrick I see you just commented on SPARK-5134 and will follow up > there. Sounds like this may accidentally not be a problem. > > On binary tarball releases, I wonder if anyone has an opinion on my > opinion that these shouldn't be distributed for specific Hadoop > *distributions* to begin with. (Won't repeat the argument here yet.) > That resolves this n x m explosion too. > > Vendors already provide their own distribution, yes, that's their job. > > > On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar wrote: >> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop >> Distributions X ... >> >> May be one option is to have a minimum basic set (which I know is what we >> are discussing) and move the rest to spark-packages.org. There the vendors >> can add the latest downloads - for example when 1.4 is released, HDP can >> build a release of HDP Spark 1.4 bundle. >> >> Cheers >> >> >> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: >>> >>> We probably want to revisit the way we do binaries in general for >>> 1.4+. IMO, something worth forking a separate thread for. >>> >>> I've been hesitating to add new binaries because people >>> (understandably) complain if you ever stop packaging older ones, but >>> on the other hand the ASF has complained that we have too many >>> binaries already and that we need to pare it down because of the large >>> volume of files. Doubling the number of binaries we produce for Scala >>> 2.11 seemed like it would be too much. >>> >>> One solution potentially is to actually package "Hadoop provided" >>> binaries and encourage users to use these by simply setting >>> HADOOP_HOME, or have instructions for specific distros. I've heard >>> that our existing packages don't work well on HDP for instance, since >>> there are some configuration quirks that differ from the upstream >>> Hadoop. >>> >>> If we cut down on the cross building for Hadoop versions, then it is >>> more tenable to cross build for Scala versions without exploding the >>> number of binaries. >>> >>> - Patrick >>> >>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: Yeah, interesting question of what is the better default for the single set of artifacts published to Maven. I think there's an argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros and cons discussed more at https://issues.apache.org/jira/browse/SPARK-5134 https://github.com/apache/spark/pull/3917 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia wrote: > +1 > > Tested it on Mac OS X. > > One small issue I noticed is that the Scala 2.11 build is using Hadoop > 1 without Hive, which is kind of weird because people will more likely > want > Hadoop 2 with Hive. So it would be good to publish a build for that > configuration instead. We can do it if we do a new RC, or it might be that > binary builds may not need to be voted on (I forgot the details there). 
> > Matei >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >> - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the Maven artifacts. Patrick I see you just commented on SPARK-5134 and will follow up there. Sounds like this may accidentally not be a problem. On binary tarball releases, I wonder if anyone has an opinion on my opinion that these shouldn't be distributed for specific Hadoop *distributions* to begin with. (Won't repeat the argument here yet.) That resolves this n x m explosion too. Vendors already provide their own distribution, yes, that's their job. On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar wrote: > Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop > Distributions X ... > > May be one option is to have a minimum basic set (which I know is what we > are discussing) and move the rest to spark-packages.org. There the vendors > can add the latest downloads - for example when 1.4 is released, HDP can > build a release of HDP Spark 1.4 bundle. > > Cheers > > > On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: >> >> We probably want to revisit the way we do binaries in general for >> 1.4+. IMO, something worth forking a separate thread for. >> >> I've been hesitating to add new binaries because people >> (understandably) complain if you ever stop packaging older ones, but >> on the other hand the ASF has complained that we have too many >> binaries already and that we need to pare it down because of the large >> volume of files. Doubling the number of binaries we produce for Scala >> 2.11 seemed like it would be too much. >> >> One solution potentially is to actually package "Hadoop provided" >> binaries and encourage users to use these by simply setting >> HADOOP_HOME, or have instructions for specific distros. I've heard >> that our existing packages don't work well on HDP for instance, since >> there are some configuration quirks that differ from the upstream >> Hadoop. >> >> If we cut down on the cross building for Hadoop versions, then it is >> more tenable to cross build for Scala versions without exploding the >> number of binaries. >> >> - Patrick >> >> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: >> > Yeah, interesting question of what is the better default for the >> > single set of artifacts published to Maven. I think there's an >> > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros >> > and cons discussed more at >> > >> > https://issues.apache.org/jira/browse/SPARK-5134 >> > https://github.com/apache/spark/pull/3917 >> > >> > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia >> > wrote: >> >> +1 >> >> >> >> Tested it on Mac OS X. >> >> >> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop >> >> 1 without Hive, which is kind of weird because people will more likely >> >> want >> >> Hadoop 2 with Hive. So it would be good to publish a build for that >> >> configuration instead. We can do it if we do a new RC, or it might be that >> >> binary builds may not need to be voted on (I forgot the details there). >> >> >> >> Matei >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop Distributions X ... May be one option is to have a minimum basic set (which I know is what we are discussing) and move the rest to spark-packages.org. There the vendors can add the latest downloads - for example when 1.4 is released, HDP can build a release of HDP Spark 1.4 bundle. Cheers On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell wrote: > We probably want to revisit the way we do binaries in general for > 1.4+. IMO, something worth forking a separate thread for. > > I've been hesitating to add new binaries because people > (understandably) complain if you ever stop packaging older ones, but > on the other hand the ASF has complained that we have too many > binaries already and that we need to pare it down because of the large > volume of files. Doubling the number of binaries we produce for Scala > 2.11 seemed like it would be too much. > > One solution potentially is to actually package "Hadoop provided" > binaries and encourage users to use these by simply setting > HADOOP_HOME, or have instructions for specific distros. I've heard > that our existing packages don't work well on HDP for instance, since > there are some configuration quirks that differ from the upstream > Hadoop. > > If we cut down on the cross building for Hadoop versions, then it is > more tenable to cross build for Scala versions without exploding the > number of binaries. > > - Patrick > > On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: > > Yeah, interesting question of what is the better default for the > > single set of artifacts published to Maven. I think there's an > > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros > > and cons discussed more at > > > > https://issues.apache.org/jira/browse/SPARK-5134 > > https://github.com/apache/spark/pull/3917 > > > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia > wrote: > >> +1 > >> > >> Tested it on Mac OS X. > >> > >> One small issue I noticed is that the Scala 2.11 build is using Hadoop > 1 without Hive, which is kind of weird because people will more likely want > Hadoop 2 with Hive. So it would be good to publish a build for that > configuration instead. We can do it if we do a new RC, or it might be that > binary builds may not need to be voted on (I forgot the details there). > >> > >> Matei > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
We probably want to revisit the way we do binaries in general for 1.4+. IMO, something worth forking a separate thread for. I've been hesitating to add new binaries because people (understandably) complain if you ever stop packaging older ones, but on the other hand the ASF has complained that we have too many binaries already and that we need to pare it down because of the large volume of files. Doubling the number of binaries we produce for Scala 2.11 seemed like it would be too much. One solution potentially is to actually package "Hadoop provided" binaries and encourage users to use these by simply setting HADOOP_HOME, or have instructions for specific distros. I've heard that our existing packages don't work well on HDP for instance, since there are some configuration quirks that differ from the upstream Hadoop. If we cut down on the cross building for Hadoop versions, then it is more tenable to cross build for Scala versions without exploding the number of binaries. - Patrick On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: > Yeah, interesting question of what is the better default for the > single set of artifacts published to Maven. I think there's an > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros > and cons discussed more at > > https://issues.apache.org/jira/browse/SPARK-5134 > https://github.com/apache/spark/pull/3917 > > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia wrote: >> +1 >> >> Tested it on Mac OS X. >> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 >> without Hive, which is kind of weird because people will more likely want >> Hadoop 2 with Hive. So it would be good to publish a build for that >> configuration instead. We can do it if we do a new RC, or it might be that >> binary builds may not need to be voted on (I forgot the details there). >> >> Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Yeah, interesting question of what is the better default for the single set of artifacts published to Maven. I think there's an argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros and cons discussed more at https://issues.apache.org/jira/browse/SPARK-5134 https://github.com/apache/spark/pull/3917 On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia wrote: > +1 > > Tested it on Mac OS X. > > One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 > without Hive, which is kind of weird because people will more likely want > Hadoop 2 with Hive. So it would be good to publish a build for that > configuration instead. We can do it if we do a new RC, or it might be that > binary builds may not need to be voted on (I forgot the details there). > > Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 Tested it on Mac OS X. One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 without Hive, which is kind of weird because people will more likely want Hadoop 2 with Hive. So it would be good to publish a build for that configuration instead. We can do it if we do a new RC, or it might be that binary builds may not need to be voted on (I forgot the details there). Matei > On Mar 5, 2015, at 9:52 PM, Patrick Wendell wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.3.0! > > The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc > > The release files, including signatures, digests, etc. can be found at: > http://people.apache.org/~pwendell/spark-1.3.0-rc3/ > > Release artifacts are signed with the following key: > https://people.apache.org/keys/committer/pwendell.asc > > Staging repositories for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1078 > > The documentation corresponding to this release can be found at: > http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/ > > Please vote on releasing this package as Apache Spark 1.3.0! > > The vote is open until Monday, March 09, at 02:52 UTC and passes if > a majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release this package as Apache Spark 1.3.0 > [ ] -1 Do not release this package because ... > > To learn more about Apache Spark, please see > http://spark.apache.org/ > > == How does this compare to RC2 == > This release includes the following bug fixes: > > https://issues.apache.org/jira/browse/SPARK-6144 > https://issues.apache.org/jira/browse/SPARK-6171 > https://issues.apache.org/jira/browse/SPARK-5143 > https://issues.apache.org/jira/browse/SPARK-6182 > https://issues.apache.org/jira/browse/SPARK-6175 > > == How can I help test this release? == > If you are a Spark user, you can help us test this release by > taking a Spark 1.2 workload and running on this release candidate, > then reporting any regressions. > > If you are happy with this release based on your own testing, give a +1 vote. > > == What justifies a -1 vote for this release? == > This vote is happening towards the end of the 1.3 QA period, > so -1 votes should only occur for significant regressions from 1.2.1. > Bugs already present in 1.2.X, minor regressions, or bugs related > to new features will not block this release. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Loading previously serialized object to Spark
Can you paste the complete code?

Thanks
Best Regards

On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander wrote:
> Hi,
>
> I've implemented class MyClass in MLlib that does some operation on
> LabeledPoint. MyClass extends Serializable, so I can map this operation on
> data of RDD[LabeledPoint], such as data.map(lp => MyClass.operate(lp)). I
> write this class to a file with ObjectOutputStream.writeObject. Then I stop
> and restart Spark. I load this class from the file with
> ObjectInputStream.readObject.asInstanceOf[MyClass]. When I try to map the
> same operation of this class over an RDD, Spark throws a not-serializable
> exception:
>
> org.apache.spark.SparkException: Task not serializable
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
>     at org.apache.spark.rdd.RDD.map(RDD.scala:273)
>
> Could you suggest why it throws this exception even though MyClass is
> serializable by definition?
>
> Best regards, Alexander
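A minimal, self-contained version of the scenario described might look like the sketch below (the class name, file path, and data are placeholders, not the poster's actual code). In cases like this, a "Task not serializable" error usually means the map closure also captures a non-serializable enclosing object, such as an outer class or a spark-shell REPL line object, rather than the deserialized instance itself:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Placeholder for the poster's class: serializable and operating on LabeledPoint.
class MyClass extends Serializable {
  def operate(lp: LabeledPoint): Double = lp.label * 2.0
}

object ReloadAndMap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reload-and-map"))

    // First session: write the instance to disk.
    val out = new ObjectOutputStream(new FileOutputStream("/tmp/myclass.ser"))
    out.writeObject(new MyClass)
    out.close()

    // Later session: read it back.
    val in = new ObjectInputStream(new FileInputStream("/tmp/myclass.ser"))
    val restored = in.readObject().asInstanceOf[MyClass]
    in.close()

    // If `restored` and everything else the closure captures are serializable,
    // this map passes the ClosureCleaner check. The exception in the question
    // is typically triggered when the closure also drags in a non-serializable
    // enclosing object (an outer class, or a REPL line object in spark-shell).
    val data = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 2.0))))
    println(data.map(lp => restored.operate(lp)).collect().mkString(", "))

    sc.stop()
  }
}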