Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version-specific - that is what we should depend on anyway. Though I hope Sean is correct in assuming that vendor-specific builds for hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users! Regards, Mridul

On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen so...@cloudera.com wrote: Yes, you should always find working bits at Apache no matter what -- though 'no matter what' really means 'as long as you use a Hadoop distro compatible with upstream Hadoop'. Even distros have a strong interest in that, since the market, the 'pie', is made large by this kind of freedom at the core. If so, then no vendor-specific builds are needed, only some Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be good (although I'm not yet clear if there's something about 2.5 or 2.6 that needs a different build). I take it that we already believe that, say, the Hadoop 2.4 build works with CDH5, so no CDH5-specific build is provided by Spark. If a distro doesn't work with stock Spark, then it's either something Spark should fix (e.g. use of a private YARN API or something), or it's something the distro should really fix because it's incompatible. Could we maybe rename the CDH4 build then, as it doesn't really work with all CDH4, to be a Hadoop 2.0.x build? That's been floated before. And can we remove the MapR builds -- or else can someone explain why these exist separately from a Hadoop 2.3 build? I hope it is not *because* they are somehow non-standard. And shall we first run down why Spark doesn't fully work on HDP and see if it's something that Spark or HDP needs to tweak, rather than contemplate another binary? Or, if so, can it simply be called a Hadoop 2.7 + YARN-whatever build and not made specific to a vendor, even if the project has to field another tarball combo for a vendor? Maybe we are saying almost the same thing.

On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4. Matei

On Mar 8, 2015, at 8:02 PM, Sean Owen so...@cloudera.com wrote: Yeah, it's not much overhead, but here's an example of where it causes a little issue. I like that reasoning. However, the released builds don't track the later versions of Hadoop that vendors would be distributing -- there's no Hadoop 2.6 build, for example. CDH4 is here, but not the far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't actually work with many CDH4 versions. I agree with the goal of maximizing the reach of Spark, but I don't know how much these builds advance that goal. Anyone can roll their own exactly-right build, and the docs and build have been set up to make that as simple as can be expected. So these aren't *required* to let me use the latest Spark on distribution X. I had thought these existed to sorta support 'legacy' distributions, like CDH4, and that build was justified as a quasi-Hadoop-2.0.x-flavored build. But then I don't understand what the MapR profiles are for.
I think it's too much work to correctly, in parallel, maintain any customizations necessary for every major distro, and it might be better not to do it at all than to do it incompletely. You could say it's also an enabler for distros to vary in ways that require special customization. Maybe there's a concern that, if lots of people consume Spark on Hadoop, and most people consume Hadoop through distros, and distros alone manage Spark distributions, then you de facto 'have to' go through a distro instead of getting bits from Spark? Different conversation, but I think this sort of effect does not end up being a negative. Well, anyway, I like the idea of seeing how far Hadoop-provided releases can help. It might kill several birds with one stone.

On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead. Matei

On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote: Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the Maven artifacts. Patrick, I see you just commented on SPARK-5134 and will follow up there. Sounds like this may accidentally not be a problem. On binary tarball releases, I wonder if anyone has an opinion on my opinion that
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Hey All, Today there was a JIRA posted with an observed regression around Spark Streaming during certain recovery scenarios: https://issues.apache.org/jira/browse/SPARK-6222 My preference is to go ahead and ship this release (RC3) as-is; if this issue is isolated and resolved soon, we can make a patch release in the next week or two. At some point, the cost of continuing to hold the release for another vote is so high that it's better to just ship the release. We can document known issues and point users to a fix once it's available. We did this in 1.2.0 as well (there were two small known issues) and I think, as a point of process, this approach is necessary given the size of the project. I wanted to notify this thread, though, in case this changes anyone's opinion on their release vote. I will leave the thread open at least until the end of today. Still +1 on RC3, for me. - Patrick

On Mon, Mar 9, 2015 at 9:36 AM, Denny Lee denny.g@gmail.com wrote: +1 (non-binding) Spark Standalone and YARN on Hadoop 2.6 on OS X, plus various tests (MLlib, Spark SQL, etc.)

On Mon, Mar 9, 2015 at 9:18 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Built from source and ran Spark on YARN on Hadoop 2.6 in cluster and client mode. Tom

On Thursday, March 5, 2015 8:53 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc3 (commit 4aaf48d4): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc Staging repositories for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1078 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.3.0! The vote is open until Monday, March 09, at 02:52 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/

== How does this compare to RC2? == This release includes the following bug fixes: https://issues.apache.org/jira/browse/SPARK-6144 https://issues.apache.org/jira/browse/SPARK-6171 https://issues.apache.org/jira/browse/SPARK-5143 https://issues.apache.org/jira/browse/SPARK-6182 https://issues.apache.org/jira/browse/SPARK-6175

== How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload, running it on this release candidate, and reporting any regressions. If you are happy with this release based on your own testing, give a +1 vote.

== What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.X, minor regressions, or bugs related to new features will not block this release.
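As a concrete illustration of the "How can I help test this release?" section above, a minimal smoke test from bin/spark-shell in the RC tarball can look roughly like the sketch below. The README.md path is only an example; any local text file works.

    // Word count over a local file: exercises textFile, a shuffle, and two actions.
    val lines = sc.textFile("README.md")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(s"distinct words: ${counts.count()}")
    counts.take(5).foreach(println)

A real regression check would of course run an existing 1.2 workload, but even this exercises the scheduler, shuffle, and shell on the candidate bits.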
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Does the Apache project team have any ability to measure download counts of the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4 for example.

On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan mri...@gmail.com wrote: In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version-specific - that is what we should depend on anyway. Though I hope Sean is correct in assuming that vendor-specific builds for hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users! Regards, Mridul
RE: Using CUDA within Spark / boosting linear algebra
Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander

-Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Tuesday, March 03, 2015 1:54 PM To: Xiangrui Meng; Joseph Bradley Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng men...@gmail.com writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled). 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java). I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine’s RAM?

-Original Message- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM To: Evan R. Sparks Cc: Joseph Bradley; dev@spark.apache.org Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks, Evan!
It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.

|A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
+------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
|100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
|1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
|10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |

It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on my machine. Probably, I’ll add two more columns with locally compiled openblas and cuda. Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us
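For anyone who wants to reproduce a row of Alexander's table above on their own hardware, a rough Breeze-based sketch follows. The sizes, single-run timing, and lack of warm-up are simplifications of what a careful benchmark would do, and which BLAS actually backs the multiply depends on what netlib-java finds on the machine.

    import breeze.linalg.DenseMatrix

    // Times one n x n * n x n multiply; a serious benchmark would warm up
    // the JIT and average over many runs.
    def timeMultiply(n: Int): Double = {
      val a = DenseMatrix.rand[Double](n, n)
      val b = DenseMatrix.rand[Double](n, n)
      val start = System.nanoTime()
      val c = a * b
      val seconds = (System.nanoTime() - start) / 1e9
      println(s"${n}x$n * ${n}x$n: $seconds s (c(0,0) = ${c(0, 0)})")
      seconds
    }

    Seq(100, 1000).foreach(timeMultiply)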
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|A x - b\|_2^2 to 1/(2n) * \|A x - b\|_2^2 to be consistent with common least squares formulations. It means you can reproduce the same result by multiplying the step size by 2. This is not a problem if both run until convergence (if they do not blow up). However, in your example, a very small step size is chosen and it didn't converge in 100 iterations. In this case, the step size matters. I will put a note in the migration guide. Thanks! -Xiangrui

On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen so...@cloudera.com wrote: I'm +1, as I have not heard of anyone else seeing the Hive test failure, which is likely a test issue rather than a code issue anyway, and not a blocker.

On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen so...@cloudera.com wrote: Although the problem is small, especially if indeed the essential docs changes are following just a couple days behind the final release -- I mean, why the rush if they're essential? Wait a couple of days, finish them, make the release. Answer is, I think these changes aren't actually essential given the comment from tdas, so: just mark these Critical? (Although ... they do say they're changes for the 1.3 release, so kind of funny to get to them for 1.3.x or 1.4, but that's not important now.) I thought that Blocker really meant Blocker in this project, as I've been encouraged to use it to mean don't release without this. I think we should use it that way. Just thinking of it as extra Critical doesn't add anything. I don't think Documentation should be special-cased as less important, and I don't think there's confusion if Blocker means what it says, so I'd 'fix' that way. If nobody sees the Hive failure I observed, and if we can just zap those Blockers one way or the other, +1

On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote: Sean, The docs are distributed and consumed in a fundamentally different way than Spark code itself. So we've always considered the deadline for doc changes to be when the release is finally posted. If there are small inconsistencies with the docs present in the source code for that release tag, IMO that doesn't matter much, since we don't even distribute the docs with Spark's binary releases and virtually no one builds and hosts the docs on their own (that I am aware of, at least). Perhaps we can recommend that if people want to build the doc sources, they should always grab the head of the most recent release branch, to set expectations accordingly. In the past we haven't considered it worth holding up the release process for the purpose of the docs. It just doesn't make sense since they are consumed as a service. If we decide to change this convention, it would mean shipping our releases later, since we couldn't pipeline the doc finalization with voting. - Patrick

On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen so...@cloudera.com wrote: Given the title and tagging, it sounds like there could be some must-have doc changes to go with what is being released as 1.3. It can be finished later, and published later, but then the docs source shipped with the release doesn't match the site, and until then, 1.3 is released without some must-have docs for 1.3 on the site. The real question to me is: are there any further, absolutely essential doc changes that need to accompany 1.3 or not? If not, just resolve these. If there are, then it seems like the release has to block on them.
If there are some docs that should have gone in for 1.3, but didn't, but aren't essential, well, I suppose it bears thinking about how to not slip as much work, but it doesn't block. I think Documentation issues certainly can be a blocker and shouldn't be specially ignored. BTW the UISeleniumSuite issue is a real failure, but I do not think it is serious: http://issues.apache.org/jira/browse/SPARK-6205 It isn't a regression from 1.2.x, but only affects tests, and only affects a subset of build profiles.

On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, SPARK-5310 Update SQL programming guide for 1.3 SPARK-5183 Document data source API SPARK-6128 Update Spark Streaming Guide for Spark 1.3 For these, the issue is that they are documentation JIRAs, which don't need to be timed exactly with the release vote, since we can update the documentation on the website whenever we want. In the past I've just mentally filtered these out when considering RCs. I see a few options here: 1. We downgrade such issues away from Blocker (more clear, but we risk losing them in the fray if they really are things we want to have before the release is posted). 2. We provide a filter to the community that excludes 'Documentation' issues and shows all other blockers for 1.3. We can put this on the wiki, for instance. Which do you prefer? - Patrick
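Going back to Xiangrui's note at the top of this message: the reason multiplying the step size by 2 reproduces the old behaviour is just the factor-of-two change in the gradient. A quick derivation (my own summary, not part of the original mail):

\[
\nabla\Big(\tfrac{1}{n}\,\lVert Ax-b\rVert_2^2\Big) = \tfrac{2}{n}\,A^\top(Ax-b),
\qquad
\nabla\Big(\tfrac{1}{2n}\,\lVert Ax-b\rVert_2^2\Big) = \tfrac{1}{n}\,A^\top(Ax-b).
\]

The new gradient is exactly half the old one, so a gradient step \(x \leftarrow x - \alpha \nabla f(x)\) with step size \(2\alpha\) under the new objective produces the same iterates as step size \(\alpha\) under the old objective, as long as neither run diverges.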
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 (non-binding) - Verified signatures - Built on Mac OS X and Fedora 21.

On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar ksanka...@gmail.com wrote: Excellent, Thanks Xiangrui. The mystery is solved. Cheers k/
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 Tested on Mac OS X
RE: Using CUDA within Spark / boosting linear algebra
Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
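One thing worth recording alongside any such chart is which BLAS implementation netlib-java actually resolved on the machine being measured, since that is what the netlib-* columns in the spreadsheet really compare. A small sketch, assuming netlib-java is on the classpath (e.g. pulled in via Breeze/MLlib):

    // Prints the concrete implementation netlib-java picked at runtime, e.g.
    // com.github.fommil.netlib.NativeSystemBLAS, NativeRefBLAS or F2jBLAS.
    import com.github.fommil.netlib.BLAS
    println(BLAS.getInstance().getClass.getName)

    // netlib-java also honours a system property to force an implementation, e.g.
    // -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS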
Re: enum-like types in Spark
Can you expand on the serde issues with java enums at all? I haven't heard of any problems specific to enums. The java object serialization rules seem very clear and it doesn't seem like different jvms should have a choice on what they do: http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469 (in a nutshell, serialization must use enum.name()). Of course there are plenty of ways the user could screw this up (e.g. rename the enums, or change their meaning, or remove them). But then again, all of java serialization has issues the user has to be aware of. E.g., if we go with case objects, then java serialization blows up if you add another helper method, even if that helper method is completely compatible. Some prior debate in the scala community: https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ SO post on which version to use in scala: http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types SO post about the macro-craziness people try to add to scala to make them almost as good as a simple java enum (NB: the accepted answer doesn't actually work in all cases ...): http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched Another proposal to add better enums built into scala ... but it seems to be dormant: https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk

On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan mri...@gmail.com wrote: I have a strong dislike for java enums due to the fact that they are not stable across JVMs - if an enum undergoes serde, you end up with unpredictable results at times [1]. This is one of the reasons why we prevent enums from being used as keys: it is highly possible users might depend on it internally and shoot themselves in the foot. Would be better to keep away from them in general and use something more stable. Regards, Mridul [1] Having had to debug this issue for 2 weeks - I really, really hate it.

On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid iras...@cloudera.com wrote: I have a very strong dislike for #1 (scala enumerations). I'm ok with #4 (with Xiangrui's final suggestion, especially making it sealed and available in Java), but I really think #2, java enums, is the best option. Java enums actually have some very real advantages over the other approaches -- you get values(), valueOf(), EnumSet, and EnumMap. There has been endless debate in the Scala community about the problems with the approaches in Scala. Very smart, level-headed Scala gurus have complained about their shortcomings (Rex Kerr's name is coming to mind, though I'm not positive about that); there have been numerous well-thought-out proposals to give Scala a better enum. But the powers that be in Scala always reject them. IIRC the explanation for rejecting them is basically that (a) enums aren't important enough to introduce some new special feature, scala's got bigger things to work on, and (b) if you really need a good enum, just use java's enum. I doubt it really matters that much for Spark internals, which is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran

On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: `case object` inside an `object` doesn't show up in Java.
This is the minimal code I found to make everything show up correctly in both Scala and Java:

    sealed abstract class StorageLevel // cannot be a trait

    object StorageLevel {
      private[this] case object _MemoryOnly extends StorageLevel
      final val MemoryOnly: StorageLevel = _MemoryOnly

      private[this] case object _DiskOnly extends StorageLevel
      final val DiskOnly: StorageLevel = _DiskOnly
    }

On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell pwend...@gmail.com wrote: I like #4 as well and agree with Aaron's suggestion. - Patrick

On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson ilike...@gmail.com wrote: I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for StorageLevel). Otherwise we may pollute a higher namespace. e.g. we SHOULD do:

    trait StorageLevel
    object StorageLevel {
      case object MemoryOnly extends StorageLevel
      case object DiskOnly extends StorageLevel
    }

On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust mich...@databricks.com wrote: #4 with a preference for CamelCaseEnums

On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com wrote: another vote for #4 People are already used to adding () in Java.

On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html
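For what it's worth, here is a small sketch (my own illustration, not something from the thread) of how option #4 reads from calling code when the trait is sealed; the compiler can then warn about non-exhaustive matches, which is a large part of the appeal over an unsealed trait:

    sealed trait StorageLevel
    object StorageLevel {
      case object MemoryOnly extends StorageLevel
      case object DiskOnly extends StorageLevel
    }

    // Because StorageLevel is sealed, forgetting a case here produces a
    // "match may not be exhaustive" warning at compile time.
    def describe(level: StorageLevel): String = level match {
      case StorageLevel.MemoryOnly => "deserialized objects kept in memory"
      case StorageLevel.DiskOnly   => "serialized blocks kept on disk"
    }

    // From Java, nested case objects are awkward to reach directly, which is
    // why Xiangrui's variant above re-exposes them as final vals.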
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Excellent, Thanks Xiangrui. The mystery is solved. Cheers k/
Re: missing explanation of cache in the documentation of cluster overview
It's explained at https://spark.apache.org/docs/latest/programming-guide.html and its configuration at https://spark.apache.org/docs/latest/configuration.html Have a read over all the docs first.

On Mon, Mar 9, 2015 at 9:24 AM, Hui WANG hedonp...@gmail.com wrote: Hello Guys, I'm reading the documentation of the cluster mode overview at https://spark.apache.org/docs/latest/cluster-overview.html. In the diagram, a cache is shown alongside each executor, but no explanation is given for it. Can someone please help explain it and improve this page? -- Hui WANG Tel : +33 (0) 6 71 33 45 39 Blog : http://www.hui-wang.info
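For completeness, the cache box in that diagram is the storage each executor uses for cached (persisted) RDD blocks. A minimal illustration, assuming a running SparkContext named sc as in spark-shell:

    import org.apache.spark.storage.StorageLevel

    // Mark an RDD for caching; partitions are stored on the executors after
    // the first action computes them.
    val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
    doubled.persist(StorageLevel.MEMORY_ONLY)

    doubled.count()  // computes and caches the partitions on the executors
    doubled.count()  // served from the executors' in-memory cache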