Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version-specific - that is what we should depend on anyway. Though I hope Sean is correct in assuming that vendor-specific builds for hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users! Regards, Mridul

On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen so...@cloudera.com wrote: Yes, you should always find working bits at Apache no matter what -- though 'no matter what' really means 'as long as you use a Hadoop distro compatible with upstream Hadoop'. Even distros have a strong interest in that, since the market, the 'pie', is made large by this kind of freedom at the core. If so, then no vendor-specific builds are needed, only some Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be good (although I'm not yet clear if there's something about 2.5 or 2.6 that needs a different build). I take it that we already believe that, say, the Hadoop 2.4 build works with CDH5, so no CDH5-specific build is provided by Spark. If a distro doesn't work with stock Spark, then it's either something Spark should fix (e.g. use of a private YARN API or something), or it's something the distro should really fix because it's incompatible. Could we maybe rename the CDH4 build then, as it doesn't really work with all CDH4, to be a Hadoop 2.0.x build? That's been floated before. And can we remove the MapR builds -- or else can someone explain why these exist separately from a Hadoop 2.3 build? I hope it is not *because* they are somehow non-standard. And shall we first run down why Spark doesn't fully work on HDP and see if it's something that Spark or HDP needs to tweak, rather than contemplate another binary? Or, if so, can it simply be called a Hadoop 2.7 + YARN-whatever build and not made specific to a vendor, even if the project has to field another tarball combo for a vendor? Maybe we are saying almost the same thing.

On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4. Matei

On Mar 8, 2015, at 8:02 PM, Sean Owen so...@cloudera.com wrote: Yeah, it's not much overhead, but here's an example of where it causes a little issue. I like that reasoning. However, the released builds don't track the later versions of Hadoop that vendors would be distributing -- there's no Hadoop 2.6 build, for example. CDH4 is here, but not the far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't actually work with many CDH4 versions. I agree with the goal of maximizing the reach of Spark, but I don't know how much these builds advance that goal. Anyone can roll their own exactly-right build, and the docs and build have been set up to make that as simple as can be expected. So these aren't *required* to let me use the latest Spark on distribution X. I had thought these existed to sorta support 'legacy' distributions, like CDH4, and that build was justified as a quasi-Hadoop-2.0.x-flavored build. But then I don't understand what the MapR profiles are for.
I think it's too much work to correctly, in parallel, maintain any customizations necessary for every major distro, and it might be better not to do it at all than to do it incompletely. You could say it's also an enabler for distros to vary in ways that require special customization. Maybe there's a concern that, if lots of people consume Spark on Hadoop, and most people consume Hadoop through distros, and distros alone manage Spark distributions, then you de facto 'have to' go through a distro instead of getting bits from Spark? Different conversation, but I think this sort of effect does not end up being a negative. Well, anyway, I like the idea of seeing how far Hadoop-provided releases can help. It might kill several birds with one stone.

On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead. Matei

On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote: Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the Maven artifacts. Patrick, I see you just commented on SPARK-5134 and will follow up there. Sounds like this may accidentally not be a problem. On binary tarball releases, I wonder if anyone has an opinion on my opinion that
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Hey All, Today there was a JIRA posted with an observed regression around Spark Streaming during certain recovery scenarios: https://issues.apache.org/jira/browse/SPARK-6222 My preference is to go ahead and ship this release (RC3) as-is; if this issue is isolated and resolved soon, we can make a patch release in the next week or two. At some point, the cost of continuing to hold the release for another vote is so high that it's better to just ship the release. We can document known issues and point users to a fix once it's available. We did this in 1.2.0 as well (there were two small known issues) and I think, as a point of process, this approach is necessary given the size of the project. I wanted to notify this thread, though, in case this changes anyone's opinion on their release vote. I will leave the thread open at least until the end of today. Still +1 on RC3, for me. - Patrick

On Mon, Mar 9, 2015 at 9:36 AM, Denny Lee denny.g@gmail.com wrote: +1 (non-binding) Spark Standalone and YARN on Hadoop 2.6 on OS X, plus various tests (MLlib, Spark SQL, etc.)

On Mon, Mar 9, 2015 at 9:18 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Built from source and ran Spark on YARN on Hadoop 2.6 in cluster and client mode. Tom

On Thursday, March 5, 2015 8:53 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc3 (commit 4aaf48d4): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc Staging repositories for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1078 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/ Please vote on releasing this package as Apache Spark 1.3.0! The vote is open until Monday, March 09, at 02:52 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/

== How does this compare to RC2? == This release includes the following bug fixes: https://issues.apache.org/jira/browse/SPARK-6144 https://issues.apache.org/jira/browse/SPARK-6171 https://issues.apache.org/jira/browse/SPARK-5143 https://issues.apache.org/jira/browse/SPARK-6182 https://issues.apache.org/jira/browse/SPARK-6175

== How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload, running it on this release candidate, and reporting any regressions. If you are happy with this release based on your own testing, give a +1 vote.

== What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.X, minor regressions, or bugs related to new features will not block this release.
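As a concrete illustration of the "How can I help test this release?" section above, a minimal smoke test from bin/spark-shell in the RC tarball can look roughly like the sketch below. The README.md path is only an example; any local text file works.

    // Word count over a local file: exercises textFile, a shuffle, and two actions.
    val lines = sc.textFile("README.md")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(s"distinct words: ${counts.count()}")
    counts.take(5).foreach(println)

A real regression check would of course run an existing 1.2 workload, but even this exercises the scheduler, shuffle, and shell on the candidate bits.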
Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Does the Apache project team have any ability to measure download counts of the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4 for example.

On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan mri...@gmail.com wrote: In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version-specific - that is what we should depend on anyway. Though I hope Sean is correct in assuming that vendor-specific builds for hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users! Regards, Mridul
RE: Using CUDA within Spark / boosting linear algebra
Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the support of Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Best regards, Alexander

-Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Tuesday, March 03, 2015 1:54 PM To: Xiangrui Meng; Joseph Bradley Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng men...@gmail.com writes: Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote: Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled). 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java). I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that: BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas. Below is the link to the spreadsheet with full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine’s RAM?

-Original Message- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM To: Evan R. Sparks Cc: Joseph Bradley; dev@spark.apache.org Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks, Evan!
It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.

|A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
+------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
|100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
|1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
|10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |

It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on my machine. Probably, I’ll add two more columns with locally compiled openblas and cuda. Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705) It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us
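For anyone who wants to reproduce a row of Alexander's table above on their own hardware, a rough Breeze-based sketch follows. The sizes, single-run timing, and lack of warm-up are simplifications of what a careful benchmark would do, and which BLAS actually backs the multiply depends on what netlib-java finds on the machine.

    import breeze.linalg.DenseMatrix

    // Times one n x n * n x n multiply; a serious benchmark would warm up
    // the JIT and average over many runs.
    def timeMultiply(n: Int): Double = {
      val a = DenseMatrix.rand[Double](n, n)
      val b = DenseMatrix.rand[Double](n, n)
      val start = System.nanoTime()
      val c = a * b
      val seconds = (System.nanoTime() - start) / 1e9
      println(s"${n}x$n * ${n}x$n: $seconds s (c(0,0) = ${c(0, 0)})")
      seconds
    }

    Seq(100, 1000).foreach(timeMultiply)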
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|A x - b\|_2^2 to 1/(2n) * \|A x - b\|_2^2 to be consistent with common least squares formulations. It means you can reproduce the same result by multiplying the step size by 2. This is not a problem if both run until convergence (if they do not blow up). However, in your example, a very small step size is chosen and it didn't converge in 100 iterations. In this case, the step size matters. I will put a note in the migration guide. Thanks! -Xiangrui

On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen so...@cloudera.com wrote: I'm +1, as I have not heard of anyone else seeing the Hive test failure, which is likely a test issue rather than a code issue anyway, and not a blocker.

On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen so...@cloudera.com wrote: Although the problem is small, especially if indeed the essential docs changes are following just a couple days behind the final release -- I mean, why the rush if they're essential? Wait a couple of days, finish them, make the release. Answer is, I think these changes aren't actually essential given the comment from tdas, so: just mark these Critical? (Although ... they do say they're changes for the 1.3 release, so kind of funny to get to them for 1.3.x or 1.4, but that's not important now.) I thought that Blocker really meant Blocker in this project, as I've been encouraged to use it to mean don't release without this. I think we should use it that way. Just thinking of it as extra Critical doesn't add anything. I don't think Documentation should be special-cased as less important, and I don't think there's confusion if Blocker means what it says, so I'd 'fix' that way. If nobody sees the Hive failure I observed, and if we can just zap those Blockers one way or the other, +1

On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote: Sean, The docs are distributed and consumed in a fundamentally different way than Spark code itself. So we've always considered the deadline for doc changes to be when the release is finally posted. If there are small inconsistencies with the docs present in the source code for that release tag, IMO that doesn't matter much, since we don't even distribute the docs with Spark's binary releases and virtually no one builds and hosts the docs on their own (that I am aware of, at least). Perhaps we can recommend that if people want to build the doc sources, they should always grab the head of the most recent release branch, to set expectations accordingly. In the past we haven't considered it worth holding up the release process for the purpose of the docs. It just doesn't make sense since they are consumed as a service. If we decide to change this convention, it would mean shipping our releases later, since we couldn't pipeline the doc finalization with voting. - Patrick

On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen so...@cloudera.com wrote: Given the title and tagging, it sounds like there could be some must-have doc changes to go with what is being released as 1.3. It can be finished later, and published later, but then the docs source shipped with the release doesn't match the site, and until then, 1.3 is released without some must-have docs for 1.3 on the site. The real question to me is: are there any further, absolutely essential doc changes that need to accompany 1.3 or not? If not, just resolve these. If there are, then it seems like the release has to block on them.
If there are some docs that should have gone in for 1.3, but didn't, but aren't essential, well, I suppose it bears thinking about how to not slip as much work, but it doesn't block. I think Documentation issues certainly can be a blocker and shouldn't be specially ignored. BTW the UISeleniumSuite issue is a real failure, but I do not think it is serious: http://issues.apache.org/jira/browse/SPARK-6205 It isn't a regression from 1.2.x, but only affects tests, and only affects a subset of build profiles.

On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, SPARK-5310 Update SQL programming guide for 1.3 SPARK-5183 Document data source API SPARK-6128 Update Spark Streaming Guide for Spark 1.3 For these, the issue is that they are documentation JIRAs, which don't need to be timed exactly with the release vote, since we can update the documentation on the website whenever we want. In the past I've just mentally filtered these out when considering RCs. I see a few options here: 1. We downgrade such issues away from Blocker (more clear, but we risk losing them in the fray if they really are things we want to have before the release is posted). 2. We provide a filter to the community that excludes 'Documentation' issues and shows all other blockers for 1.3. We can put this on the wiki, for instance. Which do you prefer? - Patrick
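Going back to Xiangrui's note at the top of this message: the reason multiplying the step size by 2 reproduces the old behaviour is just the factor-of-two change in the gradient. A quick derivation (my own summary, not part of the original mail):

\[
\nabla\Big(\tfrac{1}{n}\,\lVert Ax-b\rVert_2^2\Big) = \tfrac{2}{n}\,A^\top(Ax-b),
\qquad
\nabla\Big(\tfrac{1}{2n}\,\lVert Ax-b\rVert_2^2\Big) = \tfrac{1}{n}\,A^\top(Ax-b).
\]

The new gradient is exactly half the old one, so a gradient step \(x \leftarrow x - \alpha \nabla f(x)\) with step size \(2\alpha\) under the new objective produces the same iterates as step size \(\alpha\) under the old objective, as long as neither run diverges.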
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 (non-binding) - Verified signatures - Built on Mac OS X and Fedora 21.

On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar ksanka...@gmail.com wrote: Excellent, Thanks Xiangrui. The mystery is solved. Cheers k/
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
+1 Tested on Mac OS X
RE: Using CUDA within Spark / boosting linear algebra
Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...
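One thing worth recording alongside any such chart is which BLAS implementation netlib-java actually resolved on the machine being measured, since that is what the netlib-* columns in the spreadsheet really compare. A small sketch, assuming netlib-java is on the classpath (e.g. pulled in via Breeze/MLlib):

    // Prints the concrete implementation netlib-java picked at runtime, e.g.
    // com.github.fommil.netlib.NativeSystemBLAS, NativeRefBLAS or F2jBLAS.
    import com.github.fommil.netlib.BLAS
    println(BLAS.getInstance().getClass.getName)

    // netlib-java also honours a system property to force an implementation, e.g.
    // -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS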
Re: enum-like types in Spark
Can you expand on the serde issues with java enums at all? I haven't heard of any problems specific to enums. The java object serialization rules seem very clear and it doesn't seem like different jvms should have a choice on what they do: http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469 (in a nutshell, serialization must use enum.name()). Of course there are plenty of ways the user could screw this up (e.g. rename the enums, or change their meaning, or remove them). But then again, all of java serialization has issues the user has to be aware of. E.g., if we go with case objects, then java serialization blows up if you add another helper method, even if that helper method is completely compatible. Some prior debate in the scala community: https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ SO post on which version to use in scala: http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types SO post about the macro-craziness people try to add to scala to make them almost as good as a simple java enum (NB: the accepted answer doesn't actually work in all cases ...): http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched Another proposal to add better enums built into scala ... but it seems to be dormant: https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk

On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan mri...@gmail.com wrote: I have a strong dislike for java enums due to the fact that they are not stable across JVMs - if an enum undergoes serde, you end up with unpredictable results at times [1]. This is one of the reasons why we prevent enums from being used as keys: it is highly possible users might depend on it internally and shoot themselves in the foot. Would be better to keep away from them in general and use something more stable. Regards, Mridul [1] Having had to debug this issue for 2 weeks - I really, really hate it.

On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid iras...@cloudera.com wrote: I have a very strong dislike for #1 (scala enumerations). I'm ok with #4 (with Xiangrui's final suggestion, especially making it sealed and available in Java), but I really think #2, java enums, is the best option. Java enums actually have some very real advantages over the other approaches -- you get values(), valueOf(), EnumSet, and EnumMap. There has been endless debate in the Scala community about the problems with the approaches in Scala. Very smart, level-headed Scala gurus have complained about their shortcomings (Rex Kerr's name is coming to mind, though I'm not positive about that); there have been numerous well-thought-out proposals to give Scala a better enum. But the powers that be in Scala always reject them. IIRC the explanation for rejecting them is basically that (a) enums aren't important enough to introduce some new special feature, scala's got bigger things to work on, and (b) if you really need a good enum, just use java's enum. I doubt it really matters that much for Spark internals, which is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran

On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: `case object` inside an `object` doesn't show up in Java.
This is the minimal code I found to make everything show up correctly in both Scala and Java:

    sealed abstract class StorageLevel // cannot be a trait

    object StorageLevel {
      private[this] case object _MemoryOnly extends StorageLevel
      final val MemoryOnly: StorageLevel = _MemoryOnly

      private[this] case object _DiskOnly extends StorageLevel
      final val DiskOnly: StorageLevel = _DiskOnly
    }

On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell pwend...@gmail.com wrote: I like #4 as well and agree with Aaron's suggestion. - Patrick

On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson ilike...@gmail.com wrote: I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for StorageLevel). Otherwise we may pollute a higher namespace. e.g. we SHOULD do:

    trait StorageLevel
    object StorageLevel {
      case object MemoryOnly extends StorageLevel
      case object DiskOnly extends StorageLevel
    }

On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust mich...@databricks.com wrote: #4 with a preference for CamelCaseEnums

On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com wrote: another vote for #4 People are already used to adding () in Java.

On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote: #4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html
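For what it's worth, here is a small sketch (my own illustration, not something from the thread) of how option #4 reads from calling code when the trait is sealed; the compiler can then warn about non-exhaustive matches, which is a large part of the appeal over an unsealed trait:

    sealed trait StorageLevel
    object StorageLevel {
      case object MemoryOnly extends StorageLevel
      case object DiskOnly extends StorageLevel
    }

    // Because StorageLevel is sealed, forgetting a case here produces a
    // "match may not be exhaustive" warning at compile time.
    def describe(level: StorageLevel): String = level match {
      case StorageLevel.MemoryOnly => "deserialized objects kept in memory"
      case StorageLevel.DiskOnly   => "serialized blocks kept on disk"
    }

    // From Java, nested case objects are awkward to reach directly, which is
    // why Xiangrui's variant above re-exposes them as final vals.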
Re: [VOTE] Release Apache Spark 1.3.0 (RC3)
Excellent, Thanks Xiangrui. The mystery is solved. Cheers k/
Re: missing explanation of cache in the documentation of cluster overview
It's explained at https://spark.apache.org/docs/latest/programming-guide.html and its configuration at https://spark.apache.org/docs/latest/configuration.html Have a read over all the docs first.

On Mon, Mar 9, 2015 at 9:24 AM, Hui WANG hedonp...@gmail.com wrote: Hello Guys, I'm reading the documentation of the cluster mode overview at https://spark.apache.org/docs/latest/cluster-overview.html. In the diagram, a cache is shown alongside each executor, but no explanation is given for it. Can someone please help explain it and improve this page? -- Hui WANG Tel : +33 (0) 6 71 33 45 39 Blog : http://www.hui-wang.info
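For completeness, the cache box in that diagram is the storage each executor uses for cached (persisted) RDD blocks. A minimal illustration, assuming a running SparkContext named sc as in spark-shell:

    import org.apache.spark.storage.StorageLevel

    // Mark an RDD for caching; partitions are stored on the executors after
    // the first action computes them.
    val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
    doubled.persist(StorageLevel.MEMORY_ONLY)

    doubled.count()  // computes and caches the partitions on the executors
    doubled.count()  // served from the executors' in-memory cache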