Re: Temporary jenkins issue
It looks like this may be fixed soon in Jenkins:
https://issues.jenkins-ci.org/browse/JENKINS-25446
https://github.com/jenkinsci/flaky-test-handler-plugin/pull/1

On February 2, 2015 at 7:38:19 PM, Patrick Wendell (pwend...@gmail.com) wrote:

Hey All, I made a change to the Jenkins configuration that caused most builds to fail (I was attempting to enable a new plugin); I've reverted the change effective about 10 minutes ago. If you've seen recent build failures like the one below, they were caused by that change. Sorry about that.

ERROR: Publisher com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver aborted due to exception
java.lang.NoSuchMethodError: hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
        at com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.init(FlakyTestResultAction.java:78)
        at com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
        at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
        at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
        at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
        at hudson.model.Build$BuildExecution.post2(Build.java:183)
        at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
        at hudson.model.Run.execute(Run.java:1784)
        at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
        at hudson.model.ResourceController.execute(ResourceController.java:89)
        at hudson.model.Executor.run(Executor.java:240)

- Patrick

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Data source API | sizeInBytes should be to *Scan
Thanks for looking into this. If this is true, isn't this an issue today? The default implementation of sizeInBytes is 1 + the broadcast threshold. So, if Catalyst's cardinality estimation estimates even a small filter selectivity, it will result in broadcasting the relation. Therefore, shouldn't the default be much higher than the broadcast threshold?

Also, since the default implementation of sizeInBytes already exists in BaseRelation, I am not sure why the same/similar default implementation can't be provided within *Scan-specific sizeInBytes functions, with Catalyst always trusting the size returned by the data source API (the default implementation being to never broadcast). Another thing that could be done is to have sizeInBytes return Option[Long] so that Catalyst explicitly knows when the data source was able to estimate the size.

The reason why I would push for sizeInBytes in the *Scan interfaces is that at times the data source implementation can more accurately predict the output size. For example, data source implementations for MongoDB, ElasticSearch, Cassandra, etc. can easily use filter pushdowns to query the underlying storage and predict the size. Such predictions will be more accurate than Catalyst's. Therefore, if it's not a fundamental change in Catalyst, I think this makes sense.

Thanks, Aniket

On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote: We thought about this today after seeing this email. I actually built a patch for this (adding filter/column information to data source stats estimation), but ultimately dropped it due to the potential problems the change could cause. The main problem I see is that column pruning/predicate pushdowns are advisory, i.e. the data source might or might not apply those filters. Without significantly complicating the data source API, it is hard for the optimizer (and future cardinality estimation) to know whether the filter/column pushdowns were actually applied, and whether to incorporate that in cardinality estimation.
Imagine this scenario: a data source applies a filter and estimates the filter's selectivity at 0.1, so the reported data size is reduced to 10% of the original. Catalyst's own cardinality estimation then applies the filter selectivity of 0.1 again, and the estimated data size becomes 1% of the original, lower than the broadcast threshold. Catalyst decides to broadcast the table, but the actual table size is 10x the estimate.

On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi Spark SQL committers. I have started experimenting with the data sources API and I was wondering if it makes sense to move the method sizeInBytes from BaseRelation to the Scan interfaces. This is because a relation may be able to leverage filter pushdown to estimate size, potentially making a very large relation broadcast-able. Thoughts? Aniket
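The double-counting hazard described in this thread can be made concrete with a little arithmetic. A minimal sketch, assuming a hypothetical 100 MB table and a 10 MB broadcast threshold (the values are illustrative, not taken from the thread):

```python
# Sketch of the selectivity double-counting scenario described above.
table_size = 100 * 1024 * 1024            # 100 MB table (hypothetical)
selectivity = 0.1                         # filter keeps ~10% of the data

# The data source already applied the pushed-down filter, so its
# sizeInBytes reports the post-filter size:
source_estimate = table_size * selectivity

# Catalyst, unaware the filter was applied, discounts by 0.1 again:
catalyst_estimate = source_estimate * selectivity

broadcast_threshold = 10 * 1024 * 1024    # 10 MB (hypothetical threshold)

# The optimizer would now broadcast, even though the relation is really
# 10x larger than its estimate:
assert catalyst_estimate < broadcast_threshold <= source_estimate
assert source_estimate / catalyst_estimate == 10.0
```

The assertions show exactly the failure mode Reynold describes: each layer's 10% discount is individually reasonable, but composed they understate the true size by an order of magnitude.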
Re: Improving metadata in Spark JIRA
I think we already have a YARN component: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN I don't think JIRA allows it to be mandatory, but if it does, that would be useful.

On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: By the way, isn't it possible to make the Component field mandatory when people open new issues? Shouldn't we do that? Btw Patrick, don't we need a YARN component? I think our JIRA components should roughly match the components on the PR dashboard. Nick

On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com wrote: Per Nick's suggestion I added two components: 1. Spark Submit 2. Spark Scheduler. I figured I would just add these since if we decide later we don't want them, we can simply merge them into Spark Core.

On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we need some new components to be added to the JIRA project? Like: - scheduler - YARN - spark-submit - ...? Nick

On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: +9000 on cleaning up JIRA. Thank you Sean for laying out some specific things to tackle. I will assist with this. Regarding email, I think Sandy is right. I only get JIRA email for issues I'm watching. Nick

On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com wrote: JIRA updates don't go to this list; they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy

On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote: I've wasted no time in wielding the commit bit to complete a number of small, uncontroversial changes. I wouldn't commit anything that didn't already appear to have review, consensus, and little risk, but please let me know if anything looked a little too bold, so I can calibrate. Anyway, I'd like to continue some small house-cleaning by improving the state of JIRA's metadata, in order to give us a little clearer view of what's happening in the project:
a. Add a Component to every (open) issue that's missing one
b. Review all Critical / Blocker issues and de-escalate ones that seem obviously neither
c. Correct open issues that list a Fix version that has already been released
d. Close all issues Resolved for a release that has already been released
The problem with doing so is that it will create a tremendous amount of email to the list, like, several hundred messages. It's possible to make bulk changes and suppress email though, which could be done for all but b. Better to suppress the emails when making such changes, or just not bother on some of these?
Re: Improving metadata in Spark JIRA
Oh derp, missed the YARN component. JIRA does allow admins to make fields mandatory: https://confluence.atlassian.com/display/JIRA/Specifying+Field+Behavior#SpecifyingFieldBehavior-Makingafieldrequiredoroptional Nick

On Sat Feb 07 2015 at 5:23:10 PM Patrick Wendell pwend...@gmail.com wrote: I think we already have a YARN component. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN I don't think JIRA allows it to be mandatory, but if it does, that would be useful.

On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: By the way, isn't it possible to make the Component field mandatory when people open new issues? Shouldn't we do that? Btw Patrick, don't we need a YARN component? I think our JIRA components should roughly match the components on the PR dashboard. Nick

On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com wrote: Per Nick's suggestion I added two components: 1. Spark Submit 2. Spark Scheduler. I figured I would just add these since if we decide later we don't want them, we can simply merge them into Spark Core.

On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we need some new components to be added to the JIRA project? Like: - scheduler - YARN - spark-submit - ...? Nick

On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: +9000 on cleaning up JIRA. Thank you Sean for laying out some specific things to tackle. I will assist with this. Regarding email, I think Sandy is right. I only get JIRA email for issues I'm watching. Nick

On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com wrote: JIRA updates don't go to this list; they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy

On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote: I've wasted no time in wielding the commit bit to complete a number of small, uncontroversial changes. I wouldn't commit anything that didn't already appear to have review, consensus, and little risk, but please let me know if anything looked a little too bold, so I can calibrate. Anyway, I'd like to continue some small house-cleaning by improving the state of JIRA's metadata, in order to give us a little clearer view of what's happening in the project:
a. Add a Component to every (open) issue that's missing one
b. Review all Critical / Blocker issues and de-escalate ones that seem obviously neither
c. Correct open issues that list a Fix version that has already been released
d. Close all issues Resolved for a release that has already been released
The problem with doing so is that it will create a tremendous amount of email to the list, like, several hundred messages. It's possible to make bulk changes and suppress email though, which could be done for all but b. Better to suppress the emails when making such changes, or just not bother on some of these?
Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware. This is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL. To make sure the right library is getting used, you have to make sure it's first on the search path; export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.

For some examples of getting netlib-java set up on an EC2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java. This way, you could probably get cuBLAS set up to be used by netlib-java as well. - Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS). Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right BLAS library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
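The backend-selection behavior being discussed (the JVM flag forcing F2jBLAS, versus letting natives be picked up from the library path) can be sketched as a priority-ordered loader with an explicit override. The class names below mirror netlib-java's real implementation names, but the loader logic is a simplified illustration written in Python, not the library's actual code:

```python
# Simplified sketch of a netlib-java-style backend resolution:
# an explicit override (the -Dcom.github.fommil.netlib.BLAS=... flag)
# wins; otherwise native implementations are tried in priority order,
# with the pure-Java f2j implementation as the guaranteed fallback.
def resolve_blas(override=None, available=()):
    priority = ["NativeSystemBLAS", "NativeRefBLAS", "F2jBLAS"]
    if override is not None:
        return override                  # forced via the JVM flag
    for impl in priority:
        # F2jBLAS (pure Java) always loads; natives only if installed
        if impl == "F2jBLAS" or impl in available:
            return impl
    raise RuntimeError("no BLAS implementation available")

# With OpenBLAS visible on the library path, the native wrapper wins:
print(resolve_blas(available={"NativeSystemBLAS"}))  # NativeSystemBLAS
# Forcing the Java implementation, as in the flag quoted above:
print(resolve_blas(override="F2jBLAS"))              # F2jBLAS
# No natives installed: silent fallback to f2j, often the perf surprise:
print(resolve_blas())                                # F2jBLAS
```

The last case is why checking which backend actually loaded matters: a missing native library does not fail, it silently degrades to the much slower pure-Java path.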
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph. I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100     | 0.00205596  | 0.03810324  | 0.002556    |
|1000x1000*1000x1000 | 0.018320947 | 0.51803557  | 1.638475459 |
|1x1*1x1             | 23.78046632 | 445.0935211 | 1569.233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native BLAS is so much slower than BIDMat MKL? Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out what will be better within Spark: BIDMat or netlib-java with CPU or GPU natives.
Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection; it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly
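As a rough sanity check on the "~10x faster" claim, the speedups implied by the timings Alexander reported for the largest matrix size (decimal commas written as periods; units as reported) work out as follows:

```python
# Speedup ratios from the benchmark timings quoted in this thread
# (largest matrix size in the table).
bidmat_mkl = 23.78046632
netlib_native = 445.0935211
netlib_f2j = 1569.233228

print(round(netlib_native / bidmat_mkl, 1))  # BIDMat+MKL vs native netlib
print(round(netlib_f2j / bidmat_mkl, 1))     # BIDMat+MKL vs pure-Java f2j
print(round(netlib_f2j / netlib_native, 1))  # native BLAS vs f2j
```

So for the largest size the gap is closer to 19x than 10x between BIDMat+MKL and netlib-java with a native BLAS, and the native BLAS itself is only about 3.5x faster than the pure-Java fallback, which is part of what makes the native result look suspicious.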
[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC3)
This vote passes with 5 +1 votes (3 binding) and no 0 or -1 votes.

+1 Votes:
Krishna Sankar
Sean Owen*
Chip Senkbeil
Matei Zaharia*
Patrick Wendell*

0 Votes: (none)

-1 Votes: (none)

On Fri, Feb 6, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote: I'll add a +1 as well.

On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei

On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1!
The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3/
Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1065/
The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
Changes from rc2: a single patch fixing a Windows issue.
Please vote on releasing this package as Apache Spark 1.2.1! The vote is open until Friday, February 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...
For a list of fixes in this release, see http://s.apache.org/Mpn. To learn more about Apache Spark, please see http://spark.apache.org/
Re: Data source API | sizeInBytes should be to *Scan
We thought about this today after seeing this email. I actually built a patch for this (adding filter/column information to data source stats estimation), but ultimately dropped it due to the potential problems the change could cause. The main problem I see is that column pruning/predicate pushdowns are advisory, i.e. the data source might or might not apply those filters. Without significantly complicating the data source API, it is hard for the optimizer (and future cardinality estimation) to know whether the filter/column pushdowns were actually applied, and whether to incorporate that in cardinality estimation.

Imagine this scenario: a data source applies a filter and estimates the filter's selectivity at 0.1, so the reported data size is reduced to 10% of the original. Catalyst's own cardinality estimation then applies the filter selectivity of 0.1 again, and the estimated data size becomes 1% of the original, lower than the broadcast threshold. Catalyst decides to broadcast the table, but the actual table size is 10x the estimate.

On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi Spark SQL committers. I have started experimenting with the data sources API and I was wondering if it makes sense to move the method sizeInBytes from BaseRelation to the Scan interfaces. This is because a relation may be able to leverage filter pushdown to estimate size, potentially making a very large relation broadcast-able. Thoughts? Aniket
Re: Improving metadata in Spark JIRA
By the way, isn't it possible to make the Component field mandatory when people open new issues? Shouldn't we do that? Btw Patrick, don't we need a YARN component? I think our JIRA components should roughly match the components on the PR dashboard https://spark-prs.appspot.com/. Nick

On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com wrote: Per Nick's suggestion I added two components: 1. Spark Submit 2. Spark Scheduler. I figured I would just add these since if we decide later we don't want them, we can simply merge them into Spark Core.

On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we need some new components to be added to the JIRA project? Like: - scheduler - YARN - spark-submit - ...? Nick

On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: +9000 on cleaning up JIRA. Thank you Sean for laying out some specific things to tackle. I will assist with this. Regarding email, I think Sandy is right. I only get JIRA email for issues I'm watching. Nick

On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com wrote: JIRA updates don't go to this list; they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy

On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote: I've wasted no time in wielding the commit bit to complete a number of small, uncontroversial changes. I wouldn't commit anything that didn't already appear to have review, consensus, and little risk, but please let me know if anything looked a little too bold, so I can calibrate. Anyway, I'd like to continue some small house-cleaning by improving the state of JIRA's metadata, in order to give us a little clearer view of what's happening in the project:
a. Add a Component to every (open) issue that's missing one
b. Review all Critical / Blocker issues and de-escalate ones that seem obviously neither
c. Correct open issues that list a Fix version that has already been released
d. Close all issues Resolved for a release that has already been released
The problem with doing so is that it will create a tremendous amount of email to the list, like, several hundred messages. It's possible to make bulk changes and suppress email though, which could be done for all but b. Better to suppress the emails when making such changes, or just not bother on some of these?
Re: [VOTE] Release Apache Spark 1.2.1 (RC3)
I'll add a +1 as well.

On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei

On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1!
The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3/
Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1065/
The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
Changes from rc2: a single patch fixing a Windows issue.
Please vote on releasing this package as Apache Spark 1.2.1! The vote is open until Friday, February 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...
For a list of fixes in this release, see http://s.apache.org/Mpn. To learn more about Apache Spark, please see http://spark.apache.org/
Re: Using CUDA within Spark / boosting linear algebra
Getting breeze to pick up the right BLAS library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph. I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100     | 0.00205596  | 0.03810324  | 0.002556    |
|1000x1000*1000x1000 | 0.018320947 | 0.51803557  | 1.638475459 |
|1x1*1x1             | 23.78046632 | 445.0935211 | 1569.233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native BLAS is so much slower than BIDMat MKL? Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for the explanation!
I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out what will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection; it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly optimized for single-node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be batched in that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes, e.g. 10 disk channels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets, on the order of terabytes. For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib; keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
- Evan [1] - http://arxiv.org/abs/1409.5402 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java? The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how these all might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning? Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 12:09 PM To: Ulanov, Alexander Cc: dev@spark.apache.org
Spark SQL Window Functions
Currently there's no standard way of handling time series data in Spark. We were kicking around some ideas in the lab today and one thing that came up was SQL Window Functions as a way to support them and query over time series (do things like moving average, etc.) These don't seem to be implemented in Spark SQL yet, but there's some discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587) asking for them, and there have also been a couple of pull requests - https://github.com/apache/spark/pull/3703 and https://github.com/apache/spark/pull/2953. Is any work currently underway here?
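To make the use case concrete: a moving average over a time series is exactly what a SQL window specification such as AVG(value) OVER (ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) computes. The same computation, sketched in plain Python over a made-up series (the data is purely illustrative):

```python
# A windowed moving average: for each row, average the current value
# and up to (window - 1) preceding values -- the frame a SQL window
# function with ROWS BETWEEN 2 PRECEDING AND CURRENT ROW would use.
def moving_average(values, window=3):
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]  # preceding + current
        out.append(sum(frame) / len(frame))
    return out

series = [10.0, 12.0, 11.0, 13.0, 15.0]
print(moving_average(series))  # [10.0, 11.0, 11.0, 12.0, 13.0]
```

Note the frame shrinks at the start of the series (the first row averages only itself), which matches the SQL semantics where the frame is clipped at the partition boundary.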
Pull Requests on github
Hi all, I'm the author of netlib-java and I noticed that the documentation in MLlib was out of date and misleading, so I submitted a pull request on GitHub which will hopefully make it easier for everybody to understand the benefits of system-optimised natives and how to use them :-) https://github.com/apache/spark/pull/4448 However, it looks like there are a *lot* of outstanding PRs and that this is just a mirror repository. Will somebody please look at my PR and merge it into the canonical source (and let me know)? Best regards, Sam
RE: Using CUDA within Spark / boosting linear algebra
Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS). Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right BLAS library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Evan, Joseph. I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100     | 0.00205596  | 0.03810324  | 0.002556    |
|1000x1000*1000x1000 | 0.018320947 | 0.51803557  | 1.638475459 |
|1x1*1x1             | 23.78046632 | 445.0935211 | 1569.233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will make tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native BLAS is so much slower than BIDMat MKL? Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out what will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection; it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra.
I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly optimized for single node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in that way) the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes - e.g. 10 disk cahnnels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - order of terabytes. For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python-users, though. - Evan [1] - http://arxiv.org/abs/1409.5402 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf On Thu, Feb 5, 2015 at
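The JVM-flag approach Alexander mentions can be sketched in plain Java. The property name and the F2jBLAS fallback class are quoted from the thread; the helper class and method here are hypothetical illustrations of how such an override is set, not netlib-java API.

```java
// Sketch: netlib-java selects its BLAS backend via a JVM system property,
// as in the flag -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS
// quoted above. BlasSelection is a hypothetical helper; only the property
// name and the F2jBLAS class name come from the thread.
public class BlasSelection {
    static final String BLAS_PROPERTY = "com.github.fommil.netlib.BLAS";

    // Returns the implementation class the library would be asked to load,
    // or null when no override is set (it then falls back to its own search).
    static String requestedBackend() {
        return System.getProperty(BLAS_PROPERTY);
    }

    public static void main(String[] args) {
        // Equivalent of passing the -D flag on the command line.
        System.setProperty(BLAS_PROPERTY, "com.github.fommil.netlib.F2jBLAS");
        System.out.println(requestedBackend());
    }
}
```

Setting the property programmatically only works before the BLAS class is first loaded, so in practice the -D command-line flag is the safer route.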
Re: [VOTE] Release Apache Spark 1.2.1 (RC3)
+1 Tested on Mac OS X. Matei

On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2: a single patch fixing a Windows issue.

Please vote on releasing this package as Apache Spark 1.2.1! The vote is open until Friday, February 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn. To learn more about Apache Spark, please see http://spark.apache.org/

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.2.1 (RC3)
Should we merge this commit into branch-1.2 too? https://github.com/apache/spark/commit/2483c1efb6429a7d8a20c96d18ce2fec93a1aff9
Unit tests
Hey All,

The tests are in a not-amazing state right now due to a few compounding factors:

1. We've merged a large volume of patches recently.
2. The load on Jenkins has been relatively high, exposing races and other behavior not seen at lower load.

For those not familiar, the main issue is flaky (non-deterministic) test failures. Right now I'm trying to prioritize keeping the PullRequestBuilder in good shape, since it will block development if it is down. For other tests, let's try to keep filing JIRAs when we see issues and use the flaky-test label (see http://bit.ly/1yRif9S). I may contact people regarding specific tests. Getting these into good shape is a very high priority. This kind of thing is no one's fault but just the result of a lot of concurrent development, and everyone needs to pitch in to get back to a good place.

- Patrick
RE: Using CUDA within Spark / boosting linear algebra
Hi Evan, Joseph, I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100     | 0,00205596  | 0,03810324  | 0,002556    |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557  | 1,638475459 |
|1x1*1x1             | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will run tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze-netlib with native BLAS is so much slower than BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly optimized for single-node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or cannot be batched that way), the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes. For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Evan, Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java? The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 12:09 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases. You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
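For readers who want to reproduce the shape of the timing runs behind the table above, here is a minimal, self-contained sketch. A naive triple loop stands in for BLAS dgemm so the example runs without Breeze, netlib-java, or BIDMat installed; real benchmarks should call those libraries, and the matrix size and class name here are illustrative.

```java
// Sketch of a square-matrix-multiply timing harness. The ikj loop order is a
// stand-in for dgemm; it is far slower than any tuned BLAS, which is exactly
// the gap the benchmarks in this thread measure.
public class GemmTiming {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                for (int j = 0; j < n; j++) {
                    c[i][j] += aik * b[k][j];  // ikj order: row-wise access of b and c
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 200;  // illustrative size; the thread used 100 to 10000+
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = i + j; b[i][j] = i - j; }
        long t0 = System.nanoTime();
        multiply(a, b);
        System.out.printf("%dx%d * %dx%d took %.4f s%n",
                n, n, n, n, (System.nanoTime() - t0) / 1e9);
    }
}
```

Swapping the body of multiply for a Breeze or netlib-java call while keeping the same harness gives an apples-to-apples comparison.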
Re: Using CUDA within Spark / boosting linear algebra
Lemme butt in randomly here and say there is an interesting discussion on this Spark PR https://github.com/apache/spark/pull/4448 about netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all may find interesting. Among the participants is the author of netlib-java.

On Sun Feb 08 2015 at 2:48:19 AM Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Evan, Joseph, I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size            | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100     | 0,00205596  | 0,03810324  | 0,002556    |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557  | 1,638475459 |
|1x1*1x1             | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will run tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze-netlib with native BLAS is so much slower than BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and comparisons with Spark MLlib. I am very interested to find out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 1:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code.[1] BIDMach is highly optimized for single-node performance or performance on small clusters.[2] Once data doesn't fit easily in GPU memory (or cannot be batched that way), the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs. In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes. For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Evan, Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java? The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t you take BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, February 05, 2015 12:09 PM To: Ulanov,
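Joseph's point above about amortizing host-to-GPU data movement can be made concrete with a back-of-the-envelope cost model: doing k local iterations per transfer spreads the transfer cost across them. The class, method, and all numbers here are illustrative assumptions, not measurements.

```java
// Toy cost model for amortizing host<->GPU transfers across local iterations.
// perIterationCost is the average wall time per iteration when one data
// transfer serves k consecutive iterations on the device.
public class GpuAmortization {
    static double perIterationCost(double transferSec, double computeSec, int k) {
        return transferSec / k + computeSec;
    }

    public static void main(String[] args) {
        // Assume (hypothetically) a 1 s transfer and 0.1 s of GPU compute per iteration.
        System.out.println(perIterationCost(1.0, 0.1, 1));   // transfer every iteration
        System.out.println(perIterationCost(1.0, 0.1, 10));  // transfer amortized over 10
    }
}
```

With these assumed numbers, ten local iterations per transfer cut the per-iteration cost from 1.1 s to 0.2 s, which is why keeping data resident on the GPU between iterations matters so much.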
Re: Pull Requests on github
You can open a JIRA issue pointing to this PR to get it processed faster. :)

Thanks, Best Regards

On Sat, Feb 7, 2015 at 7:07 AM, fommil sam.halli...@gmail.com wrote:

Hi all, I'm the author of netlib-java and I noticed that the documentation in MLlib was out of date and misleading, so I submitted a pull request on GitHub which will hopefully make things easier for everybody to understand the benefits of system-optimised natives and how to use them :-) https://github.com/apache/spark/pull/4448

However, it looks like there are a *lot* of outstanding PRs and that this is just a mirror repository. Will somebody please look at my PR and merge it into the canonical source (and let me know)?

Best regards, Sam
Re: Spark SQL Window Functions
This is the original ticket: https://issues.apache.org/jira/browse/SPARK-1442 I believe it will happen, one way or another :)

On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

Currently there's no standard way of handling time series data in Spark. We were kicking around some ideas in the lab today, and one thing that came up was SQL window functions as a way to support time series and query over them (to do things like moving averages, etc.). These don't seem to be implemented in Spark SQL yet, but there's some discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587) asking for them, and there have also been a couple of pull requests - https://github.com/apache/spark/pull/3703 and https://github.com/apache/spark/pull/2953. Is any work currently underway here?
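As a sketch of what the moving-average use case computes, here is the trailing-window average in plain Java that a SQL window function such as AVG(x) OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) would express once window functions land in Spark SQL. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Trailing moving average over up to `window` rows ending at the current row,
// i.e. the per-row computation a ROWS-based window frame performs.
public class MovingAverage {
    static double[] movingAverage(double[] xs, int window) {
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            int start = Math.max(0, i - window + 1);  // frame shrinks at the start
            double sum = 0;
            for (int j = start; j <= i; j++) sum += xs[j];
            out[i] = sum / (i - start + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(movingAverage(new double[]{1, 2, 3, 4}, 3)));
        // prints [1.0, 1.5, 2.0, 3.0]
    }
}
```

A window-function implementation would let this run distributed over a partitioned, ordered time series instead of a local array.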
Re: Welcoming three new committers
Congratulations guys! Keep helping this awesome community.

BR, Jacky Li

- Sent from Smartisan T1 -

On Feb 4, 2015, at 6:36 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces throughout Spark Core. Join me in welcoming them as committers!

Matei