Re: Eliminate copy while sending data : any Akka experts here ?
In our clusters, the number of containers we can get is high but memory per container is low, which is why avg_nodes_not_hosting_data is rarely zero for ML tasks :-)

To update: to unblock our current implementation efforts, we went with broadcast, since it is intuitively easier and a minimal change, and we compress the array as bytes in TaskResult. This is then stored in disk-backed maps to remove memory pressure on master and workers (else MapOutputTracker becomes a memory hog).

But I agree, a compressed bitmap to represent 'large' blocks (anything larger than maxBytesInFlight, actually), and probably an existing one to track non-zero blocks, should be fine (we should not really track zero output for a reducer - just a waste of space).

Regards,
Mridul

On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin r...@databricks.com wrote:

Note that in my original proposal, I was suggesting we could track whether block size == 0 using a compressed bitmap. That way we can still avoid requests for zero-sized blocks.

On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin r...@databricks.com wrote:

Yes, that number is likely == 0 in any real workload ...

On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan mri...@gmail.com wrote:

On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin r...@databricks.com wrote:

On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan mri...@gmail.com wrote:

The other thing we do need is the location of blocks. This is actually just O(n) because we just need to know where the map was run.

For well partitioned data, won't this involve a lot of unwanted requests to nodes which are not hosting data for a reducer (and a lack of ability to throttle)?

Was that a question? (I'm guessing it is). What do you mean exactly?

I was not sure if I understood the proposal correctly - hence the query: if I understood it right, the number of wasted requests goes up by num_reducers * avg_nodes_not_hosting_data. Of course, if avg_nodes_not_hosting_data == 0, then we are fine!

Regards,
Mridul
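The compressed-bitmap idea discussed in this thread -- record which blocks are non-empty so reducers never issue fetch requests for zero-sized blocks -- can be sketched as follows. This is a minimal illustration, not Spark's actual MapStatus code; all names are hypothetical, and `java.util.BitSet` stands in for a compressed bitmap (e.g. RoaringBitmap) to keep the example self-contained.

```java
import java.util.BitSet;

// Hypothetical sketch: per map task, mark which reducer blocks are
// non-empty. A reducer consults the bitmap before fetching, so
// zero-sized blocks generate no network requests at all.
public class MapStatusSketch {
    private final BitSet nonEmpty;

    public MapStatusSketch(int numReducers) {
        this.nonEmpty = new BitSet(numReducers);
    }

    // Called once per reducer partition when the map output is written.
    public void recordBlockSize(int reducerId, long size) {
        if (size > 0) nonEmpty.set(reducerId);
    }

    // A reducer only issues a fetch if its block is marked non-empty.
    public boolean shouldFetch(int reducerId) {
        return nonEmpty.get(reducerId);
    }

    public static void main(String[] args) {
        MapStatusSketch status = new MapStatusSketch(4);
        status.recordBlockSize(0, 1024L);
        status.recordBlockSize(2, 10L);
        // Reducers 1 and 3 produced no output, so no fetch is sent for them.
        System.out.println(status.shouldFetch(0)); // true
        System.out.println(status.shouldFetch(1)); // false
    }
}
```

With well-partitioned data this is exactly the case the thread worries about: most map outputs for a given reducer are empty, so the bitmap avoids num_reducers * avg_nodes_not_hosting_data wasted requests.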
Re: PLSA
Hi, Deb.

I don't quite understand the question. PLSA is an instance of the matrix factorization problem. If you are asking about the inference algorithm, we use the EM algorithm. A description of this approach is, for example, here: http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

Best,
Denis.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/PLSA-tp7170p7179.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: PLSA
Thanks for the pointer... Looks like you are using an EM algorithm for factorization, which looks similar to multiplicative update rules.

Do you think that using mllib ALS implicit feedback, you could scale the problem further? We can handle L1, L2, equality, and positivity constraints in ALS now... As long as you can find the gradient and Hessian from the KL divergence loss, you can use that in place of the gram matrix that is used in ALS right now.

If you look at the topic modeling work in Solr (Carrot is the package), they use ALS to generate the topics... that algorithm looks like a simplified version of what you are attempting here...

Maybe the EM algorithm for topic modeling is more efficient than ALS, but from looking at it I don't see how... I see a lot of broadcasts... while in implicit feedback you need one broadcast of the gram matrix...

On Fri, Jul 4, 2014 at 4:27 AM, Denis Turdakov turda...@ispras.ru wrote:

Hi, Deb. I don't quite understand the question. PLSA is an instance of the matrix factorization problem. If you are asking about the inference algorithm, we use the EM algorithm. Description of this approach is, for example, here: http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

Best, Denis.
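The resemblance noted above -- EM updates for PLSA looking like multiplicative update rules -- can be illustrated concretely with the classic multiplicative updates for matrix factorization under a KL-divergence loss. This is a hedged, single-machine sketch of that standard technique (Lee-Seung style KL-NMF), not the distributed implementation being discussed; all class and method names are made up for the example.

```java
// Illustrative KL-divergence NMF: factor V ~= W * H (all entries > 0)
// using multiplicative updates. The generalized KL divergence
// D(V || WH) decreases monotonically under these updates.
public class KlNmf {

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, k = a[0].length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int r = 0; r < k; r++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][r] * b[r][j];
        return c;
    }

    // One multiplicative update pass over W, then H.
    public static void step(double[][] v, double[][] w, double[][] h) {
        int n = v.length, m = v[0].length, k = w[0].length;
        double[][] wh = multiply(w, h);
        // W_ir *= sum_j(H_rj * V_ij / WH_ij) / sum_j(H_rj)
        for (int i = 0; i < n; i++)
            for (int r = 0; r < k; r++) {
                double num = 0, den = 0;
                for (int j = 0; j < m; j++) {
                    num += h[r][j] * v[i][j] / wh[i][j];
                    den += h[r][j];
                }
                w[i][r] *= num / den;
            }
        wh = multiply(w, h);
        // H_rj *= sum_i(W_ir * V_ij / WH_ij) / sum_i(W_ir)
        for (int r = 0; r < k; r++)
            for (int j = 0; j < m; j++) {
                double num = 0, den = 0;
                for (int i = 0; i < n; i++) {
                    num += w[i][r] * v[i][j] / wh[i][j];
                    den += w[i][r];
                }
                h[r][j] *= num / den;
            }
    }

    // Generalized KL divergence D(V || WH); lower is better.
    public static double kl(double[][] v, double[][] w, double[][] h) {
        double[][] wh = multiply(w, h);
        double d = 0;
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < v[0].length; j++)
                d += v[i][j] * Math.log(v[i][j] / wh[i][j]) - v[i][j] + wh[i][j];
        return d;
    }

    public static void main(String[] args) {
        double[][] v = {{1, 2, 3}, {2, 4, 6}, {3, 6, 9}}; // rank-1 data
        double[][] w = {{1}, {1}, {1}};
        double[][] h = {{1, 1, 1}};
        for (int t = 0; t < 20; t++) step(v, w, h);
        System.out.println("KL after 20 steps: " + kl(v, w, h)); // ~0 for rank-1 data
    }
}
```

In a distributed ALS-style setting, the inner sums are what would be shipped around; the point made above is that implicit-feedback ALS needs only one broadcast of the gram matrix per pass, whereas these EM-style updates touch V/(WH) everywhere.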
Re: Constraint Solver for Spark
I looked further and realized that ECOS uses a mex file while PDCO uses pure Matlab code, so the out-of-the-box runtime comparison is not fair. I am trying to generate a PDCO C port. Like ECOS, PDCO also makes use of the sparse-matrix support from Tim Davis.

Thanks.
Deb
Invalid link for Spark 1.0.0 in Official Web Site
Hi,

I found an invalid link on http://spark.apache.org/downloads.html. The link for the release notes of Spark 1.0.0 points to http://spark.apache.org/releases/spark-release-1.0.0.html, but this link is invalid. I think it is a mistake for http://spark.apache.org/releases/spark-release-1-0-0.html.

Thanks,
Kousuke
[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC1)
This vote is cancelled in favor of RC2. Thanks to everyone who voted.

On Sun, Jun 29, 2014 at 11:23 PM, Andrew Ash and...@andrewash.com wrote:

Ok, that's reasonable -- it's certainly more of an enhancement than a critical bug fix. I would like to get this in for 1.1.0 though, so let's talk through the right way to do that on the PR. In the meantime the best alternative is running with lax firewall settings, which can be somewhat mitigated by modifying the ephemeral port range. Thanks! Andrew

On Sun, Jun 29, 2014 at 11:14 PM, Reynold Xin r...@databricks.com wrote:

Hi Andrew,

The port stuff is great to have, but they are pretty big changes to the core that introduce new features and are not exactly fixing important bugs. For this reason, it probably can't block a release (I'm not even sure if it should go into a maintenance release where we fix critical bugs for Spark core). We should definitely include them for 1.1.0 though (~Aug).

On Sun, Jun 29, 2014 at 11:09 PM, Andrew Ash and...@andrewash.com wrote:

Thanks for helping shepherd the voting on 1.0.1, Patrick.

I'd like to call attention to https://issues.apache.org/jira/browse/SPARK-2157 and https://github.com/apache/spark/pull/1107 -- Ability to write tight firewall rules for Spark.

I'm currently unable to run Spark on some projects because our cloud ops team is uncomfortable with the firewall situation around Spark at the moment. Currently Spark starts listening on random ephemeral ports and does server-to-server communication on them. This keeps the team from writing tight firewall rules between the services -- they get real queasy when asked to open inbound connections to the entire ephemeral port range of a cluster. We can tighten the size of the ephemeral range using kernel settings to mitigate the issue, but it doesn't actually solve the problem.

The PR above aims to make every listening port on JVMs in a Spark standalone cluster configurable with an option.
If not set, the current behavior stands (start listening on an ephemeral port). Is this something the Spark team would consider merging into 1.0.1?

Thanks! Andrew

On Sun, Jun 29, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

We're going to move onto another RC because of this vote. Unfortunately, with the summit activities I haven't been able to usher in the necessary patches and cut the RC. I will do so as soon as possible and then we can commence official voting.

- Patrick

On Sun, Jun 29, 2014 at 4:56 PM, Reynold Xin r...@databricks.com wrote:

We should make sure we include the following two patches:
https://github.com/apache/spark/pull/1264
https://github.com/apache/spark/pull/1263

On Fri, Jun 27, 2014 at 8:39 PM, Krishna Sankar ksanka...@gmail.com wrote:

+1
Compiled for CentOS 6.5, deployed in our 4-node cluster (Hadoop 2.2, YARN).
Smoke tests (SparkPi, spark-shell, web UI) successful.

Cheers
k/

On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1020/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.0.1!

The vote is open until Monday, June 30, at 03:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/

=== About this release ===

This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to Akka frame size
SPARK-1790: Support r3 instance types on EC2

This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
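For reference on the firewall discussion in this thread: the approach in SPARK-2157 / PR 1107 is to give each listening service a fixed, configurable port, so firewall rules can name exact ports instead of opening the whole ephemeral range. A sketch of what such a spark-defaults configuration might look like -- the property names are illustrative of the 1.1-era settings and may differ from what was finally merged, and the port numbers are arbitrary:

```properties
# Hypothetical sketch: pin every listening port so firewall rules
# can be written narrowly. If a property is unset, the service
# falls back to an ephemeral port (the pre-PR behavior).
spark.driver.port            7001
spark.fileserver.port        7002
spark.broadcast.port         7003
spark.replClassServer.port   7004
spark.blockManager.port      7005
spark.executor.port          7006
```

With every port pinned, an ops team can allow inbound traffic only on the listed ports between cluster nodes rather than the entire ephemeral range.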
[VOTE] Release Apache Spark 1.0.1 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc2 (commit 7d1043c):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1021/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.0.1!

The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

=== Differences from RC1 ===

This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255

There are also smaller fixes which came in over the last week.

=== About this release ===

This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to Akka frame size
SPARK-1790: Support r3 instance types on EC2

This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
2nd Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)
(Apologies for cross-posting)

2nd Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)
http://wssspe.researchcomputing.org.uk/wssspe2/
(to be held in conjunction with SC14, Sunday, 16 November 2014, New Orleans, LA, USA)

Progress in scientific research is dependent on the quality and accessibility of software at all levels, and it is critical to address challenges related to the development, deployment, and maintenance of reusable software as well as education around software practices. These challenges can be technological, policy based, organizational, and educational, and are of interest to developers (the software community), users (science disciplines), and researchers studying the conduct of science (science of team science, science of organizations, science of science and innovation policy, and social science communities).

The WSSSPE1 workshop (http://wssspe.researchcomputing.org.uk/WSSSPE1) engaged the broad scientific community to identify challenges and best practices in areas of interest for sustainable scientific software. At WSSSPE2, we invite the community to propose and discuss specific mechanisms to move towards an imagined future practice of software development and usage in science and engineering. The workshop will include multiple mechanisms for participation, encourage team building around solutions, and identify risky solutions with potentially transformative outcomes. Participation by early-career students and postdoctoral researchers is strongly encouraged.

We invite short (4-page) actionable papers that will lead to improvements for sustainable software science. These papers could be a call to action, or could provide position or experience reports on sustainable software activities. The papers will be used by the organizing committee to design sessions that will be highly interactive and targeted towards facilitating action. Submitted papers should be archived by a third-party service that provides DOIs.
We encourage submitters to license their papers under a Creative Commons license that encourages sharing and remixing, as we will combine ideas (with attribution) into the outcomes of the workshop. The organizers will invite one or more submitters of provocative papers to start the workshop by presenting highlights of their papers in a keynote presentation to initiate active discussion that will continue throughout the day.

Areas of interest for WSSSPE2 include, but are not limited to:

- defining software sustainability in the context of science and engineering software
- how to evaluate software sustainability
- improving the development process that leads to new software
- methods to develop sustainable software from the outset
- effective approaches to reusable software created as a by-product of research
- impact of computer science research on the development of scientific software
- recommendations for the support and maintenance of existing software
- software engineering best practices
- governance, business, and sustainability models
- the role of community software repositories, their operation and sustainability
- reproducibility and transparency needs that may be unique to science
- successful open source software implementations
- incentives for using and contributing to open source software
- transitioning users into contributing developers
- building large and engaged user communities
- developing strong advocates
- measurement of usage and impact
- encouraging industry's role in sustainability
- engagement of industry with volunteer communities
- incentives for industry
- incentives for community to contribute to industry-driven projects
- recommending policy changes
- software credit, attribution, incentive, and reward
- issues related to multiple organizations and multiple countries, such as intellectual property, licensing, etc.
- mechanisms and venues for publishing software, and the role of publishers
- improving education and training
- best practices for providing graduate students and postdoctoral researchers in domain communities with sufficient training in software development
- novel uses of sustainable software in education (K-20)
- case studies from students on issues around software development in the undergraduate or graduate curricula
- careers and profession
- successful examples of career paths for developers
- institutional changes to support sustainable software, such as promotion and tenure metrics, job categories, etc.

Submissions:

Submissions of up to four pages should be formatted to be easily readable and submitted to an open access repository that provides unique identifiers (e.g., DOIs) that can be cited, for example http://arXiv.org or http://figshare.com. Once you have received an identifier for your self-published paper from a repository, submit it to WSSSPE2 by creating a new submission at