Re: Eliminate copy while sending data : any Akka experts here ?

2014-07-04 Thread Mridul Muralidharan
In our clusters, the number of containers we can get is high but the
memory per container is low, which is why avg_nodes_not_hosting_data is
rarely zero for ML tasks :-)

To update: to unblock our current implementation efforts, we went
with broadcast - since it is intuitively easier and a minimal change -
and compress the array as bytes in the TaskResult.
This is then stored in disk-backed maps, to remove memory pressure on
the master and workers (else the MapOutputTracker becomes a memory hog).
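
For illustration, a minimal sketch of that byte-level compression (the
helper name and the choice of GZIP are assumptions for the sketch, not
the actual patch):

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPOutputStream

    // Compress a serialized size array before shipping it in a
    // TaskResult; the reader side reverses this with GZIPInputStream.
    def compress(payload: Array[Byte]): Array[Byte] = {
      val bytes = new ByteArrayOutputStream()
      val gzip = new GZIPOutputStream(bytes)
      try gzip.write(payload) finally gzip.close()
      bytes.toByteArray
    }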

But I agree, a compressed bitmap to represent 'large' blocks (anything
larger than maxBytesInFlight, actually), and probably an existing one to
track non-zero blocks, should be fine (we should not really track zero
output for a reducer - that is just a waste of space).
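
A minimal sketch of that per-map-status tracking, using java.util.BitSet
purely as a stand-in (a real implementation would want a compressed
bitmap; the class and method names here are made up for the sketch):

    import java.util.BitSet

    // One bit per reducer: set iff this map task produced non-zero
    // output for that reducer, so reducers can skip empty fetches.
    class MapStatusSketch(numReducers: Int) {
      private val nonEmpty = new BitSet(numReducers)
      def markNonEmpty(reducerId: Int): Unit = nonEmpty.set(reducerId)
      def hasOutputFor(reducerId: Int): Boolean = nonEmpty.get(reducerId)
    }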


Regards,
Mridul

On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin r...@databricks.com wrote:
 Note that in my original proposal, I was suggesting we could track whether
 block size == 0 using a compressed bitmap. That way we can still avoid
 requests for zero-sized blocks.



 On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin r...@databricks.com wrote:

 Yes, that number is likely == 0 in any real workload ...


 On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan mri...@gmail.com
 wrote:

 On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin r...@databricks.com wrote:
  On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
 
  
   The other thing we do need is the location of blocks. This is actually
   just O(n) because we just need to know where the map was run.
 
  For well-partitioned data, won't this involve a lot of unwanted
  requests to nodes which are not hosting data for a reducer (and a lack
  of ability to throttle)?
 
 
  Was that a question? (I'm guessing it is). What do you mean exactly?


  I was not sure if I understood the proposal correctly - hence the
  query: if I understood it right, the number of wasted requests goes
  up by num_reducers * avg_nodes_not_hosting_data.
 
  Of course, if avg_nodes_not_hosting_data == 0, then we are fine!

 Regards,
 Mridul





Re: PLSA

2014-07-04 Thread Denis Turdakov
Hi, Deb.

I don't quite understand the question. PLSA is an instance of a matrix
factorization problem.

If you are asking about the inference algorithm, we use the EM algorithm.
A description of this approach is, for example, here:
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf
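
(In brief, and as a sketch of the usual formulation rather than a quote
from the paper: with $\phi_{wt}$ the word-topic weights, $\theta_{td}$ the
topic-document weights, and $n_{dw}$ the count of word $w$ in document $d$,
the EM iteration is

$$p(t \mid d, w) = \frac{\phi_{wt}\,\theta_{td}}{\sum_{s} \phi_{ws}\,\theta_{sd}} \quad \text{(E-step)}$$

$$\phi_{wt} \propto \sum_{d} n_{dw}\, p(t \mid d, w), \qquad
\theta_{td} \propto \sum_{w} n_{dw}\, p(t \mid d, w) \quad \text{(M-step)}$$

repeated until the factorization converges.)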


Best, Denis.





Re: PLSA

2014-07-04 Thread Debasish Das
Thanks for the pointer...

Looks like you are using the EM algorithm for factorization, which looks
similar to multiplicative update rules.

Do you think that using MLlib's implicit-feedback ALS you could scale the
problem further?

We can handle L1, L2, equality, and positivity constraints in ALS now... As
long as you can find the gradient and Hessian of the KL-divergence loss, you
can use that in place of the Gram matrix that is used in ALS right now.
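
(As a sketch with the generalized KL divergence - the textbook form, not
necessarily the exact loss in the paper: for
$$D(A \,\|\, WH) = \sum_{ij} \Big( A_{ij} \log \frac{A_{ij}}{(WH)_{ij}} - A_{ij} + (WH)_{ij} \Big),$$
the gradient with respect to $H$ is
$$\nabla_H D = W^{\top} \big( \mathbf{1} - A \oslash (WH) \big),$$
with $\oslash$ denoting elementwise division; this is what would stand in
for the Gram-matrix-based normal equations.)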

If you look at the topic modeling work in Solr (Carrot is the package),
they use ALS to generate the topics... That algorithm looks like a
simplified version of what you are attempting here...

Maybe the EM algorithm for topic modeling is more efficient than ALS, but
from looking at it I don't see how... I see a lot of broadcasts... while in
implicit feedback you need one broadcast of the Gram matrix...
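
For reference, a hedged sketch of the MLlib implicit-feedback call being
suggested (API as of Spark 1.0; the rank/iteration/lambda/alpha values are
placeholders, not recommendations):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // counts: (docId, wordId, n_dw) reinterpreted as implicit "ratings"
    def factorize(counts: RDD[Rating]) =
      ALS.trainImplicit(counts, /* rank */ 20, /* iterations */ 10,
        /* lambda */ 0.01, /* alpha */ 40.0)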

On Fri, Jul 4, 2014 at 4:27 AM, Denis Turdakov turda...@ispras.ru wrote:

 Hi, Deb.

 I don't quite understand the question. PLSA is an instance of a matrix
 factorization problem.

 If you are asking about the inference algorithm, we use the EM algorithm.
 A description of this approach is, for example, here:
 http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf


 Best, Denis.






Re: Constraint Solver for Spark

2014-07-04 Thread Debasish Das
I looked further and realized that ECOS uses a MEX file while PDCO uses
pure MATLAB code, so the out-of-the-box runtime comparison is not fair.

I am trying to generate a C port of PDCO. Like ECOS, PDCO also makes use of
Tim Davis's sparse matrix support.
Thanks.
Deb


Invalid link for Spark 1.0.0 in Official Web Site

2014-07-04 Thread Kousuke Saruta
Hi,

I found an invalid link on http://spark.apache.org/downloads.html .
The link for the Spark 1.0.0 release notes points to
http://spark.apache.org/releases/spark-release-1.0.0.html , but this link
is broken. I think it is a mistake for
http://spark.apache.org/releases/spark-release-1-0-0.html .

Thanks,
Kousuke




[RESULT] [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-07-04 Thread Patrick Wendell
This vote is cancelled in favor of RC2. Thanks to everyone who voted.

On Sun, Jun 29, 2014 at 11:23 PM, Andrew Ash and...@andrewash.com wrote:
 Ok that's reasonable -- it's certainly more of an enhancement than a
 critical bug-fix.  I would like to get this in for 1.1.0 though, so let's
 talk through the right way to do that on the PR.

 In the meantime the best alternative is running with lax firewall settings;
 the risk can be somewhat mitigated by narrowing the ephemeral port range.

 Thanks!
 Andrew


 On Sun, Jun 29, 2014 at 11:14 PM, Reynold Xin r...@databricks.com wrote:

 Hi Andrew,

  The port stuff is great to have, but those are pretty big changes to the
  core that introduce new features rather than fix important bugs. For this
  reason, it probably can't block a release (I'm not even sure it should go
  into a maintenance release, where we fix critical bugs for Spark core).

 We should definitely include them for 1.1.0 though (~Aug).




 On Sun, Jun 29, 2014 at 11:09 PM, Andrew Ash and...@andrewash.com wrote:

   Thanks for helping shepherd the voting on 1.0.1, Patrick.
 
  I'd like to call attention to
  https://issues.apache.org/jira/browse/SPARK-2157 and
  https://github.com/apache/spark/pull/1107 -- Ability to write tight
  firewall rules for Spark
 
   I'm currently unable to run Spark on some projects because our cloud ops
   team is uncomfortable with the firewall situation around Spark at the
   moment.  Currently Spark starts listening on random ephemeral ports and
   does server-to-server communication on them.  This keeps the team from
   writing tight firewall rules between the services -- they get real queasy
   when asked to open inbound connections to the entire ephemeral port range
   of a cluster.  We can tighten the size of the ephemeral range using
   kernel settings to mitigate the issue, but it doesn't actually solve the
   problem.
  
   The PR above aims to make every listening port on JVMs in a Spark
   standalone cluster configurable with an option.  If not set, the current
   behavior stands (start listening on an ephemeral port).  Is this
   something the Spark team would consider merging into 1.0.1?
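  
   For illustration only, a sketch of what pinning ports via configuration
   might look like with such a patch (spark.driver.port exists today; the
   second property name is an assumption for the sketch, not necessarily
   what the PR adds):
  
       import org.apache.spark.SparkConf
  
       val conf = new SparkConf()
         .set("spark.driver.port", "7001")        // existing property
         .set("spark.blockManager.port", "7005")  // hypothetical option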
 
  Thanks!
  Andrew
 
 
 
  On Sun, Jun 29, 2014 at 10:54 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey All,
  
   We're going to move onto another rc because of this vote.
   Unfortunately with the summit activities I haven't been able to usher
   in the necessary patches and cut the RC. I will do so as soon as
   possible and we can commence official voting.
  
   - Patrick
  
   On Sun, Jun 29, 2014 at 4:56 PM, Reynold Xin r...@databricks.com
  wrote:
We should make sure we include the following two patches:
   
https://github.com/apache/spark/pull/1264
   
https://github.com/apache/spark/pull/1263
   
   
   
   
    On Fri, Jun 27, 2014 at 8:39 PM, Krishna Sankar ksanka...@gmail.com
    wrote:
   
 +1
 Compiled for CentOS 6.5, deployed in our 4-node cluster (Hadoop 2.2, YARN).
 Smoke tests (SparkPi, spark-shell, web UI) successful.
   
Cheers
k/
   
   
 On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell pwend...@gmail.com wrote:
   
  Please vote on releasing the following candidate as Apache Spark version
  1.0.1!

  The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962

  The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1020/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.0.1!

 The vote is open until Monday, June 30, at 03:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 === About this release ===
  This release fixes a few high-priority bugs in 1.0 and has a variety
  of smaller fixes. The full list is here: http://s.apache.org/b45. Some
  of the more visible patches are:

  SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
  SPARK-2156 and SPARK-1112: Issues with jobs hanging due to Akka frame
  size.
  SPARK-1790: Support r3 instance types on EC2.

  This is the first maintenance release on the 1.0 line. We plan to make
  additional maintenance releases as new fixes come in.
  

[VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-04 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc2 (commit 7d1043c):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1021/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.0.1!

The vote is open until Monday, July 07, at 20:45 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

=== Differences from RC1 ===
This release includes only one blocking patch from rc1:
https://github.com/apache/spark/pull/1255

There are also smaller fixes which came in over the last week.

=== About this release ===
This release fixes a few high-priority bugs in 1.0 and has a variety
of smaller fixes. The full list is here: http://s.apache.org/b45. Some
of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to Akka frame size.
SPARK-1790: Support r3 instance types on EC2.

This is the first maintenance release on the 1.0 line. We plan to make
additional maintenance releases as new fixes come in.


2nd Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)

2014-07-04 Thread Mattmann, Chris A (3980)
(apologies for cross-posting)

2nd Workshop on Sustainable Software for Science: Practice and Experiences
(WSSSPE2)
http://wssspe.researchcomputing.org.uk/wssspe2/
(to be held in conjunction with SC14, Sunday, 16 November 2014, New
Orleans, LA, USA)

Progress in scientific research is dependent on the quality and
accessibility of software at all levels and it is critical to address
challenges related to the development, deployment, and maintenance of
reusable software as well as education around software practices. These
challenges can be technological, policy based, organizational, and
educational, and are of interest to developers (the software community),
users (science disciplines), and researchers studying the conduct of
science (science of team science, science of organizations, science of
science and innovation policy, and social science communities).

The WSSSPE1 workshop (http://wssspe.researchcomputing.org.uk/WSSSPE1)
engaged the broad scientific community to identify challenges and best
practices in areas of interest for sustainable scientific software. At
WSSSPE2, we invite the community to propose and discuss specific
mechanisms to move towards an imagined future practice of software
development and usage in science and engineering. The workshop will
include multiple mechanisms for participation, encourage team building
around solutions, and identify risky solutions with potentially
transformative outcomes. Participation by early career students and
postdoctoral researchers is strongly encouraged.

We invite short (4-page) actionable papers that will lead to improvements
for sustainable software science. These papers could be a call to action,
or could provide position or experience reports on sustainable software
activities. The papers will be used by the organizing committee to design
sessions that will be highly interactive and targeted towards facilitating
action. Submitted papers should be archived by a third-party service that
provides DOIs. We encourage submitters to license their papers under a
Creative Commons license that encourages sharing and remixing, as we will
combine ideas (with attribution) into the outcomes of the workshop.

The organizers will invite one or more submitters of provocative papers to
start the workshop by presenting highlights of their papers in a keynote
presentation to initiate active discussion that will continue throughout
the day.

Areas of interest for WSSSPE2 include, but are not limited to:

• defining software sustainability in the context of science and
  engineering software
• how to evaluate software sustainability
• improving the development process that leads to new software
• methods to develop sustainable software from the outset
• effective approaches to reusable software created as a by-product of
  research
• impact of computer science research on the development of scientific
  software
• recommendations for the support and maintenance of existing software
• software engineering best practices
• governance, business, and sustainability models
• the role of community software repositories, their operation and
  sustainability
• reproducibility, transparency needs that may be unique to science
• successful open source software implementations
• incentives for using and contributing to open source software
• transitioning users into contributing developers
• building large and engaged user communities
• developing strong advocates
• measurement of usage and impact
• encouraging industry's role in sustainability
• engagement of industry with volunteer communities
• incentives for industry
• incentives for community to contribute to industry-driven projects
• recommending policy changes
• software credit, attribution, incentive, and reward
• issues related to multiple organizations and multiple countries, such as
  intellectual property, licensing, etc.
• mechanisms and venues for publishing software, and the role of publishers
• improving education and training
• best practices for providing graduate students and postdoctoral
  researchers in domain communities with sufficient training in software
  development
• novel uses of sustainable software in education (K-20)
• case studies from students on issues around software development in the
  undergraduate or graduate curricula
• careers and profession
• successful examples of career paths for developers
• institutional changes to support sustainable software such as promotion
  and tenure metrics, job categories, etc.

Submissions:

Submissions of up to four pages should be formatted to be easily readable
and submitted to an open access repository that provides unique
identifiers (e.g., DOIs) that can be cited, for example http://arxiv.org/
or http://figshare.com/ .

Once you have received an identifier for your self-published paper from a
repository, submit it to WSSSPE2 by creating a new submission at