GitHub user shubhamchopra opened a pull request:

    https://github.com/apache/spark/pull/13932

    [SPARK-15354] [CORE] [WIP] Topology aware block replication strategies

    ## What changes were proposed in this pull request?
    
    Implementations of resilient block replication strategies for different
    resource managers, modeled on the 3-replica strategy used by HDFS: the first
    replica is on an executor, the second is within the same rack as that
    executor, and the third is on a different rack.
    The implementation provides two pluggable classes: one runs in the driver
    and provides topology information for every host at cluster start, and the
    other prioritizes a list of peer BlockManagerIds.
    
    The prioritization itself can be thought of as an optimization problem:
    find a minimal set of peers that satisfies certain objectives, and replicate
    to these peers first. The objectives can express richer constraints over and
    above the HDFS-like 3-replica strategy.
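
One way to picture this formulation is a plain greedy set cover, sketched below. This is an assumption-laden illustration, not the optimizer in this patch: `Peer`, `Objective`, and `prioritize` are hypothetical names, and objectives are modeled simply as predicates over a single peer.

```scala
import scala.annotation.tailrec

object GreedyPeerSelection {
  // Hypothetical peer description; not Spark's BlockManagerId.
  final case class Peer(id: String, host: String, rack: String)

  // An objective is modeled as a named predicate over a candidate peer.
  final case class Objective(name: String, satisfiedBy: Peer => Boolean)

  // Greedy set cover: repeatedly pick the peer that satisfies the most
  // still-unmet objectives, then append all other peers as fallbacks.
  def prioritize(peers: Seq[Peer], objectives: Seq[Objective]): Seq[Peer] = {
    @tailrec
    def loop(remaining: Seq[Peer], unmet: Seq[Objective], chosen: Vector[Peer]): Vector[Peer] =
      if (unmet.isEmpty || remaining.isEmpty) chosen
      else {
        val best = remaining.maxBy(p => unmet.count(_.satisfiedBy(p)))
        if (unmet.count(_.satisfiedBy(best)) == 0) chosen // nothing left to gain
        else loop(
          remaining.filterNot(_ == best),
          unmet.filterNot(_.satisfiedBy(best)),
          chosen :+ best)
      }

    val cover = loop(peers, objectives, Vector.empty)
    cover ++ peers.filterNot(p => cover.contains(p))
  }
}
```

With objectives such as "same rack, different host" and "different rack", the peers at the front of the result form a small set covering both, which corresponds to the HDFS-like placement; additional predicates could express other constraints.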
    
    ## How was this patch tested?
    
    This patch was tested with unit tests for storage, along with new unit 
tests to verify prioritization behaviour.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shubhamchopra/spark PrioritizerStrategy

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13932
    
----
commit 779ce27dbeedd4d5c72e28782c9d38af51d2060c
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-05T22:06:14Z

    Adding capability to prioritize peer executors based on rack awareness 
while replicating blocks.

commit d0b6747f1fc9a0b701ab41fe5cf67939ed36cb9e
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-06T17:40:47Z

    Minor modifications to get past the style check errors.

commit 942908ac060fbdd29d0efd1f8541436bf9cd46d8
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-06T20:31:22Z

    Using blockId hashcode as a source of randomness, so we don't keep choosing 
the same peers for replication.

commit 0902e39fc7a2526539013e67c48bc13b6991bf07
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-09T20:36:53Z

    Several changes:
    1. Adding rack attribute to hashcode and equals to block manager id.
    2. Removing boolean check for rack awareness. Asking master for rack info, 
and master uses topology mapper.
    3. Adding a topology mapper trait and a default implementation that block 
manager master endpoint uses to discern topology information.

commit 86e1e0212b0dae0d598f0128c6a7b8f33429dc27
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-09T20:58:21Z

    Adding null check so a Block Manager can be initialized without the master.

commit a3b50ae9bcca7e871d384fa4614b2c77ac5ff5ad
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-12T21:09:16Z

    Renaming classes/variables from rack to a more general topology.

commit 1ee7948ce3994df08119418b779f8cc2e5aaca86
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-12T21:15:46Z

    Renaming classes/variables from rack to a more general topology.

commit 8de5c6e39cd0a868094803a0f53b3b50b7ed90d5
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-12T21:27:29Z

    We continue to randomly choose peers, so there is no change in current 
behavior.

commit 72ae37d64724423c65d3a23559a5f46649ffa4c3
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-13T15:36:17Z

    Spelling correction and minor changes in comments to use a more general 
topology instead of rack.

commit e071ca3a838193efad715764cc654507ee254e44
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-13T20:32:13Z

    Minor change. Changing replication info message to debug level.

commit 96aaf6ec50ae943c1345966cfc11fd4180ddfa3a
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-16T21:47:33Z

    Providing peersReplicateTo to the prioritizer.

commit d125188d633744cfeddf5b0436b3217ef87a2220
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-17T19:25:34Z

    Adding developer api annotations to TopologyMapper and 
BlockReplicationPrioritization

commit 16a1ce89c5b48c3770de1e32519c8690de296058
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-18T20:52:22Z

    Changes recommended by @HyukjinKwon to fix style issues.

commit da4568e03e3690781bb03e2df2e587ceecd59bf0
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-20T18:43:07Z

    Updating prioritizer api to use current blockmanager id for self 
identification.

commit dc1cfeace2e90a3d8e0240dc8fe3540144daaef4
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-24T18:09:08Z

    Adding a set-cover formulation for picking peers to replicate blocks.

commit 165e8814afeb0cebceef0c49bd6c715f70bed0fb
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-05-24T18:11:45Z

    Adding newline to the end of file

commit 7e4d4f1e43c0cc61dbaa0baea2ea22ab3371d8a4
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-06-14T22:31:52Z

    Modifying the optimizer to use a modified greedy optimizer. We now try to 
get peers making sure objectives previously satisfied are not violated.

commit 0aecd219865e6b150aad3c04bfbd465a15f289ff
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-06-16T22:45:41Z

    Making sure we consider peers we have previously replicated to.

commit 6e5b4a6a194ffa57d5470527af9a20a65ceb01f0
Author: Shubham Chopra <schopr...@bloomberg.net>
Date:   2016-06-24T22:02:11Z

    1. Fixing topology mapper class issue, so we instantiate it correctly. 2. 
Fixing style issues

----
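
For the commit above that uses the blockId hashcode as a source of randomness, a minimal sketch of the idea follows. The object and method names are hypothetical, and block and peer ids are plain strings here purely for illustration.

```scala
import scala.util.Random

object SeededPeerShuffle {
  // Seeds the generator with the block id's hash code, so the peer order is
  // reproducible for a given block while generally varying across blocks.
  def shuffledPeers(blockId: String, peers: Seq[String]): Seq[String] = {
    val rng = new Random(blockId.hashCode)
    rng.shuffle(peers)
  }
}
```

Because the seed is derived from the block, retrying replication of the same block sees the same order, while different blocks tend to spread over different peers.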


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
