The problem is that you cannot reallocate partitions without resetting the downstream dependencies. We are discussing a long-running app, not a MapReduce job.
What use case has sparked this discussion? Is there a specific request for such a feature?

Thanks,
Thomas

On Mon, May 9, 2016 at 5:41 PM, Amol Kekre <[email protected]> wrote:

> Spawning block readers on all data nodes will cause scale issues. For
> example, on a 1000 data node cluster we cannot ask for 1000 containers for
> a file that has, say, 8 blocks. This problem has been solved by MapReduce;
> ideally we should use that part of MapReduce, but I am not sure if it
> could be re-used. Assuming we cannot reuse it, I am covering possible
> cases to consider.
>
> At a high level the operator wants to spawn containers where the data is.
> Given that, it becomes a resource-ask call. We do have LOCALITY_HOST to
> start with in a pre-launch phase, so that is a great start. As the RM is
> asked for resources, we need to consider the implications. I am listing
> some here (there could be more):
> 1. That node does not have a container available: the RM rejects or
> allocates another node
> 2. It may not be desirable to put two partitions on the same node
> // Apex already has a strong RM->resource ask protocol, so Apex again is
> on good ground
>
> We can improve a bit more, as HDFS blocks would (usually) have three
> copies. We can improve the probability of success by listing all three in
> LOCALITY_HOST. Here it gets slightly complicated. Note that requesting all
> three nodes and getting back 2 (in a worst-case scenario) per partition
> taxes the RM, so it should be avoided.
>
> The solution could be something along the following lines:
> - Get node affinity from each partition with 3 choices
> - Create a first list of nodes that satisfies #1 and #2 above (or more
>   constraints)
> - Then iterate till a solution is found
>   -> Ask the RM for node selections
>   -> Check if the containers returned fit the solution
>   -> Repeat until a good fit is found or until N iterations
> - End iteration
>
> There are some more optimizations during the iterations -> if node-local
> does not work, try rack-local. For now, getting a node-local attempt (say
> N times, N may be as low as 2) would be a great start.
>
> Thks,
> Amol
>
>
> On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]> wrote:
>
> > It is already possible to request a specific host for a partition.
> >
> > That's true. I just saw that a Partition contains a Map of attributes
> > and that can contain LOCALITY_HOST.
> >
> > But you may want to evaluate the cost of container allocation and need
> > to reset the entire DAG against the benefits that you get from data
> > locality.
> >
> > I see. So instead of spawning a Block Reader on all the nodes (Pramod's
> > proposal) we can spawn a Block Reader on all the data nodes.
> >
> > We can then have an HDFS-specific module which finds all the data nodes
> > by talking to the NameNode and creates BlockReader partitions using that.
> >
> > Chandni
> >
> >
> > On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]> wrote:
> >
> > > It is already possible to request a specific host for a partition.
> > >
> > > But you may want to evaluate the cost of container allocation and need
> > > to reset the entire DAG against the benefits that you get from data
> > > locality.
> > >
> > > --
> > > sent from mobile
> > > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
> > >
> > > > Hi Pramod,
> > > >
> > > > I thought about this, and IMO one way to achieve it a little more
> > > > efficiently is by providing some support from the platform and
> > > > intelligent partitioning in the BlockReader.
> > > >
> > > > 1. Platform support: A partition should be able to express on which
> > > > node it should be created. The application master then requests the
> > > > RM to deploy the partition on that node.
> > > >
> > > > 2. Initially just one instance of the Block Reader is created. When
> > > > it receives BlockMetadata, it can derive where the new HDFS blocks
> > > > are. So it can create more partitions if there isn't a BlockReader
> > > > already running on that node.
> > > >
> > > > I would like to take it up if there is some consensus to support this.
> > > >
> > > > Chandni
> > > >
> > > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde <[email protected]> wrote:
> > > >
> > > > > So the requirement is to mix runtime and deployment decisions.
> > > > > How about allowing the operators to request re-deployment based on
> > > > > the runtime condition?
> > > > >
> > > > >
> > > > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni <[email protected]> wrote:
> > > > >
> > > > > > The file splitter, block reader combination allows for parallel
> > > > > > reading of files by multiple partitions by dividing the files
> > > > > > into blocks. Does anyone have ideas on how to make the block
> > > > > > readers data-local to the blocks they are reading?
> > > > > >
> > > > > > I think we will need to spawn block readers on all the nodes
> > > > > > where the blocks are present, and if the readers are reading
> > > > > > multiple files this could mean all the nodes in the cluster, and
> > > > > > route the block meta information to the appropriate block reader.
> > > > > >
> > > > > > Thanks
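To make the ideas in the thread concrete, here is a minimal sketch of what Chandni and Amol describe: look up the replica hosts of each HDFS block with FileSystem.getFileBlockLocations() and record one of them in the partition's LOCALITY_HOST attribute, so the application master can ask the RM for a container on that host. Only the Hadoop FileSystem calls and the LOCALITY_HOST attribute named above are taken from the discussion; the partitioner class, its constructor, and the use of the Malhar BlockReader as the partitioned operator are illustrative assumptions, and the exact location of the attribute (referenced here as OperatorContext.LOCALITY_HOST) may differ by Apex version.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultPartition;
import com.datatorrent.api.Partitioner;
// Assumed: the Malhar block reader discussed in the thread; the package may differ by version.
import com.datatorrent.lib.io.block.BlockReader;

/**
 * Illustrative partitioner: one BlockReader partition per HDFS block of the
 * given file, with one replica host of the block requested via LOCALITY_HOST.
 */
public class BlockLocalityPartitioner implements Partitioner<BlockReader>
{
  private final String filePath;

  public BlockLocalityPartitioner(String filePath)
  {
    this.filePath = filePath;
  }

  @Override
  public Collection<Partition<BlockReader>> definePartitions(
      Collection<Partition<BlockReader>> partitions, PartitioningContext context)
  {
    List<Partition<BlockReader>> newPartitions = new ArrayList<>();
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus status = fs.getFileStatus(new Path(filePath));
      // One BlockLocation per HDFS block; each lists the hosts of all replicas.
      for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
        Partition<BlockReader> partition = new DefaultPartition<>(new BlockReader());
        String[] hosts = block.getHosts();
        if (hosts.length > 0) {
          // Request placement on one replica host. LOCALITY_HOST takes a single
          // host string, so Amol's idea of offering all three replicas would need
          // additional support from the app master.
          partition.getAttributes().put(OperatorContext.LOCALITY_HOST, hosts[0]);
        }
        newPartitions.add(partition);
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to look up block locations for " + filePath, e);
    }
    return newPartitions;
  }

  @Override
  public void partitioned(Map<Integer, Partition<BlockReader>> partitions)
  {
    // Nothing to do in this sketch.
  }
}

Whether the app master honors LOCALITY_HOST for partitions created dynamically at runtime, rather than only in the pre-launch phase, is exactly the open question in this thread, so the attribute write above is a request rather than a guarantee.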

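Amol's iterate-until-the-containers-fit step could look roughly like the following against the YARN AMRMClient API. Only the YARN types and methods are real; the loop structure, the fits() check, the placeholder resource size (1024 MB, 1 vcore), and the single allocate() call per attempt are simplifications of what an application master actually does (a real AM heartbeats continuously and matches allocations asynchronously).

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class NodeLocalAllocator
{
  /**
   * Try up to maxAttempts times to get a container on one of the block's
   * replica hosts; only the final attempt relaxes locality so YARN may fall
   * back to rack-local or any node. Returns null if nothing suitable arrived.
   */
  public Container allocate(AMRMClient<ContainerRequest> rm, String[] replicaHosts,
      String[] racks, int maxAttempts) throws Exception
  {
    Resource capability = Resource.newInstance(1024, 1); // placeholder sizing
    Priority priority = Priority.newInstance(1);

    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      boolean lastAttempt = attempt == maxAttempts - 1;
      // Strict node locality on the replica hosts first; relax only at the end.
      ContainerRequest request = lastAttempt
          ? new ContainerRequest(capability, replicaHosts, racks, priority, true)
          : new ContainerRequest(capability, replicaHosts, null, priority, false);
      rm.addContainerRequest(request);

      AllocateResponse response = rm.allocate(0.0f);
      for (Container container : response.getAllocatedContainers()) {
        if (lastAttempt || fits(container, replicaHosts)) {
          return container;
        }
        // The returned container does not fit the solution: give it back, retry.
        rm.releaseAssignedContainer(container.getId());
      }
      rm.removeContainerRequest(request);
    }
    return null;
  }

  private boolean fits(Container container, String[] replicaHosts)
  {
    String host = container.getNodeId().getHost();
    for (String replicaHost : replicaHosts) {
      if (replicaHost.equals(host)) {
        return true;
      }
    }
    return false;
  }
}

The final relaxed request mirrors the "if node-local does not work, try rack-local" fallback Amol mentions, and maxAttempts corresponds to his N (possibly as low as 2).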