Spawning block readers on all data nodes will cause scale issues. For example, on a 1000-data-node cluster we cannot ask for 1000 containers for a file that has, say, 8 blocks. This problem has already been solved by MapReduce; ideally we should reuse that part of MapReduce, though I am not sure whether it can be reused. Assuming we cannot, I am covering the possible cases to consider below.
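For reference, the block-to-host mapping everything below depends on is available from the NameNode through the standard Hadoop FileSystem API. A minimal sketch (the class and method names here are illustrative, not existing Apex code):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHostLookup
{
  // Returns the distinct hosts holding at least one replica of the file's blocks,
  // so container requests can target only those hosts instead of every data node.
  public static Set<String> hostsForFile(Configuration conf, Path file) throws IOException
  {
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    Set<String> hosts = new LinkedHashSet<>();
    for (BlockLocation block : blocks) {
      for (String host : block.getHosts()) {  // typically 3 replica hosts per block
        hosts.add(host);
      }
    }
    return hosts;
  }
}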
At a high level, the operator wants to spawn containers where the data is, so this becomes a resource-ask call. We do have LOCALITY_HOST to start with in a pre-launch phase, so that is a great start. As the RM is asked for resources, we need to consider the implications. I am listing some here (there could be more):

1. That node does not have a container available: the RM rejects the request or allocates another node.
2. It may not be desirable to put two partitions on the same node.

Apex already has a strong RM resource-ask protocol, so Apex is again on good ground.

We can improve a bit more, as HDFS blocks (usually) have three copies, so we can improve the probability of success by listing all three in LOCALITY_HOST. Here it gets slightly complicated: requesting all three nodes and getting back 2 per partition (in a worst-case scenario) taxes the RM, so that should be avoided. The solution could be something along the following lines:

- Get node affinity from each partition, with 3 choices each.
- Create a first list of nodes that satisfies #1 and #2 above (or more constraints).
- Then iterate till a solution is found:
  -> Ask the RM for node selections.
  -> Check if the containers returned fit the solution.
  -> Repeat till a good case is found, or until N iterations.
- End iteration.

There are some more optimizations possible during the iterations, e.g. if node-local does not work, try rack-local. For now, attempting node-local placement (say N times, where N may be as low as 2) would be a great start. A rough sketch of this selection loop is below.
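This is only an illustration of the loop described above: PartitionAffinity and the ResourceAsk callback are hypothetical placeholders for the partition/attribute and RM-request machinery, not existing Apex or YARN APIs.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalityPlanner
{
  public interface ResourceAsk
  {
    // Ask the RM for containers on the candidate hosts; return the hosts actually granted.
    Set<String> request(Set<String> candidateHosts);
  }

  public static class PartitionAffinity
  {
    final List<String> replicaHosts;  // up to 3 hosts holding the partition's blocks
    String assignedHost;              // filled in once a fitting container is found

    public PartitionAffinity(List<String> replicaHosts)
    {
      this.replicaHosts = replicaHosts;
    }
  }

  // Try up to maxAttempts to place every partition on one of its replica hosts while
  // keeping partitions on distinct nodes (constraint #2 above). Returns true on success.
  public static boolean plan(List<PartitionAffinity> partitions, ResourceAsk rm, int maxAttempts)
  {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      // First list: all candidate hosts derived from the partitions' block replicas.
      Set<String> candidates = new HashSet<>();
      for (PartitionAffinity p : partitions) {
        candidates.addAll(p.replicaHosts);
      }

      // Ask the RM for node selections, then check whether what came back fits.
      Set<String> granted = rm.request(candidates);
      Set<String> used = new HashSet<>();
      boolean fits = true;
      for (PartitionAffinity p : partitions) {
        p.assignedHost = null;
        for (String host : p.replicaHosts) {
          if (granted.contains(host) && used.add(host)) {  // greedy: one partition per host
            p.assignedHost = host;
            break;
          }
        }
        fits &= (p.assignedHost != null);
      }
      if (fits) {
        return true;  // good case found
      }
      // Otherwise repeat; a refinement could relax to rack-local here instead of retrying.
    }
    return false;  // caller falls back to rack-local or default placement after N attempts
  }
}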
Thanks,
Amol

On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]> wrote:

> It is already possible to request a specific host for a partition.
>
> That's true. I just saw that a Partition contains a Map of attributes and that can contain LOCALITY_HOST.
>
> But you may want to evaluate the cost of container allocation and need to reset the entire DAG against the benefits that you get from data locality.
>
> I see. So instead of spawning a Block Reader on all the nodes (Pramod's proposal), we can spawn a Block Reader on all the data nodes.
>
> We can then have an HDFS-specific module which finds all the data nodes by talking to the NameNode and creates BlockReader partitions using that.
>
> Chandni
>
> On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]> wrote:
>
> > It is already possible to request a specific host for a partition.
> >
> > But you may want to evaluate the cost of container allocation and need to reset the entire DAG against the benefits that you get from data locality.
> >
> > --
> > sent from mobile
> >
> > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
> >
> > > Hi Pramod,
> > >
> > > I thought about this, and IMO one way to achieve it a little more efficiently is by providing some support from the platform and intelligent partitioning in BlockReader.
> > >
> > > 1. Platform support: a partition should be able to express on which node it should be created. The application master then requests the RM to deploy the partition on that node.
> > >
> > > 2. Initially just one instance of Block Reader is created. When it receives BlockMetadata, it can derive where the new HDFS blocks are, so it can create more partitions if there isn't a BlockReader already running on that node.
> > >
> > > I would like to take this up if there is some consensus to support it.
> > >
> > > Chandni
> > >
> > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde <[email protected]> wrote:
> > >
> > > > So the requirement is to mix runtime and deployment decisions.
> > > > How about allowing the operators to request re-deployment based on the runtime condition?
> > > >
> > > > On Mon, May 9, 2016 at 2:33 PM, Pramod Immaneni <[email protected]> wrote:
> > > >
> > > > > The file splitter / block reader combination allows for parallel reading of files by multiple partitions by dividing the files into blocks. Does anyone have any ideas on how to have the block readers be data-local to the blocks they are reading?
> > > > >
> > > > > I think we will need to spawn block readers on all nodes where the blocks are present, and if the readers are reading multiple files this could mean all the nodes in the cluster, and route the block meta information to the appropriate block reader.
> > > > >
> > > > > Thanks
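A rough sketch of Chandni's proposal in the quoted messages above (one BlockReader partition per data-holding host, created lazily as block metadata arrives) could look roughly like the following. All types and the "LOCALITY_HOST" string key are hypothetical stand-ins for the Apex BlockMetadata, Partitioner.Partition and OperatorContext.LOCALITY_HOST mentioned in the thread; the actual signatures differ.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostAwarePartitioning
{
  public static class BlockMeta
  {
    final long offset;
    final long length;
    final List<String> hosts;  // hosts holding replicas of this block

    public BlockMeta(long offset, long length, List<String> hosts)
    {
      this.offset = offset;
      this.length = length;
      this.hosts = hosts;
    }
  }

  public static class ReaderPartition
  {
    final Map<String, String> attributes = new HashMap<>();
  }

  // One reader partition per host that actually holds blocks; partitions are only
  // added when a new host shows up in the incoming block metadata.
  private final Map<String, ReaderPartition> readersByHost = new HashMap<>();

  public void onBlockMetadata(BlockMeta block)
  {
    for (String host : block.hosts) {
      if (!readersByHost.containsKey(host)) {
        ReaderPartition partition = new ReaderPartition();
        // Express the desired deployment node so the master can ask the RM for it.
        partition.attributes.put("LOCALITY_HOST", host);
        readersByHost.put(host, partition);
      }
    }
    // The block metadata would then be routed to the reader on one of block.hosts.
  }
}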
