The underlying assumption that each compute node is also a data node isn't correct either. I would like to see a better analysis of pros and cons in this discussion.
Thomas

On Tue, May 10, 2016 at 12:00 AM, Thomas Weise <[email protected]> wrote:

> The problem is that you cannot reallocate partitions without resetting
> the downstream dependencies. We are discussing a long-running app, not a
> MapReduce job.
>
> What use case has sparked this discussion? Is there a specific request
> for such a feature?
>
> Thanks,
> Thomas
>
>
> On Mon, May 9, 2016 at 5:41 PM, Amol Kekre <[email protected]> wrote:
>
>> Spawning block readers on all data nodes will cause scale issues. For
>> example, on a 1000-data-node cluster we cannot ask for 1000 containers
>> for a file that has, say, 8 blocks. This problem has already been solved
>> by MapReduce; ideally we would reuse that part of MapReduce, but I am
>> not sure it can be re-used. Assuming we cannot, I am covering possible
>> cases to consider.
>>
>> At a high level, the operator wants to spawn containers where the data
>> is, so this becomes a resource-ask problem. We do have LOCALITY_HOST to
>> start with in the pre-launch phase, so that is a great start. As the RM
>> is asked for resources, we need to consider the implications. I am
>> listing some here (there could be more):
>> 1. That node does not have a container available: the RM rejects the
>> request or allocates another node.
>> 2. It may not be desirable to put two partitions on the same node.
>> // Apex already has a strong RM resource-ask protocol, so Apex is again
>> on good ground here.
>>
>> We can improve a bit more, as HDFS blocks usually have three copies. We
>> can improve the probability of success by listing all three in
>> LOCALITY_HOST. Here it gets slightly complicated: requesting all three
>> nodes and getting back 2 per partition (in the worst-case scenario)
>> taxes the RM, so that should be avoided.
>>
>> The solution could be something along the following lines:
>> - Get node affinity from each partition, with 3 choices
>> - Create a first list of nodes that satisfies #1 and #2 above (or more
>> constraints)
>> - Then iterate until a solution is found:
>>   -> Ask the RM for node selections
>>   -> Check if the containers returned fit the solution
>>   -> Repeat until a good fit is found, or for at most N iterations
>> - End iteration
>>
>> There are some more optimizations during the iterations: if node-local
>> does not work, try rack-local. For now, attempting node-local placement
>> (say N times, where N may be as low as 2) would be a great start.
>>
>> Thks,
>> Amol
>>
>>
>> On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]>
>> wrote:
>>
>> > It is already possible to request a specific host for a partition.
>> >
>> > That's true. I just saw that a Partition contains a map of attributes
>> > and that can contain LOCALITY_HOST.
>> >
>> > But you may want to evaluate the cost of container allocation and the
>> > need to reset the entire DAG against the benefits that you get from
>> > data locality.
>> >
>> > I see. So instead of spawning a Block Reader on all the nodes
>> > (Pramod's proposal), we can spawn a Block Reader on all the data
>> > nodes.
>> >
>> > We can then have an HDFS-specific module which finds all the data
>> > nodes by talking to the NameNode and creates BlockReader partitions
>> > using that.
>> >
>> > Chandni
>> >
>> >
>> > On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]>
>> > wrote:
>> >
>> > > It is already possible to request a specific host for a partition.
>> > >
>> > > But you may want to evaluate the cost of container allocation and
>> > > the need to reset the entire DAG against the benefits that you get
>> > > from data locality.
>> > >
>> > > --
>> > > sent from mobile
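As a rough illustration of the lookup described above (Amol's idea of listing all replica hosts in LOCALITY_HOST, and Chandni's HDFS-specific module that talks to the NameNode), the sketch below collects the replica hosts for every block of a file using the standard Hadoop client API. The class and method names are placeholders; feeding the returned hosts into the resource requests is the Apex-side step being discussed and is not shown here.

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHostLookup
    {
      /**
       * Maps each block's offset to the datanode hosts that hold a replica
       * of that block (normally three hosts per block).
       */
      public static Map<Long, String[]> replicaHostsPerBlock(String file, Configuration conf)
          throws IOException
      {
        Map<Long, String[]> hostsByOffset = new LinkedHashMap<Long, String[]>();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(file));
        // One BlockLocation per HDFS block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          hostsByOffset.put(block.getOffset(), block.getHosts());
        }
        return hostsByOffset;
      }
    }

Each entry gives one partition's candidate hosts; per the discussion above, a request would list all of them so that a rejection on one datanode can fall back to another replica before resorting to rack-local placement.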
>> > > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
>> > >
>> > > > Hi Pramod,
>> > > >
>> > > > I thought about this, and IMO one way to achieve it a little more
>> > > > efficiently is by providing some support from the platform plus
>> > > > intelligent partitioning in BlockReader.
>> > > >
>> > > > 1. Platform support: a partition should be able to express on
>> > > > which node it should be created. The application master then
>> > > > requests the RM to deploy the partition on that node.
>> > > >
>> > > > 2. Initially just one instance of Block Reader is created. When it
>> > > > receives BlockMetadata, it can derive where the new HDFS blocks
>> > > > are, so it can create more partitions if there isn't already a
>> > > > BlockReader running on that node.
>> > > >
>> > > > I would like to take this up if there is some consensus to support
>> > > > it.
>> > > >
>> > > > Chandni
>> > > >
>> > > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > So the requirement is to mix runtime and deployment decisions.
>> > > > > How about allowing the operators to request re-deployment based
>> > > > > on the runtime condition?
>> > > > >
>> > > > >
>> > > > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > The file splitter / block reader combination allows for
>> > > > > > parallel reading of files by multiple partitions by dividing
>> > > > > > the files into blocks. Does anyone have ideas on how to make
>> > > > > > the block readers data-local to the blocks they are reading?
>> > > > > >
>> > > > > > I think we will need to spawn block readers on all nodes where
>> > > > > > the blocks are present (if the readers are reading multiple
>> > > > > > files, this could mean all the nodes in the cluster) and route
>> > > > > > the block meta information to the appropriate block reader.
>> > > > > >
>> > > > > > Thanks
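To make Chandni's first suggestion above (a partition expressing the node on which it should be created) concrete, here is a minimal sketch of a Partitioner that creates one reader partition per candidate host and tags it with the LOCALITY_HOST attribute, so the application master can request a container on that node. It assumes the Apex Partitioner/DefaultPartition API and the OperatorContext.LOCALITY_HOST attribute referenced in this thread; HostLocalPartitioner itself, OperatorFactory, and the way the host list is obtained (for example from the block-location lookup sketched earlier) are illustrative, not existing pieces of the library.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Map;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultPartition;
    import com.datatorrent.api.Operator;
    import com.datatorrent.api.Partitioner;

    public class HostLocalPartitioner<T extends Operator> implements Partitioner<T>
    {
      /** Hypothetical factory so the sketch stays independent of a concrete reader. */
      public interface OperatorFactory<T>
      {
        T newOperator();
      }

      private final Collection<String> candidateHosts;
      private final OperatorFactory<T> factory;

      public HostLocalPartitioner(Collection<String> candidateHosts, OperatorFactory<T> factory)
      {
        this.candidateHosts = candidateHosts;
        this.factory = factory;
      }

      @Override
      public Collection<Partition<T>> definePartitions(Collection<Partition<T>> partitions,
          PartitioningContext context)
      {
        Collection<Partition<T>> newPartitions = new ArrayList<Partition<T>>();
        for (String host : candidateHosts) {
          Partition<T> partition = new DefaultPartition<T>(factory.newOperator());
          // Placement preference for this partition; the app master turns it into
          // a host-specific container request, which the RM may still decline.
          partition.getAttributes().put(OperatorContext.LOCALITY_HOST, host);
          newPartitions.add(partition);
        }
        return newPartitions;
      }

      @Override
      public void partitioned(Map<Integer, Partition<T>> partitions)
      {
        // No bookkeeping needed for this sketch.
      }
    }

If the RM cannot honor a requested host, the retry and rack-local fallback described earlier in the thread would apply before giving up on locality for that partition.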
