Spawning block readers on all data nodes will cause scale issues. For example, on a 1000-data-node cluster we cannot ask for 1000 containers for a file that has, say, 8 blocks. This problem has already been solved by MapReduce; ideally we should reuse that part of MapReduce, though I am not sure whether it can be reused. Assuming we cannot, I am covering the possible cases to consider below.
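For reference, the block-to-host mapping everything below depends on is available from the NameNode through the standard Hadoop FileSystem API. A minimal sketch (the class and method names here are illustrative, not existing Apex code):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHostLookup
{
  // Returns the distinct hosts holding at least one replica of the file's blocks,
  // so container requests can target only those hosts instead of every data node.
  public static Set<String> hostsForFile(Configuration conf, Path file) throws IOException
  {
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    Set<String> hosts = new LinkedHashSet<>();
    for (BlockLocation block : blocks) {
      for (String host : block.getHosts()) {  // typically 3 replica hosts per block
        hosts.add(host);
      }
    }
    return hosts;
  }
}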
At a high level, the operator wants to spawn containers where the data is, so this becomes a resource-ask call. We do have LOCALITY_HOST to start with in a pre-launch phase, so that is a great start. As the RM is asked for resources, we need to consider the implications. I am listing some here (there could be more):

1. That node does not have a container available: the RM rejects the request or allocates another node.
2. It may not be desirable to put two partitions on the same node.

Apex already has a strong RM resource-ask protocol, so Apex is again on good ground.

We can improve a bit more, as HDFS blocks (usually) have three copies, so we can improve the probability of success by listing all three in LOCALITY_HOST. Here it gets slightly complicated: requesting all three nodes and getting back 2 per partition (in a worst-case scenario) taxes the RM, so that should be avoided. The solution could be something along the following lines:

- Get node affinity from each partition, with 3 choices each.
- Create a first list of nodes that satisfies #1 and #2 above (or more constraints).
- Then iterate till a solution is found:
  -> Ask the RM for node selections.
  -> Check if the containers returned fit the solution.
  -> Repeat till a good case is found, or until N iterations.
- End iteration.

There are some more optimizations possible during the iterations, e.g. if node-local does not work, try rack-local. For now, attempting node-local placement (say N times, where N may be as low as 2) would be a great start. A rough sketch of this selection loop is below.
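This is only an illustration of the loop described above: PartitionAffinity and the ResourceAsk callback are hypothetical placeholders for the partition/attribute and RM-request machinery, not existing Apex or YARN APIs.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalityPlanner
{
  public interface ResourceAsk
  {
    // Ask the RM for containers on the candidate hosts; return the hosts actually granted.
    Set<String> request(Set<String> candidateHosts);
  }

  public static class PartitionAffinity
  {
    final List<String> replicaHosts;  // up to 3 hosts holding the partition's blocks
    String assignedHost;              // filled in once a fitting container is found

    public PartitionAffinity(List<String> replicaHosts)
    {
      this.replicaHosts = replicaHosts;
    }
  }

  // Try up to maxAttempts to place every partition on one of its replica hosts while
  // keeping partitions on distinct nodes (constraint #2 above). Returns true on success.
  public static boolean plan(List<PartitionAffinity> partitions, ResourceAsk rm, int maxAttempts)
  {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      // First list: all candidate hosts derived from the partitions' block replicas.
      Set<String> candidates = new HashSet<>();
      for (PartitionAffinity p : partitions) {
        candidates.addAll(p.replicaHosts);
      }

      // Ask the RM for node selections, then check whether what came back fits.
      Set<String> granted = rm.request(candidates);
      Set<String> used = new HashSet<>();
      boolean fits = true;
      for (PartitionAffinity p : partitions) {
        p.assignedHost = null;
        for (String host : p.replicaHosts) {
          if (granted.contains(host) && used.add(host)) {  // greedy: one partition per host
            p.assignedHost = host;
            break;
          }
        }
        fits &= (p.assignedHost != null);
      }
      if (fits) {
        return true;  // good case found
      }
      // Otherwise repeat; a refinement could relax to rack-local here instead of retrying.
    }
    return false;  // caller falls back to rack-local or default placement after N attempts
  }
}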
Thanks,
Amol

On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]> wrote:

> It is already possible to request a specific host for a partition.
>
> That's true. I just saw that a Partition contains a Map of attributes and that can contain LOCALITY_HOST.
>
> But you may want to evaluate the cost of container allocation and need to reset the entire DAG against the benefits that you get from data locality.
>
> I see. So instead of spawning a Block Reader on all the nodes (Pramod's proposal), we can spawn a Block Reader on all the data nodes.
>
> We can then have an HDFS-specific module which finds all the data nodes by talking to the NameNode and creates BlockReader partitions using that.
>
> Chandni
>
> On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]> wrote:
>
> > It is already possible to request a specific host for a partition.
> >
> > But you may want to evaluate the cost of container allocation and need to reset the entire DAG against the benefits that you get from data locality.
> >
> > --
> > sent from mobile
> >
> > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
> >
> > > Hi Pramod,
> > >
> > > I thought about this, and IMO one way to achieve it a little more efficiently is by providing some support from the platform and intelligent partitioning in BlockReader.
> > >
> > > 1. Platform support: a partition should be able to express on which node it should be created. The application master then requests the RM to deploy the partition on that node.
> > >
> > > 2. Initially just one instance of Block Reader is created. When it receives BlockMetadata, it can derive where the new HDFS blocks are, so it can create more partitions if there isn't a BlockReader already running on that node.
> > >
> > > I would like to take this up if there is some consensus to support it.
> > >
> > > Chandni
> > >
> > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde <[email protected]> wrote:
> > >
> > > > So the requirement is to mix runtime and deployment decisions.
> > > > How about allowing the operators to request re-deployment based on the runtime condition?
> > > >
> > > > On Mon, May 9, 2016 at 2:33 PM, Pramod Immaneni <[email protected]> wrote:
> > > >
> > > > > The file splitter / block reader combination allows for parallel reading of files by multiple partitions by dividing the files into blocks. Does anyone have any ideas on how to have the block readers be data-local to the blocks they are reading?
> > > > >
> > > > > I think we will need to spawn block readers on all nodes where the blocks are present, and if the readers are reading multiple files this could mean all the nodes in the cluster, and route the block meta information to the appropriate block reader.
> > > > >
> > > > > Thanks
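A rough sketch of Chandni's proposal in the quoted messages above (one BlockReader partition per data-holding host, created lazily as block metadata arrives) could look roughly like the following. All types and the "LOCALITY_HOST" string key are hypothetical stand-ins for the Apex BlockMetadata, Partitioner.Partition and OperatorContext.LOCALITY_HOST mentioned in the thread; the actual signatures differ.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostAwarePartitioning
{
  public static class BlockMeta
  {
    final long offset;
    final long length;
    final List<String> hosts;  // hosts holding replicas of this block

    public BlockMeta(long offset, long length, List<String> hosts)
    {
      this.offset = offset;
      this.length = length;
      this.hosts = hosts;
    }
  }

  public static class ReaderPartition
  {
    final Map<String, String> attributes = new HashMap<>();
  }

  // One reader partition per host that actually holds blocks; partitions are only
  // added when a new host shows up in the incoming block metadata.
  private final Map<String, ReaderPartition> readersByHost = new HashMap<>();

  public void onBlockMetadata(BlockMeta block)
  {
    for (String host : block.hosts) {
      if (!readersByHost.containsKey(host)) {
        ReaderPartition partition = new ReaderPartition();
        // Express the desired deployment node so the master can ask the RM for it.
        partition.attributes.put("LOCALITY_HOST", host);
        readersByHost.put(host, partition);
      }
    }
    // The block metadata would then be routed to the reader on one of block.hosts.
  }
}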
