Re: Block reading and data locality

Chandni Singh Mon, 09 May 2016 16:22:59 -0700

> It is already possible to request a specific host for a partition.

That's true. I just saw that a Partition contains a Map of attributes, which
can include LOCALITY_HOST.



> But you may want to evaluate the cost of container allocation and the need
> to reset the entire DAG against the benefits that you get from data locality.

I see. So instead of spawning a Block Reader on all the nodes (Pramod's
proposal), we can spawn a Block Reader on each of the data nodes.

We can then have an HDFS-specific module that finds all the data nodes by
talking to the NameNode and creates BlockReader partitions from that list.
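
A rough sketch of what such a module might do with the block locations, in
plain Java with no Hadoop or Apex types. The class and method names are
hypothetical. It assumes the block-to-host map has already been fetched from
the NameNode (for example via FileSystem.getFileBlockLocations, whose
BlockLocation results expose getHosts()). It plans one reader per data node
that holds a replica and routes each block to the least-loaded reader among
its replica hosts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Apex or Hadoop.
class BlockReaderLocality {

    /**
     * @param blockHosts block id -> hosts that hold a replica of that block
     *                   (assumed to come from the NameNode)
     * @return host -> block ids routed to the reader planned on that host
     */
    static Map<String, List<Long>> plan(Map<Long, List<String>> blockHosts) {
        Map<String, List<Long>> readers = new TreeMap<>();

        // One planned reader per data node that holds at least one replica.
        for (List<String> hosts : blockHosts.values()) {
            for (String host : hosts) {
                readers.putIfAbsent(host, new ArrayList<>());
            }
        }

        // Route each block to the least-loaded reader among its replica hosts.
        // Ties go to the host listed first.
        for (Map.Entry<Long, List<String>> entry : blockHosts.entrySet()) {
            String best = null;
            for (String host : entry.getValue()) {
                if (best == null
                        || readers.get(host).size() < readers.get(best).size()) {
                    best = host;
                }
            }
            if (best != null) {
                readers.get(best).add(entry.getKey());
            }
        }
        return readers;
    }
}
```

Each key of the returned map would correspond to one BlockReader partition
whose LOCALITY_HOST attribute is set to that host. That wiring, and the
container allocation cost Thomas mentions, are left out of the sketch.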

Chandni


On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]> wrote:

> It is already possible to request a specific host for a partition.
>
> But you may want to evaluate the cost of container allocation and the need
> to reset the entire DAG against the benefits that you get from data locality.
>
> --
> sent from mobile
> On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
>
> > Hi Pramod,
> >
> > I thought about this and IMO one way to achieve this a little more
> > efficiently is by providing some support from the platform and
> > intelligent partitioning in BlockReader.
> >
> > 1. Platform support: A partition should be able to express on which node
> > it should be created. The application master then requests the RM to
> > deploy the partition on that node.
> >
> > 2. Initially just one instance of Block Reader is created. When it
> > receives BlockMetadata, it can derive where the new HDFS blocks are. So it
> > can create more partitions if there isn't a BlockReader already running on
> > that node.
> >
> > I would like to take it up if there is some consensus to support this.
> >
> > Chandni
> >
> > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde <[email protected]>
> > wrote:
> >
> > > So the requirement is to mix runtime and deployment decisions.
> > > How about allowing the operators to request re-deployment based on the
> > > runtime condition?
> > >
> > >
> > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni <[email protected]> wrote:
> > >
> > > > The file splitter, block reader combination allows for parallel reading
> > > > of files by multiple partitions by dividing the files into blocks. Does
> > > > anyone have any ideas on how to have the block readers be data local to
> > > > the blocks they are reading?
> > > >
> > > > I think we will need to spawn block readers on all nodes where the
> > > > blocks are present (if the readers are reading multiple files, this
> > > > could mean all the nodes in the cluster) and route the block meta
> > > > information to the appropriate block reader.
> > > >
> > > > Thanks
> > > >
> > >
> >
>
