The problem is that you cannot reallocate partitions without resetting the
downstream dependencies. We are discussing a long-running app, not a
MapReduce job.

What use case has sparked this discussion? Is there a specific request for
such a feature?

Thanks,
Thomas



On Mon, May 9, 2016 at 5:41 PM, Amol Kekre <[email protected]> wrote:

> Spawning block readers on all data nodes will cause scale issues. For
> example, on a 1000-data-node cluster we cannot ask for 1000 containers for
> a file that has, say, 8 blocks. MapReduce has already solved this problem;
> ideally we would reuse that part of MapReduce, but I am not sure it can be
> reused. Assuming we cannot reuse it, I am covering possible cases to
> consider.
>
> At a high level, the operator wants to spawn containers where the data is.
> Given that, it becomes a resource-ask call. We do have LOCALITY_HOST to
> start with in the pre-launch phase, so that is a great start. As the RM is
> asked for resources, we need to consider the implications. I am listing
> some here (there could be more):
> 1. That node does not have a container available: the RM rejects the
> request or allocates another node
> 2. It may not be desirable to put two partitions on the same node
> // Apex already has a strong RM resource-ask protocol, so Apex is again on
> good ground
>
> We can improve a bit more, since HDFS blocks usually have three replicas.
> We can improve the probability of success by listing all three in
> LOCALITY_HOST. Here it gets slightly complicated: note that requesting all
> three nodes and getting two back (in the worst-case scenario) per
> partition taxes the RM, so that should be avoided.
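>
> As a minimal sketch of looking up those replica hosts per block (this uses
> the standard Hadoop FileSystem API; the class and method names are just
> illustrative):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.BlockLocation;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class BlockHostLookup
>   {
>     public static void printReplicaHosts(String file) throws Exception
>     {
>       FileSystem fs = FileSystem.get(new Configuration());
>       FileStatus status = fs.getFileStatus(new Path(file));
>       // One BlockLocation per block; getHosts() returns the replica hosts
>       // (typically three) that could seed the LOCALITY_HOST candidates.
>       for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
>         System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
>       }
>     }
>   }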
>
> The solution could be something along the following lines (a rough sketch
> follows the list):
> - Get node affinity from each partition, with 3 choices
> - Create a first list of nodes that satisfies #1 and #2 above (or more
> constraints)
> - Then iterate till a solution is found
>   -> Ask the RM for node selections
>   -> Check if the containers returned fit the solution
>   -> Repeat till a good fit is found, or for at most N iterations
> - End iteration
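>
> A rough sketch of that loop in Java; preferredHostsSatisfyingConstraints,
> askResourceManager, and fitsConstraints are hypothetical placeholders
> standing in for the real logic, not Apex or YARN APIs:
>
>   final int N = 2; // iteration cap, as discussed below
>   // Candidate hosts that already satisfy constraints #1 and #2 above.
>   List<String> candidates = preferredHostsSatisfyingConstraints(partitions);
>   List<Container> allocation = null;
>   for (int i = 0; i < N && allocation == null; i++) {
>     // Hypothetical resource ask against the RM for the candidate hosts.
>     List<Container> returned = askResourceManager(candidates);
>     if (fitsConstraints(returned, partitions)) {
>       allocation = returned; // good fit found; stop iterating
>     }
>   }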
>
> There are some more optimizations possible during the iterations: if
> node-local does not work, try rack-local. For now, attempting node-local
> placement (say N times, where N may be as low as 2) would be a great
> start.
>
> Thanks,
> Amol
>
>
> On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]>
> wrote:
>
> > It is already possible to request a specific host for a partition.
> >
> > That's true. I just saw that a Partition contains a map of attributes,
> > and that can contain LOCALITY_HOST.
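> >
> > For illustration, a minimal sketch of what that could look like inside a
> > Partitioner, assuming LOCALITY_HOST is the OperatorContext attribute
> > mentioned above (the host name is a placeholder):
> >
> >   Partitioner.Partition<BlockReader> partition =
> >       new DefaultPartition<>(new BlockReader());
> >   // Hint to the platform that this partition prefers a specific host.
> >   partition.getAttributes().put(OperatorContext.LOCALITY_HOST,
> >       "datanode-07.example.com");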
> >
> >
> > But you may want to evaluate the cost of container allocation and the
> > need to reset the entire DAG against the benefits that you get from data
> > locality.
> >
> > I see. So instead of spawning Block Readers on all the nodes (Pramod's
> > proposal), we can spawn Block Readers on just the data nodes.
> >
> > We can then have an HDFS-specific module that finds all the data nodes
> > by talking to the NameNode, and creates BlockReader partitions using
> > that.
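> >
> > One way such a module could find the data nodes is to ask the NameNode
> > through DistributedFileSystem (a sketch; error handling omitted):
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.hdfs.DistributedFileSystem;
> >   import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
> >
> >   DistributedFileSystem dfs =
> >       (DistributedFileSystem) FileSystem.get(new Configuration());
> >   // Each DatanodeInfo describes one data node in the cluster.
> >   for (DatanodeInfo dn : dfs.getDataNodeStats()) {
> >     System.out.println(dn.getHostName());
> >   }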
> >
> > Chandni
> >
> >
> > On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]>
> > wrote:
> >
> > > It is already possible to request a specific host for a partition.
> > >
> > > But you may want to evaluate the cost of container allocation and the
> > > need to reset the entire DAG against the benefits that you get from
> > > data locality.
> > >
> > > --
> > > sent from mobile
> > > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
> > >
> > > > Hi Pramod,
> > > >
> > > > I thought about this, and IMO one way to achieve this a little more
> > > > efficiently is with some support from the platform plus intelligent
> > > > partitioning in the BlockReader.
> > > >
> > > > 1. Platform support: a partition should be able to express on which
> > > > node it should be created. The application master then requests the
> > > > RM to deploy the partition on that node.
> > > >
> > > > 2. Initially just one instance of the Block Reader is created. When
> > > > it receives BlockMetadata, it can derive where the new HDFS blocks
> > > > are, so it can create more partitions wherever a BlockReader is not
> > > > already running (see the sketch below).
> > > >
> > > > I would like to take this up if there is some consensus to support
> > > > it.
> > > >
> > > > Chandni
> > > >
> > > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde <[email protected]>
> > > > wrote:
> > > >
> > > > > So the requirement is to mix runtime and deployment decisions.
> > > > > How about allowing the operators to request re-deployment based on
> > > > > runtime conditions?
> > > > >
> > > > >
> > > > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > The file splitter and block reader combination allows parallel
> > > > > > reading of files by multiple partitions by dividing the files
> > > > > > into blocks. Does anyone have any ideas on how to make the block
> > > > > > readers data-local to the blocks they are reading?
> > > > > >
> > > > > > I think we will need to spawn block readers on all nodes where
> > > > > > the blocks are present (if the readers are reading multiple
> > > > > > files, this could mean all the nodes in the cluster) and route
> > > > > > the block meta information to the appropriate block reader.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>
