The underlying assumption that each compute node is also a data node isn't correct either. I would like to see a better analysis of pros and cons in this discussion.
Thomas

On Tue, May 10, 2016 at 12:00 AM, Thomas Weise <[email protected]> wrote:

> The problem is that you cannot reallocate partitions without resetting
> the downstream dependencies. We are discussing a long-running app, not a
> MapReduce job.
>
> What use case has sparked this discussion? Is there a specific request
> for such a feature?
>
> Thanks,
> Thomas
>
>
> On Mon, May 9, 2016 at 5:41 PM, Amol Kekre <[email protected]> wrote:
>
>> Spawning block readers on all data nodes will cause scale issues. For
>> example, on a 1000-data-node cluster we cannot ask for 1000 containers
>> for a file that has, say, 8 blocks. This problem has already been solved
>> by MapReduce; ideally we would reuse that part of MapReduce, but I am
>> not sure it can be re-used. Assuming we cannot, I am covering possible
>> cases to consider.
>>
>> At a high level, the operator wants to spawn containers where the data
>> is, so this becomes a resource-ask problem. We do have LOCALITY_HOST to
>> start with in the pre-launch phase, so that is a great start. As the RM
>> is asked for resources, we need to consider the implications. I am
>> listing some here (there could be more):
>> 1. That node does not have a container available: the RM rejects the
>> request or allocates another node.
>> 2. It may not be desirable to put two partitions on the same node.
>> // Apex already has a strong RM resource-ask protocol, so Apex is again
>> on good ground here.
>>
>> We can improve a bit more, as HDFS blocks usually have three copies. We
>> can improve the probability of success by listing all three in
>> LOCALITY_HOST. Here it gets slightly complicated: requesting all three
>> nodes and getting back 2 per partition (in the worst-case scenario)
>> taxes the RM, so that should be avoided.
>>
>> The solution could be something along the following lines:
>> - Get node affinity from each partition, with 3 choices
>> - Create a first list of nodes that satisfies #1 and #2 above (or more
>> constraints)
>> - Then iterate until a solution is found:
>>   -> Ask the RM for node selections
>>   -> Check if the containers returned fit the solution
>>   -> Repeat until a good fit is found, or for at most N iterations
>> - End iteration
>>
>> There are some more optimizations during the iterations: if node-local
>> does not work, try rack-local. For now, attempting node-local placement
>> (say N times, where N may be as low as 2) would be a great start.
>>
>> Thks,
>> Amol
>>
>>
>> On Mon, May 9, 2016 at 4:22 PM, Chandni Singh <[email protected]>
>> wrote:
>>
>> > It is already possible to request a specific host for a partition.
>> >
>> > That's true. I just saw that a Partition contains a map of attributes
>> > and that can contain LOCALITY_HOST.
>> >
>> > But you may want to evaluate the cost of container allocation and the
>> > need to reset the entire DAG against the benefits that you get from
>> > data locality.
>> >
>> > I see. So instead of spawning a Block Reader on all the nodes
>> > (Pramod's proposal), we can spawn a Block Reader on all the data
>> > nodes.
>> >
>> > We can then have an HDFS-specific module which finds all the data
>> > nodes by talking to the NameNode and creates BlockReader partitions
>> > using that.
>> >
>> > Chandni
>> >
>> >
>> > On Mon, May 9, 2016 at 3:59 PM, Thomas Weise <[email protected]>
>> > wrote:
>> >
>> > > It is already possible to request a specific host for a partition.
>> > >
>> > > But you may want to evaluate the cost of container allocation and
>> > > the need to reset the entire DAG against the benefits that you get
>> > > from data locality.
>> > >
>> > > --
>> > > sent from mobile
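As a rough illustration of the lookup described above (Amol's idea of listing all replica hosts in LOCALITY_HOST, and Chandni's HDFS-specific module that talks to the NameNode), the sketch below collects the replica hosts for every block of a file using the standard Hadoop client API. The class and method names are placeholders; feeding the returned hosts into the resource requests is the Apex-side step being discussed and is not shown here.

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHostLookup
    {
      /**
       * Maps each block's offset to the datanode hosts that hold a replica
       * of that block (normally three hosts per block).
       */
      public static Map<Long, String[]> replicaHostsPerBlock(String file, Configuration conf)
          throws IOException
      {
        Map<Long, String[]> hostsByOffset = new LinkedHashMap<Long, String[]>();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(file));
        // One BlockLocation per HDFS block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          hostsByOffset.put(block.getOffset(), block.getHosts());
        }
        return hostsByOffset;
      }
    }

Each entry gives one partition's candidate hosts; per the discussion above, a request would list all of them so that a rejection on one datanode can fall back to another replica before resorting to rack-local placement.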
>> > > On May 9, 2016 2:59 PM, "Chandni Singh" <[email protected]> wrote:
>> > >
>> > > > Hi Pramod,
>> > > >
>> > > > I thought about this, and IMO one way to achieve it a little more
>> > > > efficiently is by providing some support from the platform plus
>> > > > intelligent partitioning in BlockReader.
>> > > >
>> > > > 1. Platform support: a partition should be able to express on
>> > > > which node it should be created. The application master then
>> > > > requests the RM to deploy the partition on that node.
>> > > >
>> > > > 2. Initially just one instance of Block Reader is created. When it
>> > > > receives BlockMetadata, it can derive where the new HDFS blocks
>> > > > are, so it can create more partitions if there isn't already a
>> > > > BlockReader running on that node.
>> > > >
>> > > > I would like to take this up if there is some consensus to support
>> > > > it.
>> > > >
>> > > > Chandni
>> > > >
>> > > > On Mon, May 9, 2016 at 2:56 PM, Sandesh Hegde
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > So the requirement is to mix runtime and deployment decisions.
>> > > > > How about allowing the operators to request re-deployment based
>> > > > > on the runtime condition?
>> > > > >
>> > > > >
>> > > > > On Mon, May 9, 2016 at 2:33 PM Pramod Immaneni
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > The file splitter / block reader combination allows for
>> > > > > > parallel reading of files by multiple partitions by dividing
>> > > > > > the files into blocks. Does anyone have ideas on how to make
>> > > > > > the block readers data-local to the blocks they are reading?
>> > > > > >
>> > > > > > I think we will need to spawn block readers on all nodes where
>> > > > > > the blocks are present (if the readers are reading multiple
>> > > > > > files, this could mean all the nodes in the cluster) and route
>> > > > > > the block meta information to the appropriate block reader.
>> > > > > >
>> > > > > > Thanks
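To make Chandni's first suggestion above (a partition expressing the node on which it should be created) concrete, here is a minimal sketch of a Partitioner that creates one reader partition per candidate host and tags it with the LOCALITY_HOST attribute, so the application master can request a container on that node. It assumes the Apex Partitioner/DefaultPartition API and the OperatorContext.LOCALITY_HOST attribute referenced in this thread; HostLocalPartitioner itself, OperatorFactory, and the way the host list is obtained (for example from the block-location lookup sketched earlier) are illustrative, not existing pieces of the library.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Map;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultPartition;
    import com.datatorrent.api.Operator;
    import com.datatorrent.api.Partitioner;

    public class HostLocalPartitioner<T extends Operator> implements Partitioner<T>
    {
      /** Hypothetical factory so the sketch stays independent of a concrete reader. */
      public interface OperatorFactory<T>
      {
        T newOperator();
      }

      private final Collection<String> candidateHosts;
      private final OperatorFactory<T> factory;

      public HostLocalPartitioner(Collection<String> candidateHosts, OperatorFactory<T> factory)
      {
        this.candidateHosts = candidateHosts;
        this.factory = factory;
      }

      @Override
      public Collection<Partition<T>> definePartitions(Collection<Partition<T>> partitions,
          PartitioningContext context)
      {
        Collection<Partition<T>> newPartitions = new ArrayList<Partition<T>>();
        for (String host : candidateHosts) {
          Partition<T> partition = new DefaultPartition<T>(factory.newOperator());
          // Placement preference for this partition; the app master turns it into
          // a host-specific container request, which the RM may still decline.
          partition.getAttributes().put(OperatorContext.LOCALITY_HOST, host);
          newPartitions.add(partition);
        }
        return newPartitions;
      }

      @Override
      public void partitioned(Map<Integer, Partition<T>> partitions)
      {
        // No bookkeeping needed for this sketch.
      }
    }

If the RM cannot honor a requested host, the retry and rack-local fallback described earlier in the thread would apply before giving up on locality for that partition.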
