Re: distributed cache

Harsh J Wed, 26 Dec 2012 02:21:53 -0800

HDFS does not restrict reads of a block to a single client at a time
(there is no reason to), so a single replica can serve many downloads
concurrently.
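
On the r + n/r estimate quoted below: treating r as continuous,
d/dr (r + n/r) = 1 - n/r^2, which is zero at r = sqrt(n), and that is
where the SQRT(n) figure comes from. Keep in mind that the model assumes
each replica serves one download at a time, which HDFS does not enforce,
so it is a pessimistic estimate. A minimal Python sketch to check the
arithmetic (the names total_ops and best_replication are made up for
illustration and are not part of Hadoop):

```python
import math


def total_ops(n, r):
    """Simplified cost from the thread: r upload copies plus n/r download rounds.

    Assumes each replica serves only one download at a time, which is a
    modelling assumption and not how HDFS actually behaves.
    """
    return r + n / r


def best_replication(n):
    """Brute-force the integer r in 1..n that minimizes total_ops(n, r)."""
    return min(range(1, n + 1), key=lambda r: total_ops(n, r))


# Compare the brute-force optimum with the integer square root of n.
for n in (25, 100, 400):
    r = best_replication(n)
    print(n, r, math.isqrt(n), total_ops(n, r))
```

For n = 100 TaskTrackers this picks r = 10 with a cost of 20, which lines
up with the default of 10 mentioned in the thread.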


On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <lin...@gmail.com> wrote:
> Thanks Harsh,
>
> Suppose the DistributedCache file is uploaded by the client. In the Hadoop
> design, can each replica serve only one download session at a time (a
> download by a mapper or reducer that requires the DistributedCache file)
> until that download completes, or can it serve multiple concurrent
> download sessions (downloads by multiple mappers or reducers that require
> the DistributedCache file)?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> DistributedCache files are first stored on HDFS by the client.
>> The TaskTrackers then download and localize them. Therefore, as with
>> any other file on HDFS, "downloads" parallelize more efficiently with
>> a higher replication factor.
>>
>> The point of having higher replication for these files is also tied to
>> the concept of racks in a cluster - you would want more replicas
>> spread across racks such that on task bootup the downloads happen with
>> rack locality.
>>
>> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <lin...@gmail.com> wrote:
>> > Hi Kai,
>> >
>> > Smart answer! :-)
>> >
>> > Your assumption is that one distributed cache replica can serve only
>> > one download session from a TaskTracker node at a time (this is why you
>> > get concurrency n/r). My question is: why can't one distributed cache
>> > replica serve multiple concurrent download sessions? For example,
>> > suppose a TaskTracker takes elapsed time t to download a file from a
>> > specific replica. Could 2 TaskTrackers download from that replica in
>> > parallel in elapsed time t, or 1.5t, which is faster than the
>> > sequential download time of 2t you mentioned before?
>> >
>> > "In total, r+n/r concurrent operations. If you optimize r depending on
>> > n, SQRT(n) is the optimal replication level." -- how do you get SQRT(n)
>> > as the value that minimizes r+n/r? I would appreciate it if you could
>> > point me to more details.
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Simple math. Assume you have n TaskTrackers in your cluster that will
>> >> need to access the files in the distributed cache, and r is the
>> >> replication level of those files.
>> >>
>> >> Copying the files into HDFS requires r copy operations over the
>> >> network. The n TaskTrackers need to get their local copies from HDFS,
>> >> so the n TaskTrackers copy from r DataNodes, which means n/r concurrent
>> >> operations. In total, r+n/r concurrent operations. If you optimize r
>> >> depending on n, SQRT(n) is the optimal replication level. So 10 is a
>> >> reasonable default setting for most clusters that are not 500+ nodes
>> >> big.
>> >>
>> >> Kai
>> >>
>> >> On 22.12.2012 at 13:46, Lin Ma <lin...@gmail.com> wrote:
>> >>
>> >> Thanks Kai. What is the purpose of using a higher replication count?
>> >>
>> >> regards,
>> >> Lin
>> >>
>> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On 22.12.2012 at 13:03, Lin Ma <lin...@gmail.com> wrote:
>> >>>
>> >>> > I want to confirm that when a mapper or reducer on a task node
>> >>> > accesses a distributed cache file, the file resides on disk, not in
>> >>> > memory. I just want to make sure the distributed cache file is not
>> >>> > fully loaded into memory, where it would compete for memory with
>> >>> > the mapper/reducer tasks. Is that correct?
>> >>>
>> >>>
>> >>> Yes, you are correct. The JobTracker will put files for the
>> >>> distributed cache into HDFS with a higher replication count (10 by
>> >>> default). Whenever a TaskTracker needs those files for a task it is
>> >>> launching locally, it will fetch a copy to its local disk. So it
>> >>> won't need to do this again for future tasks on this node. After a
>> >>> job is done, all local copies and the HDFS copies of files in the
>> >>> distributed cache are cleaned up.
>> >>>
>> >>> Kai
>> >>>
>> >>> --
>> >>> Kai Voigt
>> >>> k...@123.org
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Kai Voigt
>> >> k...@123.org
>> >>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
