HDFS does not restrict reads of a block to a single client at a time (there is no reason to), so several TaskTrackers can download from the same replica concurrently.
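On the r + n/r question in the quoted thread: under Kai's simplified model (r copy operations to upload, then n/r download rounds per replica, which assumes one download per replica at a time), the cost is f(r) = r + n/r. Setting the derivative 1 - n/r^2 to zero gives r = sqrt(n). A minimal sketch that checks this numerically follows; the function names are made up for illustration, and the model is only Kai's approximation, not how HDFS actually schedules reads.

```python
import math


def total_ops(n: int, r: int) -> float:
    """Cost in Kai's simplified model: r upload copies plus n/r download rounds."""
    return r + n / r


def best_replication(n: int) -> int:
    """Brute-force the integer r in 1..n that minimizes r + n/r."""
    return min(range(1, n + 1), key=lambda r: total_ops(n, r))


if __name__ == "__main__":
    # Compare the brute-force optimum with the integer square root of n.
    for n in (100, 400, 2500):
        r = best_replication(n)
        print(f"n={n}: best r={r}, isqrt(n)={math.isqrt(n)}, cost={total_ops(n, r)}")
```

Since replicas can in fact serve concurrent readers, the real download phase is bounded by disk and network bandwidth rather than by n/r sequential rounds, so treat sqrt(n) as a rough guide only.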

On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <lin...@gmail.com> wrote:
> Thanks Harsh,
>
> Supposing the DistributedCache file is uploaded by the client: in the
> Hadoop design, can each replica serve only one download session at a
> time (a download from a mapper or a reducer that requires the
> DistributedCache file) until that download completes, or can it serve
> multiple concurrent download sessions (downloads from multiple mappers
> or reducers that require the file)?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Lin,
>>
>> DistributedCache files are stored onto the HDFS by the client first.
>> The TaskTrackers download and localize them. Therefore, as with any
>> other file on HDFS, "downloads" can be efficiently parallel with
>> higher replicas.
>>
>> The point of having higher replication for these files is also tied to
>> the concept of racks in a cluster - you would want more replicas
>> spread across racks such that on task bootup the downloads happen with
>> rack locality.
>>
>> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <lin...@gmail.com> wrote:
>> > Hi Kai,
>> >
>> > Smart answer! :-)
>> >
>> > Your assumption is that one distributed cache replica can serve only
>> > one download session for a tasktracker node at a time (this is why
>> > you get concurrency n/r). The question is, why can't one distributed
>> > cache replica serve multiple concurrent download sessions? For
>> > example, supposing a tasktracker takes elapsed time t to download a
>> > file from a specific distributed cache replica, is it possible for 2
>> > tasktrackers to download from that replica in parallel in elapsed
>> > time t as well, or 1.5t, which is faster than the sequential
>> > download time 2t you mentioned before?
>> >
>> > "In total, r+n/r concurrent operations. If you optimize r depending
>> > on n, SQRT(n) is the optimal replication level." -- how do you get
>> > SQRT(n) as the minimizer of r+n/r? I would appreciate it if you
>> > could point me to more details.
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:
>> >>
>> >> Hi,
>> >>
>> >> simple math. Assume you have n TaskTrackers in your cluster that
>> >> will need to access the files in the distributed cache, and r is
>> >> the replication level of those files.
>> >>
>> >> Copying the files into HDFS requires r copy operations over the
>> >> network. The n TaskTrackers need to get their local copies from
>> >> HDFS, so the n TaskTrackers copy from r DataNodes, so n/r
>> >> concurrent operations. In total, r+n/r concurrent operations. If
>> >> you optimize r depending on n, SQRT(n) is the optimal replication
>> >> level. So 10 is a reasonable default setting for most clusters
>> >> that are not 500+ nodes big.
>> >>
>> >> Kai
>> >>
>> >> On 22.12.2012 at 13:46, Lin Ma <lin...@gmail.com> wrote:
>> >>
>> >> Thanks Kai, using higher replication count for the purpose of?
>> >>
>> >> regards,
>> >> Lin
>> >>
>> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On 22.12.2012 at 13:03, Lin Ma <lin...@gmail.com> wrote:
>> >>>
>> >>> > I want to confirm that when a mapper or reducer on a task node
>> >>> > accesses a distributed cache file, the file resides on disk, not
>> >>> > in memory. I just want to make sure the distributed cache file
>> >>> > is not fully loaded into memory, where it would compete for
>> >>> > memory with mapper/reducer tasks. Is that correct?
>> >>>
>> >>>
>> >>> Yes, you are correct. The JobTracker will put files for the
>> >>> distributed cache into HDFS with a higher replication count (10 by
>> >>> default). Whenever a TaskTracker needs those files for a task it
>> >>> is launching locally, it will fetch a copy to its local disk, so
>> >>> it won't need to do this again for future tasks on this node.
>> >>> After a job is done, all local copies and the HDFS copies of files
>> >>> in the distributed cache are cleaned up.
>> >>>
>> >>> Kai
>> >>>
>> >>> --
>> >>> Kai Voigt
>> >>> k...@123.org
>> >>>
>> >>
>> >>
>> >> --
>> >> Kai Voigt
>> >> k...@123.org
>> >>
>> >
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J