Hi Kai, Smart answer! :-)
- The assumption you have is that one distributed cache replica can only serve one download session per TaskTracker node (this is why you get concurrency n/r). The question is, why can't one distributed cache replica serve multiple concurrent download sessions? For example, supposing a TaskTracker takes elapsed time t to download a file from a specific distributed cache replica, is it possible for 2 TaskTrackers to download from that replica in parallel, also in elapsed time t, or in 1.5t, which is still faster than the sequential download time 2t you mentioned before?

- "In total, r+n/r concurrent operations. If you optimize r depending on n, SQRT(n) is the optimal replication level." -- how do you get SQRT(n) when minimizing r+n/r? I would appreciate it if you could point me to more details.

regards,
Lin

On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k...@123.org> wrote:

> Hi,
>
> simple math. Assume you have n TaskTrackers in your cluster that will
> need to access the files in the distributed cache, and r is the
> replication level of those files.
>
> Copying the files into HDFS requires r copy operations over the network.
> The n TaskTrackers need to get their local copies from HDFS, so the n
> TaskTrackers copy from r DataNodes, giving n/r concurrent operations. In
> total, that is r+n/r concurrent operations. If you optimize r depending
> on n, SQRT(n) is the optimal replication level. So 10 is a reasonable
> default setting for most clusters that are not 500+ nodes big.
>
> Kai
>
> On 22.12.2012, at 13:46, Lin Ma <lin...@gmail.com> wrote:
>
> Thanks Kai, what is the purpose of using a higher replication count?
>
> regards,
> Lin
>
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k...@123.org> wrote:
>
>> Hi,
>>
>> On 22.12.2012, at 13:03, Lin Ma <lin...@gmail.com> wrote:
>>
>> > I want to confirm that when a mapper or reducer on a task node
>> accesses a distributed cache file, the file resides on disk, not in
>> memory. I just want to make sure a distributed cache file is not fully
>> loaded into memory, where it would compete for memory with the
>> mapper/reducer tasks. Is that correct?
>>
>> Yes, you are correct. The JobTracker will put files for the distributed
>> cache into HDFS with a higher replication count (10 by default).
>> Whenever a TaskTracker needs those files for a task it is launching
>> locally, it will fetch a copy to its local disk, so it won't need to do
>> this again for future tasks on this node. After a job is done, all
>> local copies and the HDFS copies of files in the distributed cache are
>> cleaned up.
>>
>> Kai
>>
>> --
>> Kai Voigt
>> k...@123.org
>
> --
> Kai Voigt
> k...@123.org
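
To see where SQRT(n) comes from: the total r + n/r is minimized where its derivative with respect to r vanishes, i.e. 1 - n/r^2 = 0, which gives r = sqrt(n). A minimal Java sketch that checks this numerically; the cluster size n = 100 is an illustrative assumption, not a value from the thread:

    // Minimal sketch: brute-force check that r + n/r is minimized near sqrt(n).
    public class ReplicationLevel {

        // r copy operations to write r replicas into HDFS, plus n/r concurrent
        // reads when n TaskTrackers fetch their local copies from r replicas.
        static double totalOps(int n, int r) {
            return r + (double) n / r;
        }

        public static void main(String[] args) {
            int n = 100; // hypothetical number of TaskTrackers
            int bestR = 1;
            for (int r = 2; r <= n; r++) {
                if (totalOps(n, r) < totalOps(n, bestR)) {
                    bestR = r;
                }
            }
            System.out.println(bestR);              // 10
            System.out.println((int) Math.sqrt(n)); // 10
        }
    }

For n = 100 the brute-force search and sqrt(n) agree on r = 10, which matches the default replication level Kai mentions.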
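
For context on the mechanics Kai describes, a hedged sketch of the Hadoop 1.x-era DistributedCache API; the file path is hypothetical, and mapred.submit.replication is the property behind the default of 10:

    // Sketch assuming the Hadoop 1.x API discussed in the thread.
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSketch {

        // At job submission time: register the file. The framework copies it
        // into HDFS with the elevated replication count (10 by default, via
        // mapred.submit.replication), so many TaskTrackers can fetch it.
        static void register(Configuration conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/user/lin/lookup.dat"), conf);
        }

        // Inside a task (e.g. a Mapper's configure()): the TaskTracker has
        // already pulled the file to its local disk; tasks open it from there,
        // which is why it does not compete with task memory.
        static Path[] localCopies(Configuration conf) throws IOException {
            return DistributedCache.getLocalCacheFiles(conf);
        }
    }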