Thanks all; what I am trying to do is have a set of text files (say, Gutenberg books) stored on HDFS, processed by Hadoop, and web-servable by httpd. So I don't want to have the files compressed (or should I?). I am now writing a custom MultiFileInputFormat to process a customizable number of files at once (it seems this also requires an adequate RecordReader?).
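Here is a minimal sketch of what I have so far. All class names and the files-per-split property are my own, and I am writing against the 0.15-era org.apache.hadoop.mapred API from memory (the base class may still be named InputFormatBase on 15.1), so please treat it as untested:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** A split carrying several whole files. */
class MultiFileSplit implements InputSplit {
  private Path[] paths = new Path[0];
  private long[] lengths = new long[0];

  public MultiFileSplit() {}  // no-arg constructor, needed for deserialization

  public MultiFileSplit(Path[] paths, long[] lengths) {
    this.paths = paths;
    this.lengths = lengths;
  }

  public Path[] getPaths() { return paths; }

  public long getLength() {
    long total = 0;
    for (int i = 0; i < lengths.length; i++) total += lengths[i];
    return total;
  }

  // No locality hints yet, which gives up data-local scheduling.
  public String[] getLocations() { return new String[0]; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(paths.length);
    for (int i = 0; i < paths.length; i++) {
      Text.writeString(out, paths[i].toString());
      out.writeLong(lengths[i]);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    paths = new Path[n];
    lengths = new long[n];
    for (int i = 0; i < n; i++) {
      paths[i] = new Path(Text.readString(in));
      lengths[i] = in.readLong();
    }
  }
}

/** Packs N whole files into each split; N is read from the job conf. */
public class MyMultiFileInputFormat extends FileInputFormat {

  // Defensive: the default getSplits would chop files at block
  // boundaries without this.
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  public InputSplit[] getSplits(JobConf job, int numSplits)
      throws IOException {
    int perSplit = job.getInt("my.files.per.split", 10);  // my own knob
    Path[] files = listPaths(job);  // all input files
    FileSystem fs = FileSystem.get(job);
    List splits = new ArrayList();
    for (int i = 0; i < files.length; i += perSplit) {
      int n = Math.min(perSplit, files.length - i);
      Path[] group = new Path[n];
      long[] lens = new long[n];
      for (int j = 0; j < n; j++) {
        group[j] = files[i + j];
        // getFileStatus().getLen() on newer releases
        lens[j] = fs.getLength(files[i + j]);
      }
      splits.add(new MultiFileSplit(group, lens));
    }
    return (InputSplit[]) splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    // The matching RecordReader must open each path of the
    // MultiFileSplit in turn; that is the part I am still writing.
    throw new UnsupportedOperationException("RecordReader not done yet");
  }
}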
I already have the TaskTracker's maximum number of map tasks set to 20, and still the number of "running" map tasks (i.e., running concurrently?) on the JobTracker web info page is 8, that is 2 per node on a 4-node cluster, though it should be 4 * 20 = 80, right? This is regardless of how many maps are pending or the total number of maps (i.e., regardless of file splits / block size). Also, although the web page shows 2 "running" tasks per node, the jps command actually lists 4 TaskTracker child processes per node. For reference, my hadoop-site.xml entry is below.
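This is what I have on each node; I am assuming the property name from the 0.15-era default config, so please correct me if 15.1 uses a different one (I believe older releases had a combined mapred.tasktracker.tasks.maximum with a default of 2, which would explain the 2 tasks per node I am seeing):

<!-- Caps the number of map tasks run concurrently per TaskTracker. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>20</value>
</property>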
K. Honsali

On 24/01/2008, Amar Kamat <[EMAIL PROTECTED]> wrote:
>
> Khalil Honsali wrote:
> > Hi,
> >
> > I am experiencing a similar problem, even after varying [blocksize],
> > [splitsize] and [num map tasks] in both the API and hadoop-site.xml; the
> > num of map tasks was 8 instead of the expected 20 on a 4-node cluster.
> >
> > I am working with text files; there is an issue about this where the
> > solution suggests zipping the files so that a single zip >> block.
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg02836.html
> >
> > However, I still don't understand two issues:
> > - what is the relation between num files, file size, block size, split
> > size and num map tasks?
> >
> The only things that matter are the block size and the split size. A block
> is the basic storage unit on the DFS, while splits form the basic unit for
> maps. An input file can be split into smaller chunks, each of block size,
> and stored on the DFS. In an InputFormat you define what a split is and
> hence determine the total number of maps. You can provide your own input
> format and control the maps. For example, when I wanted to write code for
> inverted indexing, I wrote an InputFormat that treats a file as a
> non-splittable entity to be processed as a whole. In that case #maps =
> num files in my input directory.
>
> > - what if I wanted to serve the text files directly from HDFS over HTTP?
> > I don't want to zip and unzip them each time, right? How do I configure
> > Hadoop so that it works best with small files directly (maybe it is not
> > designed for that?)
> >
> What exactly are you trying to achieve?
>
> > Finally, I wonder if it would be useful to have a tool for estimating
> > optimum performance based on the workload parameters, instead of manual
> > trial and error.
> >
> No idea.
> Amar
>
> > Thanks very much!
> >
> > On 23/01/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >> Setting the number of maps lower than would otherwise be used is useful
> >> if you have a job that should not clog up the cluster. If you don't
> >> need it to run quickly, then you can set m = N / 5 or so and get slow
> >> progress with small impact on the throughput of the cluster.
> >>
> >> If and when HADOOP-2573 gets resolved, there will be a much better
> >> answer for this.
> >>
> >> On 1/22/08 8:01 PM, "Amar Kamat" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>> You can't directly control the number of maps. It is based on the
> >>> splits of the data residing on the DFS. The numbers one provides via
> >>> the command line, code, or the conf files are hints to Hadoop. I guess
> >>> the reason is that if the #maps (set externally) is less than the
> >>> #splits, we might end up migrating the data, which is a performance
> >>> hit. There could be other reasons too.
> >>> Amar
> >>> Stefan Groschupf wrote:
> >>>
> >>>> Hi,
> >>>> I have trouble setting the number of maps for a job with version
> >>>> 15.1. As far as I understand, I can configure the number of maps that
> >>>> a job will do in hadoop-site.xml on the box where I submit the job
> >>>> (that is not the JobTracker box). However, my configuration is always
> >>>> ignored. Changing the value in hadoop-site.xml on the JobTracker box
> >>>> and restarting the nodes does not help either. I also do not set the
> >>>> number via the API.
> >>>> Any idea where I might be overlooking something?
> >>>> Thanks for any hints,
> >>>> Stefan
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> 101tec Inc.
> >>>> Menlo Park, California, USA
> >>>> http://www.101tec.com
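P.S. For completeness, this is how I understand the per-job hint is passed with the 0.15 API; per Amar's explanation it is only a hint, and the InputFormat's splits decide the real number of maps. The input path is hypothetical, the input format is from my sketch above, and the mapper, reducer and output settings are omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitBooks {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SubmitBooks.class);
    job.setInputPath(new Path("/books"));             // hypothetical input dir
    job.setInputFormat(MyMultiFileInputFormat.class); // sketch from above
    job.setNumMapTasks(80);  // a hint only, not a guarantee
    JobClient.runJob(job);   // incomplete: output path etc. still needed
  }
}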