Fair enough - I write a lot of InputFormats since for most of my problems a line of text is not the proper unit - I read fasta files - read lines intil you hit a line starting with > and xml fragments - read until you hit a closing tag
On Mon, May 7, 2012 at 9:03 AM, GUOJUN Zhu <guojun_...@freddiemac.com>wrote: > > The default FileInputformat split the file according to the size. If you > use line text data, the TextFileInputformat respects the line structure for > input. We got splits as small as a few KBs. The file split is a tricky > business, especially when you want it to respect your logical boundary. It > is better to use the existing battle-test code than invent your own wheel. > > Zhu, Guojun > Modeling Sr Graduate > 571-3824370 > guojun_...@freddiemac.com > Financial Engineering > Freddie Mac > > > *Steve Lewis <lordjoe2...@gmail.com>* > > 05/07/2012 11:17 AM > Please respond to > mapreduce-user@hadoop.apache.org > > To > mapreduce-user@hadoop.apache.org > cc > Subject > Re: Ant Colony Optimization for Travelling Salesman Problem in Hadoop > > > > > Yes but it is the job of the InputFormat code to implement the behavior - > it is not necessary to do so and in other cases I choose to create more > mappers when the mapper has a lot of work > > On Mon, May 7, 2012 at 7:54 AM, GUOJUN Zhu > <*guojun_...@freddiemac.com*<guojun_...@freddiemac.com>> > wrote: > > We are using old API of 0.20. I think when you set "mapred.reduce.tasks" > with certain number N and use fileinputformat, the default behavior is that > any file will be split into that number, N, each split smaller than the > default block size. Of course, other restriction, such as > "mapred.min.split.size" cannot be set too large (default is as small as > possible I think). > > Zhu, Guojun > Modeling Sr Graduate* > **571-3824370* <571-3824370>* > **guojun_...@freddiemac.com* <guojun_...@freddiemac.com> > Financial Engineering > Freddie Mac > > *sharat attupurath <**shara...@hotmail.com* <shara...@hotmail.com>*>* > > 05/05/2012 11:37 AM > > > Please respond to* > **mapreduce-user@hadoop.apache.org* <mapreduce-user@hadoop.apache.org> > > To > <*mapreduce-user@hadoop.apache.org* <mapreduce-user@hadoop.apache.org>> > cc > Subject > RE: Ant Colony Optimization for Travelling Salesman Problem in Hadoop > > > > > > > Since the input files are very small, the default input formats in Hadoop > all generate just a single InputSplit, so only a single map task is > executed, and we wont have much parallelism. > > I was thinking of writing an InputFormat that would read the whole file as > an InputSplit and replicate this input split n times (where n would be the > number of ants in a single stage) so that we'll have n mappers. > Also I want the input format to return the value as the adjacency matrix > of the graph (calculating it from the coordinates in the input file). > > But I can't find a way to do this. Is it possible? Or is it better to just > have the input as Text and create the adjacency matrix in the mappers? > > ------------------------------ > Date: Sat, 5 May 2012 08:20:34 -0700 > Subject: Re: Ant Colony Optimization for Travelling Salesman Problem in > Hadoop > From: *lordjoe2...@gmail.com* <lordjoe2...@gmail.com> > To: *mapreduce-user@hadoop.apache.org* <mapreduce-user@hadoop.apache.org> > > yes - if you know how you can put it in distributed cache or if it is > small put in as a String in the config or have all InputFormats read it > from somewhere > > On Sat, May 5, 2012 at 8:08 AM, sharat attupurath > <*shara...@hotmail.com*<shara...@hotmail.com>> > wrote: > I looked at both the files. in AbstractNShotInputFormat it is mentioned > that this input format does not read from files. My input is in a text > file. I want the whole file as a single record. So is it enough if i just > copy the contents of the file and return it as a string from > getValueFromIndex() ? > > ------------------------------ > Date: Fri, 4 May 2012 13:15:46 -0700 > > Subject: Re: Ant Colony Optimization for Travelling Salesman Problem in > Hadoop > From: *lordjoe2...@gmail.com* <lordjoe2...@gmail.com> > To: *mapreduce-user@hadoop.apache.org* <mapreduce-user@hadoop.apache.org> > > Look at NShotInputFormat > call setNumberKeys to set the number of ants then > change the method getValueFromIndex to return the text representing the > original problem to each mapper - > what happens in the second and third jobs is an exercise but the saved > test needs to have pheremones and the original problem > > > On Fri, May 4, 2012 at 9:54 AM, sharat attupurath > <*shara...@hotmail.com*<shara...@hotmail.com>> > wrote: > Hi, > > Thanks a lot for the quick reply sir! We are new to Apache Hadoop and > haven't yet understood it properly yet. Can you please elaborate on how we > can have multiple stages of mapreduce jobs for combining the trails as you > have mentioned? > > We have been trying to find out how to write a custom splitter and almost > all online resources tell to subclass the FileInputFormat and write only a > custom RecordReader? Will it be possible to generate our splits in that way? > > Regards > > Sharat > > ------------------------------ > Date: Fri, 4 May 2012 09:22:51 -0700 > Subject: Re: Ant Colony Optimization for Travelling Salesman Problem in > Hadoop > From: *lordjoe2...@gmail.com* <lordjoe2...@gmail.com> > To: *mapreduce-user@hadoop.apache.org* <mapreduce-user@hadoop.apache.org> > > > > > On Fri, May 4, 2012 at 9:06 AM, sharat attupurath > <*shara...@hotmail.com*<shara...@hotmail.com>> > wrote: > Hi, > > We are trying to parallelize the ant colony optimization algorithm for TSP > over hadoop and are facing some issues. We are using TSPLIB as input files. > The input is a text file containing eucledian coordinates of the cities - > first column is city number and the next two columns contain the x and y > coordinates respectively. > > What we intend to do is to take input from this single file, send copies > of it to multiple mappers (each mapper acts like the ant in the algorithm), > each mapper works on its input to find its own TSP solution that it outputs > and finally the reducer outputs the smallest tour found by the mappers. > Hope we are in the right track. Here are the issues: > > 1) Since the input file is small, we need to force hadoop to fire up > multiple map tasks by replicating the input. How can we make an InputSplit > of the whole file and replicate it so that the input can be sent to > multiple mappers? > Write a custom splitter sending the same data to all mappers - the only > critical criteria is the number of "Ants" > > 2) the algorithm uses a shared pheromone array and each mapper needs to > read and write data from this. How can we share the pheromone data across > the mappers. > You cannot share data across mappers and should not attempt to do so - > Better to use the reducer(s) to combine trails combine first pass trails > and > then pass the combined trails to another mapreduce job - this with the > original problem plus the current pheremone trails > > Hope the questions are clear enough. Any help would be greatly > appreciated. > > Thank you > > Regards > > Sharat > > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 * > **206-384-1340* <206-384-1340> (cell) > Skype lordjoe_com > > > > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 * > **206-384-1340* <206-384-1340> (cell) > Skype lordjoe_com > > > > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 * > **206-384-1340* <206-384-1340> (cell) > Skype lordjoe_com > > > > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 > 206-384-1340 (cell) > Skype lordjoe_com > > > -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com