No, currently my requirement is to solve this problem with Apache Hadoop. I am trying to build this kind of inverted index and then measure its performance against other approaches.
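To make the plan concrete, here is the driver I have in mind for the "multiple MapReduce jobs" question in the thread below. With the org.apache.hadoop.mapred API, each stage gets its own JobConf, and JobClient.runJob() submits a job and blocks until it finishes, so chaining stages is just running the jobs in order, with job 2 reading job 1's output directory. This is only a sketch that would replace run() in the HadoopProgram class quoted below; ScrubMapper is a hypothetical second-pass mapper and the path layout is made up:

    // Sketch: run() that chains two jobs in sequence.
    public int run(String[] args) throws Exception {
        Path input    = new Path(args[0]);
        Path rawIndex = new Path(args[1] + "-raw");  // intermediate directory
        Path finalOut = new Path(args[1]);

        // Stage 1: build the raw inverted index with MapClass/Reduce below.
        JobConf job1 = new JobConf(getConf(), HadoopProgram.class);
        job1.setJobName("index-build");
        job1.setMapperClass(MapClass.class);
        job1.setReducerClass(Reduce.class);
        job1.setOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(Text.class);
        job1.setOutputValueClass(DocIDs.class);
        // Binary output so stage 2 can read <Text, DocIDs> records back.
        job1.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, rawIndex);
        JobClient.runJob(job1);  // blocks; throws IOException if the job fails

        // Stage 2: map-only pass over stage 1's output to scrub words.
        JobConf job2 = new JobConf(getConf(), HadoopProgram.class);
        job2.setJobName("index-scrub");
        job2.setInputFormat(SequenceFileInputFormat.class);
        job2.setMapperClass(ScrubMapper.class);  // hypothetical, not written yet
        job2.setNumReduceTasks(0);               // map-only, no shuffle needed
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(DocIDs.class);
        FileInputFormat.setInputPaths(job2, rawIndex);
        FileOutputFormat.setOutputPath(job2, finalOut);
        JobClient.runJob(job2);
        return 0;
    }

Since runJob() only returns after success, the second job never sees a partial index.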
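For the stop-word removal that Amar suggests doing with the distributed cache, my understanding is: the driver ships a stop-word file to every node with DistributedCache.addCacheFile(), and the mapper loads it once in configure() and filters tokens at map time, so stop words need no extra job at all. A sketch, where the HDFS path /user/aayush/stopwords.txt is invented for illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Driver side, before submitting the job:
    //   DistributedCache.addCacheFile(
    //       new java.net.URI("/user/aayush/stopwords.txt"), conf);
    public class StopWordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Set<String> stopWords = new HashSet<String>();
      private Text word = new Text();
      private Text doc = new Text();

      // Runs once per task: load the cached stop-word list from local disk.
      public void configure(JobConf job) {
        doc.set(job.get("map.input.file"));
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          BufferedReader in =
              new BufferedReader(new FileReader(cached[0].toString()));
          String line;
          while ((line = in.readLine()) != null) {
            stopWords.add(line.trim());
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("Could not read stop-word file", e);
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          String token = itr.nextToken();
          if (!stopWords.contains(token)) {  // drop stop words at map time
            word.set(token);
            output.collect(word, doc);
          }
        }
      }
    }

This could simply replace MapClass in the first job, which matches Amar's point that stop words do not have to cost a separate MR pass.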
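Finally, since the DocIDs value class does not appear anywhere in the thread: any custom value type has to implement Writable so the framework can serialize it between the reduce and the output file. A minimal version consistent with the setListdocs() call in the reducer could look like this (a guess at the class, not the actual code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Sketch: assumed implementation; the original DocIDs class is not shown.
    // Holds the list of document ids for one word in the inverted index.
    public class DocIDs implements Writable {

      private ArrayList<String> docs = new ArrayList<String>();

      public void setListdocs(ArrayList<String> docs) {
        this.docs = docs;
      }

      public ArrayList<String> getListdocs() {
        return docs;
      }

      // Serialized as a count followed by each id string.
      public void write(DataOutput out) throws IOException {
        out.writeInt(docs.size());
        for (String id : docs) {
          Text.writeString(out, id);
        }
      }

      public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        docs = new ArrayList<String>(n);
        for (int i = 0; i < n; i++) {
          docs.add(Text.readString(in));
        }
      }
    }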
Thanks,

On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > Hi Amar, Theodore, Arun,
> >
> > Thanks for your reply. Actually I am new to Hadoop, so I can't figure
> > out much. I have written the following code for an inverted index. It
> > maps each word of a document to the document's id,
> > e.g.: apple  file1 file123
> > The main functions of the code are:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.Iterator;
> > import java.util.StringTokenizer;
> >
> > import org.apache.hadoop.conf.Configured;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.*;
> > import org.apache.hadoop.util.Tool;
> >
> > public class HadoopProgram extends Configured implements Tool {
> >   public static class MapClass extends MapReduceBase
> >     implements Mapper<LongWritable, Text, Text, Text> {
> >
> >     private Text word = new Text();
> >     private Text doc = new Text();
> >     private long numRecords = 0;
> >     private String inputFile;
> >
> >     public void configure(JobConf job) {
> >       // Remember which input file this map task is reading.
> >       inputFile = job.get("map.input.file");
> >       System.out.println("In configure, the input file is " + inputFile);
> >     }
> >
> >     public void map(LongWritable key, Text value,
> >                     OutputCollector<Text, Text> output,
> >                     Reporter reporter) throws IOException {
> >       String line = value.toString();
> >       StringTokenizer itr = new StringTokenizer(line);
> >       doc.set(inputFile);
> >       // Emit a <word, document> pair for every token in the line.
> >       while (itr.hasMoreTokens()) {
> >         word.set(itr.nextToken());
> >         output.collect(word, doc);
> >       }
> >       if (++numRecords % 4 == 0) {
> >         System.out.println("Finished processing of input file " + inputFile);
> >       }
> >     }
> >   }
> >
> >   /**
> >    * A reducer class that collects, for each word, the list of
> >    * documents containing it.
> >    */
> >   public static class Reduce extends MapReduceBase
> >     implements Reducer<Text, Text, Text, DocIDs> {
> >
> >     // This works as K2, V2, K3, V3
> >     public void reduce(Text key, Iterator<Text> values,
> >                        OutputCollector<Text, DocIDs> output,
> >                        Reporter reporter) throws IOException {
> >       ArrayList<String> ids = new ArrayList<String>();
> >       while (values.hasNext()) {
> >         ids.add(values.next().toString());
> >       }
> >       DocIDs dc = new DocIDs();
> >       dc.setListdocs(ids);
> >       output.collect(key, dc);
> >     }
> >   }
> >
> >   public int run(String[] args) throws Exception {
> >     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
> >     conf.setJobName("invertedindex");
> >
> >     // The keys are words; the map and reduce value types differ.
> >     conf.setOutputKeyClass(Text.class);
> >     conf.setMapOutputValueClass(Text.class);
> >     conf.setOutputValueClass(DocIDs.class);
> >
> >     conf.setMapperClass(MapClass.class);
> >     conf.setReducerClass(Reduce.class);
> >
> >     FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> >     JobClient.runJob(conf);
> >     return 0;
> >   }
> > }
> >
> > Now I am getting output from the reducer like:
> > word  \root\test\test123, \root\test12
> >
> > In the next stage I want to remove stop words, scrub words, etc., and
> > I would also like the position of each word in the document. How would
> > I apply multiple maps or multilevel MapReduce jobs programmatically? I
> > guess I need to make another class or add some functions to it? I am
> > not able to figure it out. Any pointers for these types of problems?
> >
> > Thanks,
> > Aayush
> >
> >
> > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> >
> >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> >>
> >>> Hi,
> >>> I am developing a simple inverted index program with Hadoop. My map
> >>> function has the output:
> >>> <word, doc>
> >>> and the reducer has:
> >>> <word, list(docs)>
> >>>
> >>> Now I want to use one more MapReduce job to remove stop and scrub
> >>> words from this output.
> >> Use the distributed cache as Arun mentioned.
> >>> Also in the next stage I would like to have a short summary
> >>> associated with every word.
> >> Whether to use a separate MR job depends on what exactly you mean by
> >> summary. If it is like a window around the current word then you can
> >> possibly do it in one go.
> >> Amar
> >>> How should I design my program from this stage? I mean, how would I
> >>> apply multiple MapReduce jobs to this? What would be the better way
> >>> to perform this?
> >>>
> >>> Thanks,
> >>>
> >>> Regards,
> >>
> >

--
Aayush Garg, Phone: +41 76 482 240