No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build this type of inverted index and then measure its performance
against other approaches.

Thanks,


On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > Hi Amar, Theodore, Arun,
> >
> > Thanks for your reply. Actually I am new to Hadoop, so I can't figure out
> > much. I have written the following code for an inverted index. This code
> > maps each word from a document to its document ID,
> > e.g.: apple file1 file123
> > The main functions of the code are:
> >
> > public class HadoopProgram extends Configured implements Tool {
> > public static class MapClass extends MapReduceBase
> >     implements Mapper<LongWritable, Text, Text, Text> {
> >
> >     private Text word = new Text();
> >     private Text doc = new Text();
> >     private long numRecords=0;
> >     private String inputFile;
> >
> >    public void configure(JobConf job){
> >         System.out.println("Configure function is called");
> >         inputFile = job.get("map.input.file");
> >         System.out.println("In conf the input file is"+inputFile);
> >     }
> >
> >
> >     public void map(LongWritable key, Text value,
> >                     OutputCollector<Text, Text> output,
> >                     Reporter reporter) throws IOException {
> >       String line = value.toString();
> >       StringTokenizer itr = new StringTokenizer(line);
> >       doc.set(inputFile);
> >       while (itr.hasMoreTokens()) {
> >         word.set(itr.nextToken());
> >         output.collect(word,doc);
> >       }
> >       if (++numRecords % 4 == 0) {
> >         System.out.println("Finished processing of input file " + inputFile);
> >       }
> >     }
> >   }
> >
> >   /**
> >    * A reducer class that emits, for each word, the list of document IDs it appears in.
> >    */
> >   public static class Reduce extends MapReduceBase
> >     implements Reducer<Text, Text, Text, DocIDs> {
> >
> >   // This works as K2, V2, K3, V3
> >     public void reduce(Text key, Iterator<Text> values,
> >                        OutputCollector<Text, DocIDs> output,
> >                        Reporter reporter) throws IOException {
> >       ArrayList<String> IDs = new ArrayList<String>();
> >
> >       // Collect every document ID seen for this word.
> >       while (values.hasNext()) {
> >         IDs.add(values.next().toString());
> >       }
> >       DocIDs dc = new DocIDs();
> >       dc.setListdocs(IDs);
> >       output.collect(key, dc);
> >     }
> >   }
> >
> >   public int run(String[] args) throws Exception {
> >     System.out.println("Run function is called");
> >     JobConf conf = new JobConf(getConf(), HadoopProgram.class);
> >     conf.setJobName("invertedindex");
> >
> >     // the keys are words (strings)
> >     conf.setOutputKeyClass(Text.class);
> >     // the map emits Text values, the reduce emits DocIDs values
> >     conf.setMapOutputValueClass(Text.class);
> >     conf.setOutputValueClass(DocIDs.class);
> >
> >     conf.setMapperClass(MapClass.class);
> >     conf.setReducerClass(Reduce.class);
> >
> >     // input and output paths come from the command line
> >     FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> >
> >     JobClient.runJob(conf);
> >     return 0;
> >   }
> > }
> >
> >
> > Now I am getting the output from the reducer as:
> > word \root\test\test123, \root\test12
> >
> > In the next stage I want to remove stop words, scrub words, etc., and I
> > would also like the position of each word in the document. How would I
> > apply multiple maps or multilevel MapReduce jobs programmatically? I
> > guess I need to make another class or add some functions to it? I am not
> > able to figure it out. Any pointers for this type of problem?
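> >
> > Is something like the following the right direction? (Only a rough
> > sketch: FilterMapper and the intermediate path "index-raw" are made-up
> > names, and job 1 would have to write its output in a form job 2 can
> > read, e.g. a SequenceFile.)
> >
> >   public int run(String[] args) throws Exception {
> >     // Job 1: build the raw inverted index (word -> list of doc IDs).
> >     JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
> >     indexJob.setJobName("build-index");
> >     indexJob.setMapperClass(MapClass.class);
> >     indexJob.setReducerClass(Reduce.class);
> >     indexJob.setOutputKeyClass(Text.class);
> >     indexJob.setMapOutputValueClass(Text.class);
> >     indexJob.setOutputValueClass(DocIDs.class);
> >     FileInputFormat.setInputPaths(indexJob, new Path(args[0]));
> >     FileOutputFormat.setOutputPath(indexJob, new Path("index-raw"));
> >     JobClient.runJob(indexJob);   // blocks until job 1 finishes
> >
> >     // Job 2: read job 1's output and drop stop words / scrub words.
> >     JobConf filterJob = new JobConf(getConf(), HadoopProgram.class);
> >     filterJob.setJobName("filter-index");
> >     filterJob.setMapperClass(FilterMapper.class);      // second-stage mapper, still to be written
> >     filterJob.setReducerClass(IdentityReducer.class);  // org.apache.hadoop.mapred.lib.IdentityReducer
> >     // (job 2's key/value classes and input format would also need to be set)
> >     FileInputFormat.setInputPaths(filterJob, new Path("index-raw"));
> >     FileOutputFormat.setOutputPath(filterJob, new Path(args[1]));
> >     JobClient.runJob(filterJob);
> >     return 0;
> >   }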
> >
> > Thanks,
> > Aayush
> >
> >
> > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> >
> >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> >>
> >>> Hi,
> >>> I am developing a simple inverted index program with Hadoop. My map
> >>> function has the output:
> >>> <word, doc>
> >>> and the reducer has:
> >>> <word, list(docs)>
> >>>
> >>> Now I want to use one more MapReduce job to remove stop and scrub words from
> >> Use distributed cache as Arun mentioned.
> >>> this output. Also in the next stage I would like to have a short summary
> >> Whether to use a separate MR job depends on what exactly you mean by
> >> summary. If it's like a window around the current word then you can
> >> possibly do it in one go.
> >> Amar
> >>> associated with every word. How should I design my program from this
> >>> stage? I mean how would I apply multiple MapReduce jobs to this? What
> >>> would be the better way to perform this?
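> >>
> >> A bare-bones illustration of that distributed cache idea (the HDFS path
> >> below is only an example and the stop-word file must already be there;
> >> it uses org.apache.hadoop.filecache.DistributedCache):
> >>
> >>   // Driver side, before submitting the job:
> >>   DistributedCache.addCacheFile(new Path("/user/aayush/stopwords.txt").toUri(), conf);
> >>
> >>   // Mapper side: load the list once per task.
> >>   private Set<String> stopWords = new HashSet<String>();
> >>
> >>   public void configure(JobConf job) {
> >>     try {
> >>       Path[] cached = DistributedCache.getLocalCacheFiles(job);
> >>       BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
> >>       String w;
> >>       while ((w = in.readLine()) != null) {
> >>         stopWords.add(w.trim());
> >>       }
> >>       in.close();
> >>     } catch (IOException e) {
> >>       throw new RuntimeException("Could not load stop-word file", e);
> >>     }
> >>   }
> >>
> >>   // Inside map(): skip tokens that are stop words.
> >>   String token = itr.nextToken();
> >>   if (!stopWords.contains(token)) {
> >>     word.set(token);
> >>     output.collect(word, doc);
> >>   }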
> >>>
> >>> Thanks,
> >>>
> >>> Regards,
> >>> -
> >>>
> >>>
> >>
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240
