Hi,

I have never used a Lucene index before, and I do not understand how to build
one with Hadoop Map/Reduce. What I was really looking for is how to implement
multi-level (chained) map/reduce jobs for the problem I described.
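
(To make the question concrete: the pattern I am asking about seems to be
running one JobConf-based job, waiting for it with JobClient.runJob(), and
then running a second job whose input is the first job's output directory.
Below is a rough, untested sketch of what I have in mind, NOT working code.
IdentityMapper/IdentityReducer and all paths are placeholders; each stage
would use its own mapper/reducer, e.g. the MapClass/Reduce quoted further
down, with matching key/value classes.)

// Rough sketch only -- placeholders everywhere, not code from my program.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);        // raw documents
    Path intermediate = new Path(args[1]); // output of job 1, input of job 2
    Path output = new Path(args[2]);       // final output

    // Job 1: e.g. build the raw inverted index (word -> doc ids).
    JobConf job1 = new JobConf(ChainedJobsDriver.class);
    job1.setJobName("stage-1");
    // With the default TextInputFormat the identity mapper passes through
    // <LongWritable, Text>; a real index job would use Text/Text instead.
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    job1.setMapperClass(IdentityMapper.class);    // placeholder
    job1.setReducerClass(IdentityReducer.class);  // placeholder
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    JobClient.runJob(job1);                       // blocks until job 1 is done

    // Job 2: read job 1's output and do the next step
    // (stop words, word positions, ...).
    JobConf job2 = new JobConf(ChainedJobsDriver.class);
    job2.setJobName("stage-2");
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    job2.setMapperClass(IdentityMapper.class);    // placeholder
    job2.setReducerClass(IdentityReducer.class);  // placeholder
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    JobClient.runJob(job2);
  }
}

Is this the right pattern for chaining jobs, or is there a better way?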


On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:

> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>
> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> > No, currently my requirement is to solve this problem with Apache Hadoop. I
> > am trying to build up this type of inverted index and then measure
> > performance criteria with respect to others.
> >
> > Thanks,
> >
> >
> > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > Are you implementing this for instruction or production?
> > >
> > > If production, why not use Lucene?
> > >
> > >
> > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Amar, Theodore, Arun,
> > > >
> > > > Thanks for your replies. Actually I am new to Hadoop, so I can't figure
> > > > out much. I have written the following code for an inverted index. This
> > > > code maps each word from a document to its document id, e.g.:
> > > > apple file1 file123
> > > > The main functions of the code are:
> > > >
> > > > public class HadoopProgram extends Configured implements Tool {
> > > > public static class MapClass extends MapReduceBase
> > > >     implements Mapper<LongWritable, Text, Text, Text> {
> > > >
> > > >     private final static IntWritable one = new IntWritable(1);
> > > >     private Text word = new Text();
> > > >     private Text doc = new Text();
> > > >     private long numRecords=0;
> > > >     private String inputFile;
> > > >
> > > >    public void configure(JobConf job){
> > > >         System.out.println("Configure function is called");
> > > >         inputFile = job.get("map.input.file");
> > > >         System.out.println("In conf the input file is"+inputFile);
> > > >     }
> > > >
> > > >
> > > >     public void map(LongWritable key, Text value,
> > > >                     OutputCollector<Text, Text> output,
> > > >                     Reporter reporter) throws IOException {
> > > >       String line = value.toString();
> > > >       StringTokenizer itr = new StringTokenizer(line);
> > > >       doc.set(inputFile);
> > > >       while (itr.hasMoreTokens()) {
> > > >         word.set(itr.nextToken());
> > > >         output.collect(word,doc);
> > > >       }
> > > >       if(++numRecords%4==0){
> > > >         System.out.println("Finished processing of input file"+inputFile);
> > > >       }
> > > >     }
> > > >   }
> > > >
> > > >   /**
> > > >    * A reducer class that collects the list of document ids for each word.
> > > >    */
> > > >   public static class Reduce extends MapReduceBase
> > > >     implements Reducer<Text, Text, Text, DocIDs> {
> > > >
> > > >   // This works as K2, V2, K3, V3
> > > >     public void reduce(Text key, Iterator<Text> values,
> > > >                        OutputCollector<Text, DocIDs> output,
> > > >                        Reporter reporter) throws IOException {
> > > >       int sum = 0;
> > > >       Text dummy = new Text();
> > > >       ArrayList<String> IDs = new ArrayList<String>();
> > > >       String str;
> > > >
> > > >       while (values.hasNext()) {
> > > >          dummy = values.next();
> > > >          str = dummy.toString();
> > > >          IDs.add(str);
> > > >        }
> > > >        DocIDs dc = new DocIDs();
> > > >        dc.setListdocs(IDs);
> > > >       output.collect(key,dc);
> > > >     }
> > > >   }
> > > >
> > > >  public int run(String[] args) throws Exception {
> > > >   System.out.println("Run function is called");
> > > >     JobConf conf = new JobConf(getConf(), WordCount.class);
> > > >     conf.setJobName("wordcount");
> > > >
> > > >     // the keys are words (strings)
> > > >     conf.setOutputKeyClass(Text.class);
> > > >
> > > >     conf.setOutputValueClass(Text.class);
> > > >
> > > >
> > > >     conf.setMapperClass(MapClass.class);
> > > >
> > > >     conf.setReducerClass(Reduce.class);
> > > > }
> > > >
> > > >
> > > > Now I am getting the output from the reducer as:
> > > > word \root\test\test123, \root\test12
> > > >
> > > > In the next stage I want to remove stop words, scrub words, etc., and
> > > > also record the position of each word in the document. How would I apply
> > > > multiple maps or multilevel map/reduce jobs programmatically? I guess I
> > > > need to make another class or add some functions to it? I am not able to
> > > > figure it out. Any pointers for this type of problem?
> > > >
> > > > Thanks,
> > > > Aayush
> > > >
> > > >
> > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > > >>
> > > >>> Hi,
> > > >>> I am developing a simple inverted index program with Hadoop. My map
> > > >>> function has the output:
> > > >>> <word, doc>
> > > >>> and the reducer has:
> > > >>> <word, list(docs)>
> > > >>>
> > > >>> Now I want to use one more mapreduce to remove stop and scrub words from
> > > >> Use distributed cache as Arun mentioned.
> > > >>> this output. Also in the next stage I would like to have a short summary
> > > >> Whether to use a separate MR job depends on what exactly you mean by
> > > >> summary. If it's like a window around the current word then you can
> > > >> possibly do it in one go.
> > > >> Amar
> > > >>> associated with every word. How should I design my program from this
> > > >>> stage? I mean how would I apply multiple mapreduce jobs to this? What
> > > >>> would be the better way to perform this?
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Regards,
> > > >>> -
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
> >
> > --
> > Aayush Garg,
> > Phone: +41 76 482 240
> >
>
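
Also, regarding the distributed cache that Arun and Amar suggested further up
in the thread for the stop-word step: is the idea roughly the sketch below?
The stop-word file, the class name StopWordMapper and the lower-casing are
just my guesses, not tested code. The driver would register the file with
DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf),
and the mapper would load it in configure().

// Untested sketch of the stop-word idea -- names and paths are placeholders.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();
  private final Text word = new Text();
  private final Text doc = new Text();
  private String inputFile;

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
    try {
      // Files registered in the driver via DistributedCache.addCacheFile()
      // show up here as local paths on each task node.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null) {
        for (Path p : cached) {
          BufferedReader in = new BufferedReader(new FileReader(p.toString()));
          String line;
          while ((line = in.readLine()) != null) {
            stopWords.add(line.trim().toLowerCase());
          }
          in.close();
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not read cached stop-word file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    doc.set(inputFile);
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken().toLowerCase();
      if (!stopWords.contains(token)) {   // drop stop words, keep the rest
        word.set(token);
        output.collect(word, doc);
      }
    }
  }
}

Does that match what you had in mind?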



-- 
Aayush Garg,
Phone: +41 76 482 240
