Hi Carlos,

Sorry, I wrote that one and a half years ago, and it was in a company codebase.
But the basic procedure is simple.
First you upload your model to HDFS; then, before you run the job, you do this:
DistributedCache.addCacheFile(new URI(YourModelFilePath), jobConf);
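
Roughly, the full driver-side setup might look like the sketch below. ModelMapper,
the job name, the HDFS path, and the input/output arguments are just placeholders,
and I assume the model file has already been uploaded to HDFS (e.g. with
hadoop fs -put). The "modelPath" property is set so the mapper can find the right
cached file later:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ModelJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(ModelJobDriver.class);
        jobConf.setJobName("model-scoring");

        // Remember the model location so the mapper can pick out the right
        // file from the distributed cache.
        jobConf.set("modelPath", "/user/me/models/model.json");

        // Ship the model (already in HDFS) to the local disk of every task node.
        DistributedCache.addCacheFile(new URI("/user/me/models/model.json"), jobConf);

        jobConf.setMapperClass(ModelMapper.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
      }
    }
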
Then, in the configure method of your mapper, you write something like this:

      public void configure(JobConf job) {
        super.configure(job);

        try {
            // Files added with DistributedCache.addCacheFile() show up here
            // as paths on the local disk of the task node.
            Path[] localFiles = DistributedCache.getLocalCacheFiles(job);

            if (localFiles != null) {
              String metadataFileName = "";

              // Find the cached copy that corresponds to the model path we
              // put into the job configuration ("modelPath").
              for (int i = 0; i < localFiles.length; i++) {
                String strFileName = localFiles[i].toString();
                if (strFileName.contains(job.get("modelPath"))) {
                  metadataFileName = strFileName;
                  break;
                }
              }

              if (metadataFileName.length() > 0) {
                // readHashSet() is our own helper that parses the model file
                // into the JSONObject.
                json_model = new JSONObject();
                readHashSet("file:///" + metadataFileName, json_model);
              }
              System.out.println("*********" + metadataFileName);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
      }

json_model should be a static variable inside your mapper. Be aware that this
is just converted from some pseudocode.
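
For completeness, the surrounding mapper would roughly look like the sketch below.
ModelMapper is just a placeholder name, org.json's JSONObject is only my guess at
the JSON library, readHashSet (referenced in the configure code above) is our own
helper for parsing the model file, and the map() body is a stub:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.json.JSONObject;

    public class ModelMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Loaded once per task JVM in configure() and shared by every map() call.
      private static JSONObject json_model;

      public void configure(JobConf job) {
        // ... the configure() body shown above goes here ...
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Stub: apply json_model to each input record and emit results.
      }
    }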

Sheng

> Date: Tue, 12 Jun 2012 17:52:34 -0600
> Subject: Re: openNLP with Hadoop MapReduce Programming
> From: [email protected]
> To: [email protected]
> 
> Sheng,
> 
> Can you provide an example?
> 
> Thanks,
> 
> Carlos.
> 
> On Thu, Jun 7, 2012 at 6:37 PM, Sheng Guo <[email protected]> wrote:
> 
> >
> > I think distributed cache is a good way to do this.
> > I did some similar work on Stanford parser model loading in Hadoop
> > using distributed cache.
> > I think that will solve the problem. But we should be careful because the
> > Hadoop system is normally data-intensive, and NLP handling there may cause
> > high CPU usage and problems for other jobs.
> >
> > Sheng
> >
> > > Date: Thu, 7 Jun 2012 20:17:26 -0400
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: Re: openNLP with Hadoop MapReduce Programming
> > >
> > > Hadoop seems to be a large scale project; so, the work would be spread
> > > across many servers / clients to perform the work.  The map reduce would
> > > allow all the processes across many servers to be done and then
> > > synchronized to provide the final results.  So, each process would have
> > > to load its own model.  The file system using HDFS should allow sharing
> > > of the models and large data collection between them all.
> > >
> > > On 6/7/2012 3:45 AM, Jörn Kottmann wrote:
> > > > On 06/07/2012 05:39 AM, James Kosin wrote:
> > > >> Hmm, good idea.  I'll have to try that soon... I do create models for
> > > >> my project and have them included in the JAR... but, haven't gotten
> > > >> around to testing with them embedded in the JAR file.  I know there
> > > >> will be issues with this and it is usually best to keep them in either
> > > >> windows or linux file system.
> > > >> Jorn has the start of supporting the web-server side; but, I know it
> > > >> is far from complete... he still has this marked as a TODO for the
> > > >> interface.  Unless I'm a bit behind now.
> > > >
> > > > I usually load my models from an http server, because
> > > > they are getting updated much more frequently than
> > > > my jars, but if you use map reduce you will need to do
> > > > the loading yourself (very easy in java).
> > > >
> > > > Just including a model in a jar works great and many
> > > > people actually do that.
> > > >
> > > > If you have many threads and you want to share the models
> > > > between them, I am not sure how this is done in map reduce.
> > > >
> > > > Jörn
> > >
> > >
> >
> >