cc list
-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 8th November 2017 0:15
> To: user@nutch.apache.org
> Subject: RE: Nutch(plugins) and R
>
> Hello - there are no responses yet, and I don't know what R is, but you
> are interested in HTML parsing, specifically topic detection, so here
> are my thoughts.
>
> We have done topic detection in our custom HTML parser, but in Nutch
> speak you would do it in a ParseFilter implementation. Get the extracted
> text - a problem on its own - and feed it, together with annotated data,
> into a model builder. Use the produced model in the ParseFilter to get
> the topic.
>
> In our case we used Mallet, and it produced decent results, although we
> needed a lot of code to facilitate the whole thing and to keep results
> stable between model iterations.
>
> If R has a Java interface, the ParseFilter is the place to be, because
> there you can feed the text into the model and get the topic back.
>
> If R is not Java, I would - and have done so - build a simple HTTP
> daemon around it and call it over HTTP. That breaks the Hadoop principle
> of bringing code to the data, but rules can be broken. On the other
> hand, topic models are usually very large due to the size of the
> vocabulary, so not shipping the model along with the code each time has
> its benefits too.
>
> Regards,
> M.
>
> -----Original message-----
> > From: Semyon Semyonov <semyon.semyo...@mail.com>
> > Sent: Friday 3rd November 2017 16:59
> > To: user@nutch.apache.org
> > Subject: Nutch(plugins) and R
> >
> > Hello,
> >
> > I'm looking for a way to use R in Nutch, particularly in the HTML
> > parser, though usage in other parts could be interesting as well. For
> > each parsed document I would like to run a script and feed the results
> > back into the system, e.g. the detected topic of the document.
> >
> > NB: I'm not looking for a way to scale R to Hadoop or HDFS, as
> > Microsoft R Server does. That approach uses Hadoop as an execution
> > engine after the crawling process; in other words, first the
> > computationally intensive full crawl, then another computationally
> > intensive R/Hadoop process.
> >
> > Instead, I'm looking for a way to call R scripts directly from the
> > Java code of the map or reduce jobs. Any ideas how to do that? One
> > option is Rserve, the binary R server, but I'm looking for
> > alternatives in order to compare efficiency.
> >
> > Semyon.
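
Some sketches to make the options above concrete. First, Markus's
ParseFilter suggestion, written against the Nutch 1.x HtmlParseFilter
extension point (the corresponding interface is called ParseFilter in
Nutch 2.x). The TopicModel class and the topic.model.path property are
hypothetical stand-ins for whatever classifier gets plugged in (Mallet,
an R bridge, ...); only the Nutch interfaces are real.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class TopicParseFilter implements HtmlParseFilter {

      private Configuration conf;
      private TopicModel model;  // hypothetical wrapper around the classifier

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        String text = parse.getText();       // the extracted text Markus mentions

        String topic = model.topicOf(text);  // hypothetical model call
        parse.getData().getParseMeta().add("topic", topic);
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // hypothetical property; load the model once per JVM, not per document
        this.model = TopicModel.load(conf.get("topic.model.path"));
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

As with any Nutch plugin, the filter still needs a plugin.xml descriptor
and an entry in the plugin.includes property to be picked up.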
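
For the Mallet route, inference inside such a filter could look roughly
like this. It is a sketch under the assumption that a ParallelTopicModel
was trained and serialized elsewhere, and that the Pipe used at training
time is available so new documents get the same preprocessing; the
sampling parameters are placeholders to tune.

    import java.io.File;

    import cc.mallet.pipe.Pipe;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicInferencer;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;

    // Sketch only: wraps a pre-trained, serialized Mallet topic model.
    public class MalletTopicModel {

      private final TopicInferencer inferencer;
      private final Pipe pipe;

      public MalletTopicModel(File modelFile, Pipe trainingPipe) throws Exception {
        ParallelTopicModel model = ParallelTopicModel.read(modelFile);
        this.inferencer = model.getInferencer();
        this.pipe = trainingPipe;  // reuse the training-time preprocessing
      }

      // Returns the index of the most probable topic for the given text.
      public int topicOf(String text) {
        InstanceList docs = new InstanceList(pipe);
        docs.addThruPipe(new Instance(text, null, "doc", null));
        // 100 sampling iterations, thinning 10, burn-in 10 - placeholders
        double[] dist = inferencer.getSampledDistribution(docs.get(0), 100, 10, 10);
        int best = 0;
        for (int k = 1; k < dist.length; k++) {
          if (dist[k] > dist[best]) best = k;
        }
        return best;
      }
    }

Keeping the vocabulary and pipe fixed between model iterations is part
of the stability work Markus alludes to.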
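
If R stays outside the JVM, the HTTP-daemon approach needs only a small
client on the Nutch side. Here is a sketch with plain HttpURLConnection
(no extra dependencies); the endpoint URL and the convention that the
daemon answers with a bare topic label are assumptions.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    // Sketch only: posts extracted text to an R daemon and reads the
    // topic label back. The endpoint and protocol are assumptions.
    public class RTopicHttpClient {

      private final URL endpoint;

      public RTopicHttpClient(String url) throws Exception {
        this.endpoint = new URL(url);  // e.g. http://localhost:8000/topic (assumed)
      }

      public String topicOf(String text) throws Exception {
        HttpURLConnection con = (HttpURLConnection) endpoint.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
        con.setConnectTimeout(2000);   // fail fast so a slow daemon
        con.setReadTimeout(5000);      // cannot stall the parse tasks

        try (OutputStream out = con.getOutputStream()) {
          out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = con.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
          return s.hasNext() ? s.next().trim() : "";
        } finally {
          con.disconnect();
        }
      }
    }

The timeouts matter: every parsed document makes a round trip, so a
hung daemon would otherwise block the whole parse job.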
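
For comparison, the Rserve alternative from Semyon's question, using
the Rserve Java client (org.rosuda.REngine). The classify_topic()
function is hypothetical - it stands for whatever R code loads the
model and scores a document in the Rserve session.

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;

    // Sketch only: assumes a running Rserve daemon whose session defines
    // a hypothetical classify_topic(doc) function around the real model.
    public class RserveTopicClient {

      public String topicOf(String text) throws Exception {
        RConnection c = new RConnection();  // localhost:6311 by default
        try {
          c.assign("doc", text);            // ship the text into the R session
          REXP result = c.eval("classify_topic(doc)");
          return result.asString();
        } finally {
          c.close();
        }
      }
    }

Opening an RConnection per document is expensive; in a real ParseFilter
you would open it once in setConf (or pool connections) and handle the
daemon being down, which is one axis for the efficiency comparison
Semyon has in mind.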