Hello - there are no responses yet, and I don't know R, but you are 
interested in HTML parsing, specifically topic detection, so here are my 
thoughts.

We have done topic detection in our custom HTML parser, but in Nutch speak we 
would do it in a ParseFilter implementation. Get the extracted text - a problem 
on its own - and feed it, together with annotated data, into a model builder. 
Use the produced model in the ParseFilter to get the topic.
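
To make that flow concrete, here is a rough sketch in plain Java. It does not 
use the real Nutch classes: TopicModel, KeywordModel, the filter() helper and 
the "topic" metadata key are all hypothetical stand-ins, and the keyword 
counter only takes the place of a properly trained model. In a real plugin the 
same steps would sit behind Nutch's parse filter extension point.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TopicFilterSketch {

    /** Hypothetical stand-in for a trained model (Mallet, R, ...). */
    interface TopicModel {
        String classify(String text);
    }

    /** Toy model: returns the topic whose keywords occur most often. */
    static class KeywordModel implements TopicModel {
        private final Map<String, String> keywordToTopic;

        KeywordModel(Map<String, String> keywordToTopic) {
            this.keywordToTopic = keywordToTopic;
        }

        @Override
        public String classify(String text) {
            // Count keyword hits per topic over the lower-cased tokens.
            Map<String, Integer> counts = new HashMap<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                String topic = keywordToTopic.get(token);
                if (topic != null) {
                    counts.merge(topic, 1, Integer::sum);
                }
            }
            // Pick the topic with the highest count, or "unknown" if none matched.
            String best = "unknown";
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }
    }

    /**
     * Mirrors what the parse filter would do: take the extracted text,
     * ask the model, and store the topic in the parse metadata.
     */
    static Map<String, String> filter(String extractedText,
                                      Map<String, String> parseMetadata,
                                      TopicModel model) {
        parseMetadata.put("topic", model.classify(extractedText));
        return parseMetadata;
    }

    public static void main(String[] args) {
        Map<String, String> keywords = new HashMap<>();
        keywords.put("crawler", "search");
        keywords.put("index", "search");
        keywords.put("goal", "sports");

        TopicModel model = new KeywordModel(keywords);
        Map<String, String> meta =
            filter("The crawler feeds the index.", new HashMap<>(), model);
        System.out.println(meta.get("topic"));
    }
}
```

The point of the interface is that the filter does not care where the model 
comes from, so a Mallet-backed or R-backed implementation could be swapped in 
without touching the filter itself.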

In our case we used Mallet, and it produced decent results, although we needed 
lots of code to facilitate the whole thing and to keep results stable between 
model iterations.

If R has a Java interface, the ParseFilter is the place to use it, because 
there you can feed the text into the model and get the topic back.

If R cannot be called from Java, I would - and have done - build a simple HTTP 
daemon around it, and call it over HTTP. It breaks the Hadoop principle of 
bringing code to data, but rules can be broken. Besides, topic models are 
usually very large due to the size of the vocabulary, so not shipping the model 
with the code each time has its benefits too.
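
As a minimal sketch of the calling side, assuming a daemon that accepts the 
document text as a plain-text POST body and answers with the topic as plain 
text: the class name, the endpoint URL and that tiny protocol are my 
assumptions for illustration, not something R or Nutch prescribes. It uses 
only the JDK's HttpURLConnection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.Scanner;

public class TopicHttpClient {

    /**
     * Posts the text to the (hypothetical) topic daemon and returns the
     * topic it answers with, or an empty Optional if the call fails.
     */
    static Optional<String> classify(String endpoint, String text) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(endpoint).openConnection();
            // Short timeouts so a slow daemon cannot stall the parse job.
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(5000);
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");

            try (OutputStream out = conn.getOutputStream()) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }

            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return Optional.empty();
            }

            // Read the whole response body as one string.
            try (InputStream in = conn.getInputStream();
                 Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                String body = scanner.hasNext() ? scanner.next().trim() : "";
                return body.isEmpty() ? Optional.empty() : Optional.of(body);
            }
        } catch (IOException e) {
            // Treat an unreachable daemon or a bad URL as "no topic".
            return Optional.empty();
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; adjust to wherever the daemon listens.
        Optional<String> topic =
            classify("http://localhost:8080/topic", "Some extracted page text");
        System.out.println(topic.orElse("unknown"));
    }
}
```

Returning an empty Optional instead of throwing is a deliberate choice here: 
a document that cannot be classified still gets parsed and indexed, just 
without a topic.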

Regards,
M.
 
-----Original message-----
> From:Semyon Semyonov <semyon.semyo...@mail.com>
> Sent: Friday 3rd November 2017 16:59
> To: user@nutch.apache.org
> Subject: Nutch(plugins) and R
> 
> Hello,
> 
> I'm looking for a way to use R in Nutch, particularly in the HTML parser, but 
> usage in the other parts can be interesting as well. For each parsed document 
> I would like to run a script and provide the results back to the system, e.g. 
> topic detection of the document.
>  
> NB I'm not looking for a way of scaling R to Hadoop or HDFS like Microsoft R 
> server. That approach uses Hadoop as an execution engine after the crawling 
> process. In other words, first the computationally intensive full crawl, and 
> after that another computationally intensive R/Hadoop process.
>  
> Instead I'm looking for a way of calling R scripts directly from the Java code 
> of map or reduce jobs. Any ideas how to do it? One way to do it is "Rserve - 
> Binary R server", but I'm looking for alternatives, to compare efficiency.
> 
> Semyon.
> 
