Hello - there are no responses yet, and I don't know what R is, but you are interested in HTML parsing, specifically topic detection, so here are my thoughts.
We have done topic detection in our custom HTML parser, but in Nutch terms you would do it in a ParseFilter implementation. Get the extracted text (a problem on its own) and feed it, together with annotated data, into a model builder. Then use the resulting model in the ParseFilter to get the topic. In our case we used Mallet, and it produced decent results, although we needed a lot of code to support the whole process and to keep results stable between model iterations.

If R has a Java interface, the ParseFilter is the place to use it, because there you can feed the text into the model and get the topic back. If R has no Java interface, I would build a simple HTTP daemon around it and call it over HTTP, which is what I have done before.

This breaks the Hadoop principle of bringing the code to the data, but rules can be broken. On the other hand, topic models are usually very large because of the size of the vocabulary, so not shipping that data along with the code each time has its benefits too.

Regards,
M.

-----Original message-----
> From: Semyon Semyonov <semyon.semyo...@mail.com>
> Sent: Friday 3rd November 2017 16:59
> To: user@nutch.apache.org
> Subject: Nutch(plugins) and R
>
> Hello,
>
> I'm looking for a way to use R in Nutch, particularly HTML parser, but usage
> in the other parts can be interesting as well. For each parsed document I
> would like to run a script and provide the results back to the system e.g.
> topic detection of the document.
>
> NB I'm not looking for a way of scaling R to Hadoop or HDFS like Microsoft R
> server. This way uses Hadoop as an execution engine after the crawling
> process. In other words, first the computationally intensive full crawling
> after that another computationally intensive R/Hadoop process.
>
> Instead I'm looking for a way of calling R scripts directly from java code of
> map or reduce jobs. Any ideas how to make it? One way to do it is "Rserve -
> Binary R server", but I'm looking for alternatives, to compare efficiency.
>
> Semyon.
>
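
For illustration, here is a rough sketch of the HTTP daemon approach described in the reply, using only the JDK (Java 11 or later). Everything in it is an assumption made for the example: the class name TopicServiceSketch, the /topic path, and the keyword rule in classify(), which merely stands in for a real model such as one built with Mallet or served from R. It contains no Nutch code; in a real plugin the fetchTopic() call would sit inside the ParseFilter, with the extracted text as input.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TopicServiceSketch {

    // Hypothetical stand-in for the real topic model: a trivial keyword rule.
    static String classify(String text) {
        return text.toLowerCase().contains("hadoop") ? "bigdata" : "other";
    }

    // Starts a minimal HTTP daemon on a free local port.
    // POST /topic with the document text as the body returns the topic as plain text.
    static HttpServer startDaemon() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/topic", exchange -> {
            String text = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8);
            byte[] reply = classify(text).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(reply);
            }
        });
        server.start();
        return server;
    }

    // Client side: what a ParseFilter could do with the extracted text.
    static String fetchTopic(HttpClient client, URI endpoint, String text)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startDaemon();
        try {
            URI endpoint = URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/topic");
            HttpClient client = HttpClient.newHttpClient();
            // prints "bigdata"
            System.out.println(fetchTopic(client, endpoint, "Nutch runs on Hadoop"));
        } finally {
            server.stop(0);
        }
    }
}
```

In this arrangement the large model lives only in the daemon process, so the map or reduce tasks carry just a small HTTP client. The price is one network round trip per parsed document.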