cc list
-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 8th November 2017 0:15
> To: user@nutch.apache.org
> Subject: RE: Nutch(plugins) and R
>
> Hello - there are no responses yet, and I don't know what R is, but you
> are interested in HTML parsing, specifically topic detection, so here
> are my thoughts.
>
> We have done topic detection in our custom HTML parser, but in Nutch
> speak you would do it in a ParseFilter implementation. Get the extracted
> text - a problem on its own - and feed it, together with annotated data,
> into a model builder. Use the produced model in the ParseFilter to get
> the topic.
>
> In our case we used Mallet, and it produced decent results, although we
> needed a lot of code to facilitate the whole thing and to keep results
> stable between model iterations.
>
> If R has a Java interface, the ParseFilter is the place to be, because
> there you can feed the text into the model and get the topic back.
>
> If R is not Java, I would - and have done so - build a simple HTTP
> daemon around it and call it over HTTP. That breaks the Hadoop principle
> of bringing code to the data, but rules can be broken. On the other
> hand, topic models are usually very large due to the size of the
> vocabulary, so not shipping the model along with the code each time has
> its benefits too.
>
> Regards,
> M.
>
> -----Original message-----
> > From: Semyon Semyonov <semyon.semyo...@mail.com>
> > Sent: Friday 3rd November 2017 16:59
> > To: user@nutch.apache.org
> > Subject: Nutch(plugins) and R
> >
> > Hello,
> >
> > I'm looking for a way to use R in Nutch, particularly in the HTML
> > parser, though usage in other parts could be interesting as well. For
> > each parsed document I would like to run a script and feed the results
> > back into the system, e.g. the detected topic of the document.
> >
> > NB: I'm not looking for a way to scale R to Hadoop or HDFS, as
> > Microsoft R Server does. That approach uses Hadoop as an execution
> > engine after the crawling process; in other words, first the
> > computationally intensive full crawl, then another computationally
> > intensive R/Hadoop process.
> >
> > Instead, I'm looking for a way to call R scripts directly from the
> > Java code of the map or reduce jobs. Any ideas how to do that? One
> > option is Rserve, the binary R server, but I'm looking for
> > alternatives in order to compare efficiency.
> >
> > Semyon.
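
Some sketches to make the options above concrete. First, Markus's
ParseFilter suggestion, written against the Nutch 1.x HtmlParseFilter
extension point (the corresponding interface is called ParseFilter in
Nutch 2.x). The TopicModel class and the topic.model.path property are
hypothetical stand-ins for whatever classifier gets plugged in (Mallet,
an R bridge, ...); only the Nutch interfaces are real.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class TopicParseFilter implements HtmlParseFilter {

      private Configuration conf;
      private TopicModel model;  // hypothetical wrapper around the classifier

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        String text = parse.getText();       // the extracted text Markus mentions

        String topic = model.topicOf(text);  // hypothetical model call
        parse.getData().getParseMeta().add("topic", topic);
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // hypothetical property; load the model once per JVM, not per document
        this.model = TopicModel.load(conf.get("topic.model.path"));
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

As with any Nutch plugin, the filter still needs a plugin.xml descriptor
and an entry in the plugin.includes property to be picked up.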
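
For the Mallet route, inference inside such a filter could look roughly
like this. It is a sketch under the assumption that a ParallelTopicModel
was trained and serialized elsewhere, and that the Pipe used at training
time is available so new documents get the same preprocessing; the
sampling parameters are placeholders to tune.

    import java.io.File;

    import cc.mallet.pipe.Pipe;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicInferencer;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;

    // Sketch only: wraps a pre-trained, serialized Mallet topic model.
    public class MalletTopicModel {

      private final TopicInferencer inferencer;
      private final Pipe pipe;

      public MalletTopicModel(File modelFile, Pipe trainingPipe) throws Exception {
        ParallelTopicModel model = ParallelTopicModel.read(modelFile);
        this.inferencer = model.getInferencer();
        this.pipe = trainingPipe;  // reuse the training-time preprocessing
      }

      // Returns the index of the most probable topic for the given text.
      public int topicOf(String text) {
        InstanceList docs = new InstanceList(pipe);
        docs.addThruPipe(new Instance(text, null, "doc", null));
        // 100 sampling iterations, thinning 10, burn-in 10 - placeholders
        double[] dist = inferencer.getSampledDistribution(docs.get(0), 100, 10, 10);
        int best = 0;
        for (int k = 1; k < dist.length; k++) {
          if (dist[k] > dist[best]) best = k;
        }
        return best;
      }
    }

Keeping the vocabulary and pipe fixed between model iterations is part
of the stability work Markus alludes to.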
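
If R stays outside the JVM, the HTTP-daemon approach needs only a small
client on the Nutch side. Here is a sketch with plain HttpURLConnection
(no extra dependencies); the endpoint URL and the convention that the
daemon answers with a bare topic label are assumptions.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    // Sketch only: posts extracted text to an R daemon and reads the
    // topic label back. The endpoint and protocol are assumptions.
    public class RTopicHttpClient {

      private final URL endpoint;

      public RTopicHttpClient(String url) throws Exception {
        this.endpoint = new URL(url);  // e.g. http://localhost:8000/topic (assumed)
      }

      public String topicOf(String text) throws Exception {
        HttpURLConnection con = (HttpURLConnection) endpoint.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
        con.setConnectTimeout(2000);   // fail fast so a slow daemon
        con.setReadTimeout(5000);      // cannot stall the parse tasks

        try (OutputStream out = con.getOutputStream()) {
          out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = con.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
          return s.hasNext() ? s.next().trim() : "";
        } finally {
          con.disconnect();
        }
      }
    }

The timeouts matter: every parsed document makes a round trip, so a
hung daemon would otherwise block the whole parse job.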
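
For comparison, the Rserve alternative from Semyon's question, using
the Rserve Java client (org.rosuda.REngine). The classify_topic()
function is hypothetical - it stands for whatever R code loads the
model and scores a document in the Rserve session.

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;

    // Sketch only: assumes a running Rserve daemon whose session defines
    // a hypothetical classify_topic(doc) function around the real model.
    public class RserveTopicClient {

      public String topicOf(String text) throws Exception {
        RConnection c = new RConnection();  // localhost:6311 by default
        try {
          c.assign("doc", text);            // ship the text into the R session
          REXP result = c.eval("classify_topic(doc)");
          return result.asString();
        } finally {
          c.close();
        }
      }
    }

Opening an RConnection per document is expensive; in a real ParseFilter
you would open it once in setConf (or pool connections) and handle the
daemon being down, which is one axis for the efficiency comparison
Semyon has in mind.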