Hi Jasimop,

Some initial thoughts of mine are the following:

>Is your situation identical to the example you provided, where every page you crawl is of the format <html><body><paragraph>...</paragraph>...<paragraph>...</paragraph></body></html>? If this is the case, then it looks like you require some sort of plugin similar to what Andrzej suggested here [1]; however, the HtmlParseFilter plugin you implement will need to invoke an action after each </paragraph> to initiate the indexing of that section.
>If, on the other hand, your web page is not exactly as in your example, e.g. there is lots of clutter around your required textual content, then you are looking solely at extending the functionality discussed here [2] to include the indexing step discussed above.

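To make the first case a bit more concrete, here is a rough sketch of the "one section per <paragraph>" step. It uses only the JDK DOM classes so that it can be run outside Nutch. The class name ParagraphSections and its methods are made up for illustration and are not part of Nutch. As far as I know, an HtmlParseFilter in Nutch 1.x is handed the parsed page as an org.w3c.dom.DocumentFragment, so a walker along these lines could be called from the filter; how each section is then passed on for indexing is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class.
final class ParagraphSections {

    /** Returns the trimmed text of every <paragraph> element below root, in document order. */
    static List<String> sections(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        // Compare case-insensitively, since HTML parsers may upper-case element names.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "paragraph".equalsIgnoreCase(node.getNodeName())) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text); // this is the point "after </paragraph>" where a section is complete
            }
            return; // nested content is already included in the text above
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    /** Convenience for experiments outside Nutch: parses well-formed markup with the JDK XML parser. */
    static List<String> sections(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return sections(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><paragraph>first</paragraph>"
                + "<paragraph>second</paragraph></body></html>";
        for (String section : sections(page)) {
            System.out.println(section);
        }
    }
}
```

Anything outside a <paragraph> element is simply skipped, which is only good enough for the clean case; the cluttered case would still need something like the extraction discussed in [2].
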
I am sorry I cannot be of more help just now. I have still to familiarise myself with boilerpipe.

[1] http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
[2] https://issues.apache.org/jira/browse/NUTCH-961

> Can anyone guide me into the right direction? Where should I start to
> search? Classes, wikis, homepages, books?
> Nutch does a great job for what I need it now, but I think it lacks a bit
> of documentation, especially when it comes to plugin development.

Yes, I do agree with you here to an extent. It would appear that fewer users have been contributing their knowledge to this section of our wiki. There appears to be a wealth of info relating to legacy stuff on the wiki though! It would be nice to see more examples of good practice and use cases in the future.

> How would a bare-bones plugin look like?

Maybe someone can elaborate on the above, or possibly correct me if my advice is off track.

--
*Lewis*

