[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047122#comment-13047122
 ] 

Gabriele Kahlout commented on NUTCH-961:
----------------------------------------

BTW, have you considered a more general patch that supports (rather than 
exposes) all of Tika's options? I'm thinking that, for the sake of code 
maintainability, perhaps no Boilerpipe-specific support should be exposed at 
the Nutch level, only the ability to pass parameters to Tika. At the Nutch 
level one would set properties in nutch-site.xml (or even a tika-site.xml), 
and the Tika-delegating parser plugin would forward them to Tika.
There should then be no need for Boilerpipe-specific testing, for example, 
only Tika integration testing.
I'm just thinking out loud (w/o any patch).
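
To make the idea a bit more concrete, here is a rough sketch of the forwarding 
step only, not a patch. It assumes a hypothetical "tika." prefix convention for 
properties set in nutch-site.xml, and it uses a plain Map in place of the real 
Nutch configuration object. The class name, the method name and the 
"tika.boilerpipe.enabled" key are made up for illustration; how the resulting 
options would actually be handed to Tika is left open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TikaOptionForwarder {

    // Hypothetical convention: every property starting with this prefix
    // is meant for Tika rather than for Nutch itself.
    public static final String PREFIX = "tika.";

    /**
     * Collects all properties carrying the prefix and strips the prefix,
     * so the parser plugin can pass them on without knowing what they mean.
     */
    public static Map<String, String> extractTikaOptions(Map<String, String> nutchConf) {
        Map<String, String> tikaOptions = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : nutchConf.entrySet()) {
            String key = entry.getKey();
            // Skip keys without the prefix and the bare prefix itself.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                tikaOptions.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return tikaOptions;
    }

    public static void main(String[] args) {
        // Stand-in for the properties loaded from nutch-site.xml.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("tika.boilerpipe.enabled", "true"); // hypothetical key
        conf.put("http.agent.name", "my-crawler");   // not forwarded
        System.out.println(extractTikaOptions(conf));
    }
}
```

With something like this, Nutch would carry no Boilerpipe-specific code at all: 
a new Tika option would need only a new property, and the tests would cover 
the forwarding rather than each individual option.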

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
