On 28.10.10 13.01, Andrzej Bialecki wrote:

Hopefully this kind of functionality will be improved soon through the
use of the Boilerpipe functionality in Tika - but this is still to be
integrated into both Tika and Nutch.

Interesting. I also missed this kind of functionality in Nutch, but I found a workaround by adding a few lines in DOMContentUtils.java.

I understand that Tika will replace parse-html, and therefore I wasn't sure whether I should open a Jira issue about this missing functionality. Anyway, I have found a simple way to stop parsing some content in a web site, for example content between HTML comments (<!-- stopindex --> .... <!-- startindex -->) and inside certain div tags.

Andrew, I can send you the lines I added so it will be easier for you to write your filter. Or, if you're using parse-html, you can simply add the lines to DOMContentUtils and adapt the code.
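
To give a rough idea of the approach, here is a minimal, self-contained sketch. It is not the actual change to DOMContentUtils.java; it only illustrates the technique on a plain org.w3c.dom tree. The class name StopIndexSketch, its methods, and the "noindex" div class are hypothetical names chosen for this example, and the input is assumed to be well-formed XHTML so the JDK parser can read it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while honouring <!-- stopindex --> / <!-- startindex --> comment markers.
class StopIndexSketch {

    // True while we are between a stopindex and a startindex comment.
    private boolean skipping = false;

    // Returns the text of the tree under root, minus the excluded parts.
    String extract(Node root) {
        StringBuilder sb = new StringBuilder();
        walk(root, sb);
        return sb.toString();
    }

    private void walk(Node node, StringBuilder sb) {
        short type = node.getNodeType();

        if (type == Node.COMMENT_NODE) {
            // The comment markers toggle the skipping state.
            String marker = node.getNodeValue().trim();
            if ("stopindex".equalsIgnoreCase(marker)) {
                skipping = true;
            } else if ("startindex".equalsIgnoreCase(marker)) {
                skipping = false;
            }
            return;
        }

        if (type == Node.TEXT_NODE) {
            if (!skipping) {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(text);
                }
            }
            return;
        }

        if (type == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Assumption: divs to be ignored carry class="noindex".
            if ("div".equalsIgnoreCase(el.getTagName())
                    && "noindex".equals(el.getAttribute("class"))) {
                return; // skip the whole subtree
            }
        }

        // Depth-first traversal in document order.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, sb);
        }
    }

    // Parses a well-formed XHTML string with the JDK's DOM parser.
    static Node parse(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)))
                .getDocumentElement();
    }
}
```

Calling new StopIndexSketch().extract(StopIndexSketch.parse(page)) returns only the text outside the marked regions. In parse-html the same check would have to be adapted to the DOM walk that DOMContentUtils already performs, so treat this purely as an illustration.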

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
