Hi

Thanks for the offer, Erlend, but unfortunately I'm not a Java developer.  I
decided to drop Nutch and/or Tika in favour of a Python approach to
retrieve, parse and index the content needed.
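
For illustration only, here is a minimal sketch of how the stopindex /
startindex comment markers mentioned in the quoted message below could be
handled in Python. The function name strip_unindexed and the regular
expression are assumptions made for this example, not code actually used
in this thread, and a regex is only a rough substitute for a real HTML parser.

```python
import re

# Assumed convention (from the quoted message): everything between
# <!-- stopindex --> and <!-- startindex --> is excluded from indexing.
# If a stopindex marker is never closed, the rest of the page is dropped.
_SKIP_REGION = re.compile(
    r"<!--\s*stopindex\s*-->.*?(?:<!--\s*startindex\s*-->|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def strip_unindexed(html):
    """Return html with every stopindex...startindex region removed."""
    return _SKIP_REGION.sub("", html)
```

The cleaned string can then be passed on to whatever parser and indexer
are in use.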

I think that Nutch would have been too heavyweight for my purposes anyway.

Thanks again

Andrew

On 29 October 2010 09:35, Erlend Garåsen <[email protected]> wrote:

> On 28.10.10 13.01, Andrzej Bialecki wrote:
>
>> Hopefully this kind of functionality will be improved soon through the
>> use of the Boilerpipe functionality in Tika - but this is still to be
>> integrated into both Tika and Nutch.
>>
>
> Interesting. I also missed this kind of functionality in Nutch, but I found
> a workaround by adding a few lines in DOMContentUtils.java.
>
> I understand that Tika will replace parse-html, and therefore I wasn't sure
> whether I should file a Jira issue about this missing functionality. Anyway,
> I have found a simple way to stop parsing some of the content on a web site,
> for example content between HTML comments (<!-- stopindex --> .... <!--
> startindex -->) and for certain div tags.
>
> Andrew, I can send you the lines I added so it will be easier for you to
> write your filter. Or just simply add the lines to DOMContentUtils in case
> you're using parse-html and adapt the code.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
