Hi Julien,

On Wed, Feb 15, 2012 at 12:27 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> I assume Tika does already - why should we duplicate the tests in Nutch?

We don't want to, I suppose. The point I was trying to make, however, is
that NUTCH-1259 concerns detection of the encoding type, and we don't have
an automated test to cover this. I assume the case is somewhat important, or
else the ticket for NUTCH-1259 wouldn't have been opened originally. I agree
with you that general cases should be dealt with further upstream, within
Tika development itself. However, as the encoding detection is done in Nutch
within the cd metadata, we may wish to add a test case to check it... it's
not a huge thing I suppose.


> we delegate the functionality to Tika, IMHO this means delegating the
> testing as well. What we could do to contribute tests to Tika instead if it
> does not have any.
>
Yeah, this is correct. I'm expecting you guys will know better than me, but
I would assume that Tika is mimetype and encoding detection compliant ;0)


> Re-any23 : why not handling it as a Tika parser instead of a Nutch one?
> This could be useful to other Tika users who do not necessarily use Nutch
>
OK, so I suppose this is completely open for discussion, and I really welcome
it. On one hand, I see working with Any23 as a parse-any23 plugin
within Nutch as the first step on the road to answering this question.
Regardless of whether Any23 graduates and is integrated into Tika itself or
becomes a TLP, you are completely right that it should be made openly
available to as many people as possible. Personally I agree with you Julien.

One last thing, I know this is off topic... but with regards to our
microformats-reltag plugin... I think the RelTagParser could and should be
moved over to Any23. Any23 already supports extraction of a number of
microformats. wdyt?

Thanks
