Hi Julien,

On Wed, Feb 15, 2012 at 12:27 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> I assume Tika does already - why should we duplicate the tests in Nutch?

We don't want to, I suppose. However, the point I was trying to make is that NUTCH-1259 detects the encoding type, but we don't have an automated test to cover this. I assume the case is somewhat important, or else the ticket for NUTCH-1259 wouldn't have been opened originally? I agree with you that general cases should be dealt with further upstream within Tika development itself. However, as the encoding detection is done in Nutch within the cd metadata, we may wish to get some test case to check it... it's not a huge thing, I suppose.

> we delegate the functionality to Tika, IMHO this means delegating the
> testing as well. What we could do to contribute tests to Tika instead if it
> does not have any.

Yeah, this is correct. I expect you guys know better than me, but I would assume that Tika is mimetype and encoding detection compliant ;0)

> Re-any23 : why not handling it as a Tika parser instead of a Nutch one?
> This could be useful to other Tika users who do not necessarily use Nutch

OK, so I suppose this is completely open for discussion, and I really welcome it as well. On one hand, I see working with Any23 as a parse-any23 plugin within Nutch as the first step on the road to answering this question. Regardless of whether Any23 graduates and is integrated into Tika itself or as a TLP, you are completely right that it should be made openly available to as many people as possible. Personally, I agree with you, Julien.

One last thing, I know this is off topic... but with regards to our microformats-reltag plugin... I think the RelTagParser could and should be moved over to Any23. Any23 already supports extraction of a number of microformats. wdyt?

Thanks