Hi Jukka,

On Sun, Sep 12, 2010 at 5:46 PM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
But that also seems clunky. Any other suggestions?

A simpler approach would be to simply pass a list of already
instantiated Parser objects to AutoDetectParser, like this:

   public AutoDetectParser(Detector detector, Parser... parsers) {
       setDetector(detector);
       // Index each parser under every media type it reports as supported.
       // A later parser replaces an earlier one registered for the same type.
       Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
       ParseContext context = new ParseContext();
       for (Parser parser : parsers) {
           for (MediaType type : parser.getSupportedTypes(context)) {
               map.put(type, parser);
           }
       }
       setParsers(map);
   }

Thanks for the suggestion. This would work for the current 0.8 code base, so I might just go ahead and add that.

But I found a few other places that called TikaConfig.getDefaultConfig() besides AutoDetectParser():
        
 - Tika()
 - MediaTypeRegistry.getDefaultRegistry()

These don't seem to be used outside of test code, but I could easily see people adding calls to them (and getDefaultConfig).

Relying on nothing else in the Tika sub-system ever calling getDefaultConfig() seems fragile, so a more resilient solution would be good, especially since this is the second time this problem has bitten me during a big parse job (20M+ documents).

-- Ken


BTW, the need to pass a MediaType->Parser map to
CompositeParser.setParsers() is a remnant of the time when we didn't
have the Parser.getSupportedTypes() method. Nowadays it would probably
be better to simply pass a collection of parsers and use
getSupportedTypes() calls for dispatch during CompositeParser.parse().
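
To make the idea concrete, here is a minimal sketch of dispatching over a plain
list of parsers at parse time. It is not Tika code: SimpleParser, FixedParser
and ListCompositeParser are hypothetical stand-ins, media types are plain
strings, and parse() only returns a tagged string. The "last parser wins" rule
is an assumption chosen to match what map.put() does in the constructor above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a parser: reports its types and parses input.
interface SimpleParser {
    Set<String> getSupportedTypes();
    String parse(String type, String input);
}

// Hypothetical leaf parser that tags its output with its own name.
class FixedParser implements SimpleParser {
    private final String name;
    private final Set<String> types;

    FixedParser(String name, String... types) {
        this.name = name;
        this.types = new LinkedHashSet<String>(Arrays.asList(types));
    }

    public Set<String> getSupportedTypes() {
        return types;
    }

    public String parse(String type, String input) {
        return name + ":" + input;
    }
}

// Hypothetical composite: no MediaType->Parser map, just a list of parsers
// that is searched through getSupportedTypes() on every parse() call.
class ListCompositeParser implements SimpleParser {
    private final List<SimpleParser> parsers;

    ListCompositeParser(SimpleParser... parsers) {
        this.parsers = Arrays.asList(parsers);
    }

    public Set<String> getSupportedTypes() {
        Set<String> all = new LinkedHashSet<String>();
        for (SimpleParser parser : parsers) {
            all.addAll(parser.getSupportedTypes());
        }
        return all;
    }

    // Search from the end so that a later parser overrides an earlier one.
    SimpleParser findParser(String type) {
        for (int i = parsers.size() - 1; i >= 0; i--) {
            SimpleParser parser = parsers.get(i);
            if (parser.getSupportedTypes().contains(type)) {
                return parser;
            }
        }
        return null;
    }

    public String parse(String type, String input) {
        SimpleParser parser = findParser(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + type);
        }
        return parser.parse(type, input);
    }
}
```

The trade-off in this sketch is a linear scan per document instead of a map
lookup, in exchange for not having to rebuild a map when the parser list
changes.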

As an aside, what's the standard use case for specifying an explicit
classloader? I haven't seen this used in other projects, so I'm curious.

See TIKA-419 [1] for the relevant background.

[1] https://issues.apache.org/jira/browse/TIKA-419

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




