On Sep 11, 2010, at 1:17pm, Ken Krugler wrote:

On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <nick.bu...@alfresco.com> wrote:
Quite a lot of OfficeParser does depend on poifs code though, as well as a
few bits that depend on some of the less common POI text extractors.

It looks like a number of our other new parsers also have direct
dependencies on external libraries, so this problem is not just
related to the OfficeParser class.

The basic problem here is that the service loader used by the default
TikaConfig constructor throws an exception when it can't load a class
listed in a org.apache.tika.parser.Parser service file. The solution I
implemented in TIKA-378 for the 0.7 release was to move the external
parser library references to separate extractor classes so that the
parser class could be instantiated without problems. Unfortunately
this was a one-off solution that obviously hasn't survived further
development in the svn trunk.

The reason why I originally didn't want to simply catch and ignore the
potential exceptions in the TikaConfig constructor was the lack of a
good error reporting mechanism. The trick of insulating the external
library dependencies to separate extractor classes nicely solved that
problem by delaying the exceptions to the actual parse() method calls
on specific document types, which obviously would then give the end
user a much better idea of what's wrong.
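
For illustration, the insulation trick described above could look roughly like the following plain-Java sketch. The names SimpleParser, ExternalExtractor and InsulatedParser are hypothetical stand-ins and not Tika classes, and the extractor body is a placeholder for calls into a real external library. The sketch assumes the usual JVM behaviour of loading a referenced class lazily, on first use.

```java
import java.io.IOException;

// Hypothetical stand-in for a parser interface; not Tika's actual API.
interface SimpleParser {
    String parse(String input) throws IOException;
}

// All references to the (possibly missing) external library would live here.
// This class is typically not loaded until extract() is first called.
class ExternalExtractor {
    static String extract(String input) {
        // Placeholder: a real extractor would call the external library.
        return input.toUpperCase();
    }
}

// The parser itself has no fields or constructor code that touch the
// external library, so instantiating it should not trigger linkage errors.
class InsulatedParser implements SimpleParser {
    @Override
    public String parse(String input) throws IOException {
        try {
            return ExternalExtractor.extract(input);
        } catch (LinkageError e) {
            // NoClassDefFoundError is a LinkageError; report it at parse time.
            throw new IOException("Required library is not available", e);
        }
    }
}
```

With this layout a missing jar would only show up when parse() is called for the affected document type, not when the parser is instantiated.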

Perhaps the best solution would actually be to combine the above
approaches, i.e. to strive to maintain the parser/extractor separation
where possible and to use a catch block in the TikaConfig constructor
to catch and ignore any problems that the insulation approach fails to
address.
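
A minimal sketch of the catch-and-ignore half of that combination, assuming the parser class names are already available as a list of strings. TolerantLoader and its skipped list are hypothetical names for illustration only, not the TikaConfig code, and the sketch uses multi-catch syntax from newer Java versions.

```java
import java.util.ArrayList;
import java.util.List;

class TolerantLoader {
    // Names of classes that could not be loaded or instantiated.
    final List<String> skipped = new ArrayList<>();

    // Instantiates every class it can and skips the ones that fail.
    List<Object> loadAll(ClassLoader loader, List<String> classNames) {
        List<Object> instances = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> clazz = Class.forName(name, true, loader);
                instances.add(clazz.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // Covers a missing class, a missing dependency, and a
                // constructor that cannot be called or throws.
                skipped.add(name);
            }
        }
        return instances;
    }
}
```

Keeping the skipped names around, rather than dropping them silently, leaves the caller free to report them later.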

IIRC, the main concern about this approach is when people are using custom parsers, where instantiation exceptions can happen due to bugs in the actual parser (versus explicitly excluded jars). Quietly ignoring these errors defers the failure until much later, which can be a bad thing.

What I would propose is two changes:

1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor that can be used to instantiate all parsers from the variable list that are found using the ClassLoader. For example:

public TikaConfig(ClassLoader loader, Class<Parser>... targetParsers)
        throws MimeTypeException, IOException {
    for (Class<Parser> parserClass : targetParsers) {
        ParseContext context = new ParseContext();

        try {
            Parser parser = parserClass.newInstance();
            for (MediaType type : parser.getSupportedTypes(context)) {
                parsers.put(type, parser);
            }
        } catch (InstantiationException e) {
            throw new IOException(e);
        } catch (IllegalAccessException e) {
            throw new IOException(e);
        }
    }

    mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
}

So after looking again at the code snippet I threw together above, it's not using the provided ClassLoader. I could iterate over all parsers and catch/ignore errors for parsers not in the provided list, but that seems less than clean.

I don't have much experience with classloaders - I see that each Class instance has a classloader associated with it, so mapping from its own classloader to the provided classloader would need something like:

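// getName() returns the binary name that loadClass() expects;
// getCanonicalName() differs for nested classes.
Class<Parser> resolvedClass = (Class<Parser>) loader.loadClass(parserClass.getName());
Parser parser = resolvedClass.newInstance();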

But that also seems clunky. Any other suggestions?
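
One possible variant, sketched here only as an illustration: Class.forName with an explicit loader plus asSubclass avoids the unchecked cast. LoaderResolver and instantiateVia are hypothetical names, not part of Tika.

```java
class LoaderResolver {
    // Loads className through the given loader and instantiates it as a T.
    static <T> T instantiateVia(ClassLoader loader, Class<T> type, String className)
            throws ReflectiveOperationException {
        // asSubclass throws ClassCastException if the loaded class is not
        // a subtype of 'type' as seen by the caller's classloader.
        Class<? extends T> resolved =
                Class.forName(className, true, loader).asSubclass(type);
        return resolved.getDeclaredConstructor().newInstance();
    }
}
```

Note that if the provided loader defines its own copy of the interface, rather than delegating to the loader that defined the caller's copy, asSubclass would fail with a ClassCastException, so this only helps when both sides share the interface class.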

As an aside, what's the standard use case for specifying an explicit classloader? I haven't seen this used in other projects, so I'm curious.

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




