Hi,

Ken Krugler wrote:
> I'd been running a large web crawl in EC2, using a Hadoop job jar where
> I'd excluded all of the support jars used for Microsoft formats. This
> dramatically reduced the size of the job jar that I needed to constantly
> push to EC2 via a relatively slow DSL connection.
> 
> During the crawl, I ignored all responses that didn't have a mime-type
> of text/plain or one of the three HTML mime-types.
> 
> But I ran into a problem, where the Tika auto-detect code was correctly
> identifying a file as being a Microsoft format, even though the server
> said it was text/plain. The Tika Microsoft parser would try to
> dynamically figure out which support code to call, and in the end it
> throws a NoSuchMethodError.

This sounds very much like an interface/implementation mismatch. It
would be interesting to investigate how and why this happens.

> 
> Note that this is an Error, not an Exception. As such, it flies on past
> all of the Tika catch blocks, and my own code's catch blocks, and kills
> the Hadoop job in weird and wonderful ways.

Which may be a good thing, particularly in this situation.
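
To illustrate why the catch blocks never see it, here is a minimal sketch.
The class and method names are made up for illustration, and the error is
thrown by hand here, whereas the JVM would normally raise it while linking.
NoSuchMethodError is a LinkageError, which is an Error, so a
catch (Exception e) block does not match it:

```java
public class ErrorVersusException {

    /** Hypothetical stand-in for a parser call that hits a linkage problem. */
    static void parseWithBrokenClasspath() {
        // Thrown by hand for illustration only; the JVM normally raises this
        // when a compiled call site refers to a method the loaded class lacks.
        throw new NoSuchMethodError("hypothetical.Parser.parse()V");
    }

    /** Returns a description of which catch block handled the failure. */
    static String handle() {
        try {
            try {
                parseWithBrokenClasspath();
                return "no failure";
            } catch (Exception e) {
                // Not reached: NoSuchMethodError is not an Exception.
                return "caught as Exception";
            }
        } catch (LinkageError e) {
            // Reached: NoSuchMethodError -> IncompatibleClassChangeError
            // -> LinkageError -> Error -> Throwable.
            return "caught as LinkageError: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle());
    }
}
```

Whether to catch LinkageError at all is a judgment call: it hides a broken
classpath, which is why letting it propagate may be the better outcome.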

> 
> It seems like Errors shouldn't be thrown for situations where dynamic
> configuration could result in a class not existing, but before I started
> writing up an issue I wanted to get input from the community about this.
> It's a bit gray to me, since I essentially "did it to myself" by
> excluding jars.

The point seems to be that there is something odd about the actual class
being used: the class exists, but the requested method is missing from it.

To me this sounds very much like a bug, and a stack trace might give hints
as to who caused the Error and why.
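
Besides the stack trace, it may help to check which jar the class named in
the Error was actually loaded from. A rough sketch using only standard JDK
calls follows; the class name WhereLoadedFrom and the fallback string are my
own, and you would pass in the class named in your stack trace:

```java
import java.security.CodeSource;

public class WhereLoadedFrom {

    /** Returns the jar or directory a class was loaded from, if known. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no CodeSource.
        if (src == null) {
            return "<bootstrap or unknown>";
        }
        return String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        // Replace these with the class reported in the NoSuchMethodError.
        System.out.println(locationOf(String.class));
        System.out.println(locationOf(WhereLoadedFrom.class));
    }
}
```

If the location turns out to be an unexpected or older jar, that would
point to a version mismatch rather than a simply missing jar.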

Just my €.02 from outside and probably without enough knowledge ;-)

Regards
Felix
