The Parser conflict Jira issue I was referring to is TIKA-527 (Allow override mapping mime<-->parsers through config<https://issues.apache.org/jira/browse/TIKA-527>). We would need something similar for mime types.
The updates to MimeTypesFactory might address TIKA-87 (MimeTypes should allow modification of MIME types<https://issues.apache.org/jira/browse/TIKA-87>). The MimeType class wouldn't be modifiable, but the goal of loading mime types from multiple sources would be achieved.

-Tom

On Tue, Aug 23, 2011 at 3:58 PM, Tom Grant <tgr...@sms-fed.com> wrote:

> Nick,
> I'm happy to do the work and contribute it as a patch. I guess I'm just
> looking for advice on the approach to ensure that what I provide does
> actually get incorporated. My particular use case is solved by adding the
> update methods to MimeTypesFactory (see last message), but I'm in a scenario
> where I'm appending new types and not over-writing any of the existing
> ones. The over-writing use case would also bring in a conflict resolution
> requirement like the recent addition to the Parser loading logic.
>
> I personally like the approach of loading the standard
> /org/apache/tika/mime/tika-mimetypes.xml file, followed by any and all
> META-INF/tika-mimetypes.xml resources using the ServiceLoader class,
> followed by an optional tika-mimetypes.xml resource from the classpath for
> conflict resolution. This would handle my use case and the over-write one.
>
> I still like having the update methods on MimeTypesFactory because in my
> particular application we use multiple classloaders to isolate the Parser
> implementations, and it's easier for me to push the information to Tika than
> to have Tika pull the information from my application on startup.
>
> -Tom
>
> On Tue, Aug 23, 2011 at 7:20 AM, Nick Burch <nick.bu...@alfresco.com> wrote:
>
>> On Mon, 22 Aug 2011, Tom Grant wrote:
>>
>>> Here's the use case that I'm attempting to solve. I have a customer with
>>> many legacy systems, some of which are completely custom. These systems
>>> have data files that will never be seen outside of their environment. For
>>> example, some are XML files with their own schemas. Some are similar to the
>>> new office documents and are zip files containing xml and other goodies.
>>> Others are serialized objects dumped to disk. Some are similar to EDI with
>>> a header and data body with prescribed offsets. The choices of the past
>>> can't be undone and I'm stuck with about 30 or 40 different file types.
>>
>> Ah, so you have non-standard, custom and specific mimetypes that you're
>> allocating to these documents. I think we'd tended to think of the mimetypes
>> as always being like constants
>>
>>> The quantity of file types means that it's going to take a few months to
>>> complete and will happen a few at a time. So I'd like to co-locate the
>>> mimetype definition with the parser code for maintainability.
>>
>> Your best bet is probably to do a custom detector, and have that loaded by
>> the service loader the same way that the container aware detector now can
>> be. You can put that in your code along with your custom parsers
>>
>> I'm not sure what the best way to support this kind of need is. Some
>> options that spring to mind are:
>> * Loading multiple mimetype files, and merging them like we do for parser
>>   class loading
>> * Provide another detector that loads custom-mimetypes.xml files from the
>>   service loader (so you can have multiple ones) which are used for
>>   detection
>>
>> I guess it depends on if you'd expect to be able to work with the
>> hierarchy of the custom extra types or not?
>>
>> I'm not sure we should be providing ways to add a couple of extra types in
>> at a random point in time, as that'll potentially make things behave very
>> differently in a multithreaded environment. I'd rather that the extra types
>> were loaded once up front, in whichever way is supported
>>
>> Nick
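
For reference, here is a rough sketch of the loading order described in the quoted message (core file first, then every META-INF/tika-mimetypes.xml found on the classpath, then an optional override file), with later sources winning on conflict. It is not Tika code: the class MimeTypeMergeSketch, its methods, and the idea of reducing each file to a map from type name to glob pattern are all made up for illustration, and the XML parsing is left out. It finds the contributed files with ClassLoader.getResources, which is an assumption about how the lookup would be done rather than the ServiceLoader mechanism itself. The result is built once and is read-only, in line with Nick's point about loading the extra types up front.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Tika. */
public class MimeTypeMergeSketch {

    /** Read-only result: final definitions plus the type names that were overridden. */
    public static final class Merged {
        public final Map<String, String> definitions;
        public final List<String> overridden;

        Merged(Map<String, String> definitions, List<String> overridden) {
            this.definitions = Collections.unmodifiableMap(definitions);
            this.overridden = Collections.unmodifiableList(overridden);
        }
    }

    /** Finds every contributed definition file visible to the given class loader. */
    public static List<URL> findContributedFiles(ClassLoader loader) throws IOException {
        return Collections.list(loader.getResources("META-INF/tika-mimetypes.xml"));
    }

    /**
     * Merges already-parsed sources in the order given. When two sources
     * define the same type name differently, the later one wins and the
     * name is recorded so the conflict can be reported.
     */
    public static Merged merge(List<Map<String, String>> sourcesInOrder) {
        Map<String, String> definitions = new LinkedHashMap<>();
        List<String> overridden = new ArrayList<>();
        for (Map<String, String> source : sourcesInOrder) {
            for (Map.Entry<String, String> entry : source.entrySet()) {
                String previous = definitions.put(entry.getKey(), entry.getValue());
                boolean changed = previous != null && !previous.equals(entry.getValue());
                if (changed && !overridden.contains(entry.getKey())) {
                    overridden.add(entry.getKey());
                }
            }
        }
        return new Merged(definitions, overridden);
    }

    public static void main(String[] args) {
        // Stand-ins for the parsed contents of the three kinds of file.
        Map<String, String> core = new LinkedHashMap<>();
        core.put("application/xml", "*.xml");

        Map<String, String> contributed = new LinkedHashMap<>();
        contributed.put("application/x-legacy-edi", "*.edi");

        Map<String, String> override = new LinkedHashMap<>();
        override.put("application/xml", "*.xml,*.cfgx");

        Merged merged = merge(Arrays.asList(core, contributed, override));
        System.out.println("definitions: " + merged.definitions);
        System.out.println("overridden:  " + merged.overridden);
    }
}
```

In this example the appended type (application/x-legacy-edi) is simply added, while application/xml ends up with the override file's definition and is listed as overridden, which covers both the append and the over-write cases.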