The "Tika2_0RoadMap" page has been changed by NickBurch:
https://wiki.apache.org/tika/Tika2_0RoadMap?action=diff&rev1=8&rev2=9

Comment:
Some updates from my talk and from post-talk discussions

   * Solve the complex metadata challenge; see: [[https://issues.apache.org/jira/browse/TIKA-1607|TIKA-1607]] and [[https://issues.apache.org/jira/browse/TIKA-1691|TIKA-1691]] and [[http://mail-archives.apache.org/mod_mbox/incubator-tika-dev/201510.mbox/%[email protected]%3e|ISO 19115 discussion]] .... Or at least come to some accommodation that will allow for both easy key/values access and more advanced access for those who know what they're doing.
+  * Work out how to allow "resetting" or "augmenting" or "rewinding" of the SAX stream, to permit:
+   * We tried one parser, got halfway through and it failed, and now we want to try another
+   * We used one parser, that finished, now we want to run a second one (e.g. OCR)
+   * We finished one paragraph, then did NER on it, and want to update the HTML with the entities
+   * We want to mark the last 2 paragraphs as ''language=german'' or unmark ''language=english'' on the body now we've found some German text
+
+  * Parsers vs Content Handlers vs Decorators - Work out where we want "content enhancement" logic to live (Wrapping Parser? Decorator? Handler? Other). Then, ensure that can be configured in easily (config XML as well as code), can do what it needs, then shift things over to the new model if they're not there already
+
  = Major Completed / Mostly-Completed Changes =
   * Allow for easily configurable parser sub-packages. The tika-app, tika-server and tika-bundle jars are now pushing or are > 50MB. It would be great if users could easily specify a subset of parsers they care about, either a la carte or by category (image, common office files (MSOffice, PDF, etc.), environmental data) and only get the dependencies required for that subset of parsers.
@@ -32, +40 @@
   * Move to Java 1.8 (???)
+  * Have mail "folder" formats (such as mbox) behave more like other containers, triggering embedded documents for each of their mail messages rather than mushing everything together

  = Wishes =
   * Uniform representation of geo information from common files (kml/kmz/exif/others???) in metadata (perhaps [[https://en.wikipedia.org/wiki/Well-known_text|WKT]])?
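
One possible reading of the "resetting" / "rewinding" item in the diff above (first parser fails halfway, then we try another) is to buffer SAX events and only forward them once a parser has finished. The following is a minimal sketch under that assumption, using only the JDK's `org.xml.sax` classes. `BufferingHandler`, `FallbackDemo` and `fakeParse` are hypothetical names made up for illustration; they are not Tika APIs, and nothing here reflects a decision on how Tika 2.0 would do this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not part of Tika): records SAX events so they can be dropped or replayed. */
class BufferingHandler extends DefaultHandler {
    /** One recorded SAX event that can be replayed against any ContentHandler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Copy the attributes: a parser is allowed to reuse the Attributes object.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(t -> t.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the caller's buffer may be overwritten later.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    /** "Rewind": discard everything emitted so far, e.g. by a parser that failed. */
    void reset() {
        events.clear();
    }

    /** Forward the buffered events to the real handler once a parser has succeeded. */
    void replayTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
    }
}

class FallbackDemo {
    /** Stand-in for a real parser: emits one paragraph, optionally failing before closing it. */
    static void fakeParse(ContentHandler h, String text, boolean fail) throws SAXException {
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        if (fail) {
            throw new SAXException("parser gave up half way");
        }
        h.endElement("", "p", "p");
    }

    /** Tries a failing parser, rewinds, tries a second one, then serialises what was kept. */
    static String run() {
        BufferingHandler buffer = new BufferingHandler();
        StringBuilder out = new StringBuilder();
        try {
            try {
                fakeParse(buffer, "partial output", true);
            } catch (SAXException firstFailure) {
                buffer.reset(); // throw away the half-written paragraph
                fakeParse(buffer, "good output", false);
            }
            buffer.replayTo(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    out.append('<').append(q).append('>');
                }

                @Override
                public void endElement(String u, String l, String q) {
                    out.append("</").append(q).append('>');
                }

                @Override
                public void characters(char[] ch, int s, int len) {
                    out.append(ch, s, len);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling `FallbackDemo.run()` returns `<p>good output</p>`: the failed parser's partial paragraph never reaches the downstream handler. The obvious cost is that the whole event stream is held in memory, so this only covers the "try another parser" case; the "augmenting" cases in the list (adding NER entities, re-tagging the language of earlier paragraphs) would need an editable buffer rather than a plain replay.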
