Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "Tika2_0RoadMap" page has been changed by NickBurch:
https://wiki.apache.org/tika/Tika2_0RoadMap?action=diff&rev1=8&rev2=9

Comment:
Some updates from my talk and from post-talk discussions

  
   * Solve the complex metadata challenge; see: 
[[https://issues.apache.org/jira/browse/TIKA-1607|TIKA-1607]] and 
[[https://issues.apache.org/jira/browse/TIKA-1691|TIKA-1691]] and 
[[http://mail-archives.apache.org/mod_mbox/incubator-tika-dev/201510.mbox/%[email protected]%3e|ISO
 19115 discussion]] .... Or at least come to some accommodation that will allow 
for both easy key/values access and more advanced access for those who know 
what they're doing.
  
+  * Work out how to allow "resetting" or "augmenting" or "rewinding" of the 
SAX stream, to permit:
+   * We tried one parser, got half way through and it failed, and now we want 
to try another
+   * We used one parser, which finished, and now we want to run a second one 
(e.g. OCR)
+   * We finished one paragraph, then did NER on it, and want to update the 
HTML with the entities
+   * We want to mark the last 2 paragraphs as ''language=german'' or unmark 
''language=english'' on the body now that we've found some German text
+ 
+  * Parsers vs Content Handlers vs Decorators - Work out where we want 
"content enhancement" logic to live (Wrapping Parser? Decorator? Handler? 
Other?). Then ensure that it can be configured easily (via config XML as well 
as code) and can do what it needs, and shift things over to the new model if 
they're not there already
+ 
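One possible shape for the "resetting" idea in the SAX stream item above is a 
handler that records events and only forwards them once a parser has succeeded. 
This is only a sketch under assumptions: `BufferingHandler`, `reset()` and 
`commit()` are hypothetical names, not existing Tika classes or methods, and 
only three event types are buffered. It uses nothing beyond the JDK's 
`org.xml.sax` classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not part of Tika): buffers SAX events until commit. */
class BufferingHandler extends DefaultHandler {

    /** One recorded SAX event that can be replayed against another handler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Copy the attributes: SAX sources may reuse the Attributes instance.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(t -> t.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the caller's buffer is only valid during this call.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    /** Discard everything recorded so far, e.g. after a parser failed halfway. */
    void reset() {
        events.clear();
    }

    /** Replay the buffered events to the real handler, then clear the buffer. */
    void commit(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}
```

A caller could hand this handler to the first parser, call `reset()` if that 
parser throws, try the next parser, and call `commit(realHandler)` once one 
succeeds. The trade-off is that buffering holds the whole event stream in 
memory, which may not suit very large documents.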
  = Major Completed / Mostly-Completed Changes =
  
   * Allow for easily configurable parser sub-packages.  The tika-app, 
tika-server and tika-bundle jars are now approaching or exceeding 50MB.  It 
would be great if users could easily specify a subset of parsers they care 
about, either a la carte or by category (image, common office files (MSOffice, 
PDF, etc.), environmental data) and only get the dependencies required for 
that subset of parsers. 
@@ -32, +40 @@

  
   * Move to Java 1.8 (???)
  
+  * Have mail "folder" formats (such as mbox) behave more like other 
containers, triggering embedded documents for each of their mail messages 
rather than mushing everything together
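
To illustrate the mbox item above, here is a rough sketch of splitting an mbox 
file into individual messages, each of which could then be treated as an 
embedded document. `MboxSplitter` and `split` are hypothetical names, not Tika 
APIs. The sketch assumes every line starting with "From " is a message 
separator, and it ignores the escaping of body lines (such as ">From ") that 
some mbox variants use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika): splits mbox text into messages. */
class MboxSplitter {

    /** Returns one string per message, without the "From " separator line. */
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\n", -1)) {
            if (line.startsWith("From ")) {
                // Assumed separator: close the previous message, start a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
                continue;
            }
            // Lines before the first separator are ignored.
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned string holds the headers and body of one message, so a container 
style parser could pass them one at a time to whatever handles embedded 
documents.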
  
  = Wishes =
   * Uniform representation of geo information from common files 
(kml/kmz/exif/others???) in metadata (perhaps 
[[https://en.wikipedia.org/wiki/Well-known_text|WKT]])? 
