Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=2&rev2=3

Comment:
Start on config

  The right strategy for one user may not be the right one for another. The 
right strategy for one file may not be the right one for another. We therefore 
need to allow users to pick their strategy, on an overall basis, and on a 
per-file basis.
  
  == From TikaConfig ==
- ''TODO''
+ Currently, a great many Tika users just call 
{{{TikaConfig.getDefaultConfig()}}} and go with that.
+ 
+ It might be nice if they could also do things like 
{{{TikaConfig.getMaximumMetadataConfig()}}} or 
{{{TikaConfig.getTryEachInTurnConfig()}}} to pick a different strategy.
+ 
+ (Naming TBC, align with above)
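
As a rough illustration of what two such presets might mean, here is a minimal, self-contained Java sketch. Everything in it is hypothetical: `StrategySketch`, `SketchParser`, and the `Strategy` enum are illustrative stand-ins and not Tika API, and the semantics are assumed ("try each in turn" returns the first successful result; "maximum metadata" runs every parser and merges what they return, keeping the first value seen for each key).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch only; none of these names exist in Tika. */
final class StrategySketch {

    /** Stand-in for a parser: takes a file name, returns metadata, or throws on failure. */
    interface SketchParser extends Function<String, Map<String, String>> {}

    /** Assumed strategies, loosely matching the preset names discussed above. */
    enum Strategy { TRY_EACH_IN_TURN, MAXIMUM_METADATA }

    static Map<String, String> parse(Strategy strategy, List<SketchParser> parsers, String file) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (SketchParser parser : parsers) {
            try {
                Map<String, String> result = parser.apply(file);
                if (strategy == Strategy.TRY_EACH_IN_TURN) {
                    return result; // the first parser that succeeds wins
                }
                // MAXIMUM_METADATA: keep the first value seen per key, add any new keys
                result.forEach(merged::putIfAbsent);
            } catch (RuntimeException e) {
                // This parser failed; move on to the next one
            }
        }
        // Empty if every parser failed, or the merged metadata for MAXIMUM_METADATA
        return merged;
    }

    public static void main(String[] args) {
        SketchParser failing = f -> { throw new IllegalStateException("cannot parse " + f); };
        SketchParser image = f -> Map.of("width", "640");
        SketchParser ocr = f -> Map.of("text", "hello", "width", "641");
        List<SketchParser> parsers = List.of(failing, image, ocr);

        System.out.println(parse(Strategy.TRY_EACH_IN_TURN, parsers, "photo.jpg"));
        System.out.println(parse(Strategy.MAXIMUM_METADATA, parsers, "photo.jpg"));
    }
}
```

A real preset would presumably hand back a configured `TikaConfig` rather than a map of metadata; the sketch only shows how the two strategies could differ in behaviour.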
  
  == With a Tika Configuration file ==
- ''TODO''
+ Users may wish to have full control over which parsers are used, which 
strategies are used for which MIME types, and so on.
+ 
+ For example, they might want the default behaviour for most types, but send 
XML through a fallback parser, and combine Image + GDAL + OCR for JPEG. The 
configuration file needs to support this.
+ 
+ {{{
+   <parsers>
+     <!-- Most things can use the default -->
+     <parser class="org.apache.tika.parser.DefaultParser">
+       <mime-exclude>image/jpeg</mime-exclude>
+       <mime-exclude>application/xml</mime-exclude>
+       <mime-exclude>application/pdf</mime-exclude>
+     </parser>
+ 
+     <!-- No PDF, thank you! -->
+     <parser class="org.apache.tika.parser.EmptyParser">
+       <mime>application/pdf</mime>
+     </parser>
+ 
+     <!-- JPEG needs special handling -->
+     <!-- XML needs special handling -->
+   </parsers>
+ }}}
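
To make the intended routing concrete, the following Java sketch models how the `<mime>` and `<mime-exclude>` children in the fragment above could decide which parser handles a given type. It is an assumption-laden model, not Tika's config loader: `ParserRoutingSketch` and `Entry` are hypothetical names, and it assumes an entry with no `<mime>` children accepts any type it does not exclude. The JPEG and XML entries are left out because the fragment only has placeholders for them.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of the <parsers> element; not Tika's actual config loader. */
final class ParserRoutingSketch {

    /** One <parser> entry: its class name plus its <mime> and <mime-exclude> children. */
    static final class Entry {
        final String parserClass;
        final Set<String> mimes;
        final Set<String> mimeExcludes;

        Entry(String parserClass, Set<String> mimes, Set<String> mimeExcludes) {
            this.parserClass = parserClass;
            this.mimes = mimes;
            this.mimeExcludes = mimeExcludes;
        }

        boolean accepts(String mediaType) {
            if (mimeExcludes.contains(mediaType)) {
                return false;
            }
            // Assumption: no <mime> children means "every type not excluded"
            return mimes.isEmpty() || mimes.contains(mediaType);
        }
    }

    /** The two entries spelled out in the configuration fragment. */
    static final List<Entry> ENTRIES = List.of(
        new Entry("org.apache.tika.parser.DefaultParser",
                  Set.of(),
                  Set.of("image/jpeg", "application/xml", "application/pdf")),
        new Entry("org.apache.tika.parser.EmptyParser",
                  Set.of("application/pdf"),
                  Set.of()));

    /** Returns the class name of the first entry that accepts the type, if any. */
    static Optional<String> resolve(String mediaType) {
        return ENTRIES.stream()
                .filter(e -> e.accepts(mediaType))
                .map(e -> e.parserClass)
                .findFirst();
    }

    public static void main(String[] args) {
        for (String type : List.of("text/plain", "application/pdf", "image/jpeg")) {
            System.out.println(type + " -> " + resolve(type).orElse("(no parser configured)"));
        }
    }
}
```

Under this model, PDF is routed to the empty parser, most other types fall to the default parser, and JPEG and XML match nothing until their special-handling entries are written.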
  
  == In Code ==
  ''TODO''
