[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586235#comment-14586235
 ] 

Tim Allison commented on TIKA-1508:
-----------------------------------

To respond to Nick's point about including concrete examples: I think there 
are two parsers that use a .properties file to set their baseline 
configuration, TesseractOCR and PDFParser.  The external parser configuration 
is far more complex and, I think, beyond the scope of this issue.

So for the PDFParser, there are these Boolean properties:

{noformat}
enableAutospace                   true
extractAnnotationText             true
sortByPosition                    false
suppressDuplicateOverlappingText  false
useNonSequentialParser            false
extractAcroFormContent            true
extractInlineImages               false
extractUniqueInlineImagesOnly     true
checkExtractAccessPermission      false
allowExtractionForAccessibility   true
{noformat}

These can be set via the .properties file, or through code by passing in a 
PDFParserConfig object via the ParseContext or via direct calls to setters in 
PDFParser.
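
As a rough illustration of the .properties route (a sketch, not Tika's actual 
loading code): the name/value pairs listed above are already valid input for 
java.util.Properties, because whitespace counts as a key/value separator.  The 
class name PdfDefaultsSketch and the helpers load() and flag() are made up for 
this example, and SAMPLE holds only a few of the properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Tika: reads whitespace-separated
// boolean defaults such as the PDFParser ones listed above.
public class PdfDefaultsSketch {

    // A few of the defaults from the list, in the same "name value" form.
    static final String SAMPLE =
            "enableAutospace true\n"
          + "sortByPosition  false\n"
          + "extractInlineImages false\n";

    // Parse the text with the standard Properties loader.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Look up one flag, falling back to a caller-supplied default when
    // the property is missing.
    static boolean flag(Properties props, String name, boolean fallback) {
        String raw = props.getProperty(name);
        return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + " = " + flag(props, name, false));
        }
    }
}
```

In real code, values read this way would then be handed to the matching 
setters on a PDFParserConfig, which is the object passed via the ParseContext.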

For this issue, I propose a simple unifying mechanism to specify default 
behavior for parser parameters so that a user can see and manipulate the 
options from one file.

I don't think this proposal would break any dynamic code that clients are 
using via ParseContext (but I've had failures of imagination before...).
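
To make the idea a bit more concrete, here is a minimal sketch of how the 
<params> block from the issue description might be read into a name/value map 
with the JDK's DOM parser.  This is only one possible shape, not an agreed 
design: the class ParserParamsSketch and the method readParams() are 
hypothetical, and the <bool> element is my assumption (the issue description 
shows only <int> and <str>), added because the PDFParser options are Boolean.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader for the proposed <params> element; not Tika code.
public class ParserParamsSketch {

    // The example from the issue description, as a single string.
    static final String SAMPLE =
            "<parser class=\"org.apache.tika.parser.audio.AudioParser\">"
          + "<params>"
          + "<int name=\"someparam1\">2</int>"
          + "<str name=\"someOtherParam2\">something or other</str>"
          + "</params>"
          + "<mime>audio/basic</mime>"
          + "</parser>";

    // Returns the typed parameters of the first <params> element, in
    // document order. Unknown element names are kept as plain strings.
    static Map<String, Object> readParams(String xml) {
        Map<String, Object> params = new LinkedHashMap<>();
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList blocks = root.getElementsByTagName("params");
            if (blocks.getLength() == 0) {
                return params; // no parameters configured
            }
            NodeList children = blocks.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                Element e = (Element) node;
                String name = e.getAttribute("name");
                String text = e.getTextContent().trim();
                switch (e.getTagName()) {
                    case "int":
                        params.put(name, Integer.valueOf(text));
                        break;
                    case "bool": // assumed element, not in the issue text
                        params.put(name, Boolean.valueOf(text));
                        break;
                    default: // "str" and anything unrecognized
                        params.put(name, text);
                }
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not read params", ex);
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(readParams(SAMPLE));
    }
}
```

A parser could apply such a map as its defaults at construction time, while 
anything passed in through the ParseContext at parse time still takes 
precedence, which is why I don't expect existing client code to be affected.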

> Add uniformity to parser parameter configuration
> ------------------------------------------------
>
>                 Key: TIKA-1508
>                 URL: https://issues.apache.org/jira/browse/TIKA-1508
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>             Fix For: 1.10
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
>     <parser class="org.apache.tika.parser.audio.AudioParser">
>       <params>
>         <int name="someparam1">2</int>
>         <str name="someOtherParam2">something or other</str>
>       </params>
>       <mime>audio/basic</mime>
>       <mime>audio/x-aiff</mime>
>       <mime>audio/x-wav</mime>
>     </parser>
> {noformat}



