Yeah, we should read the props from the Tika config and flow them into the ParseContext.


> On Jun 15, 2015, at 8:57 AM, Tim Allison (JIRA) <j...@apache.org> wrote:
> 
> 
>    [ 
> https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586235#comment-14586235
>  ] 
> 
> Tim Allison commented on TIKA-1508:
> -----------------------------------
> 
> To respond to Nick's point about including concrete examples, I think that 
> there are two parsers that use a .properties file to set their baseline 
> configuration: TesseractOCR and PDFParser.  The external parser configuration 
> is far more complex and, I think, beyond the scope of this issue.
> 
> So for the PDFParser, there are these Boolean properties:
> 
> {noformat}
> enableAutospace                     true
> extractAnnotationText               true
> sortByPosition                      false
> suppressDuplicateOverlappingText    false
> useNonSequentialParser              false
> extractAcroFormContent              true
> extractInlineImages                 false
> extractUniqueInlineImagesOnly       true
> checkExtractAccessPermission        false
> allowExtractionForAccessibility     true
> {noformat}
> 
> These can be set via the .properties file, or in code, either by passing a 
> PDFParserConfig object through the ParseContext or by calling the setters on 
> PDFParser directly.
> 
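
To make the "defaults from a file, then override in code" pattern concrete, here is a rough sketch using only java.util.Properties. PdfDefaultsSketch is a hypothetical stand-in, not Tika's PDFParserConfig, and it only models two of the flags listed above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical stand-in for a parser config object; not Tika's PDFParserConfig.
class PdfDefaultsSketch {
    boolean sortByPosition;
    boolean extractInlineImages;

    // Load baseline values from .properties-style text (whitespace-separated pairs).
    static PdfDefaultsSketch fromProperties(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        PdfDefaultsSketch cfg = new PdfDefaultsSketch();
        // Missing keys fall back to false.
        cfg.sortByPosition =
                Boolean.parseBoolean(props.getProperty("sortByPosition", "false"));
        cfg.extractInlineImages =
                Boolean.parseBoolean(props.getProperty("extractInlineImages", "false"));
        return cfg;
    }

    public static void main(String[] args) {
        PdfDefaultsSketch cfg =
                fromProperties("sortByPosition    false\nextractInlineImages false\n");
        cfg.extractInlineImages = true; // programmatic override of the file default
        System.out.println(cfg.sortByPosition + " " + cfg.extractInlineImages);
    }
}
```

The point is only that the file supplies the baseline and code can still change individual values afterwards.
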
> For this issue, I propose a simple unifying mechanism to specify default 
> behavior for parser parameters so that a user can see and manipulate the 
> options from one file.
> 
> This proposal would not break (I don't think; but I've had failures of 
> imagination before...) any dynamic code that clients are using via 
> ParseContext.
> 
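
One way to keep existing ParseContext users unaffected would be a simple precedence rule: whatever the caller put in the context wins, and the value read from the config file is only a fallback. A minimal sketch of that lookup, assuming a class-keyed context; ContextSketch and OcrSettings are hypothetical names, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class-keyed context; not Tika's ParseContext.
class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Return the caller-supplied value if present, else the config-file default.
    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : fallback;
    }
}

// Hypothetical per-parser settings object.
class OcrSettings {
    final String language;

    OcrSettings(String language) {
        this.language = language;
    }
}
```

A parser would then ask the context for its settings and pass the config-file default as the fallback, so code that already sets a config object on the context keeps its current behavior.
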
>> Add uniformity to parser parameter configuration
>> ------------------------------------------------
>> 
>>                Key: TIKA-1508
>>                URL: https://issues.apache.org/jira/browse/TIKA-1508
>>            Project: Tika
>>         Issue Type: Improvement
>>           Reporter: Tim Allison
>>            Fix For: 1.10
>> 
>> 
>> We can currently configure parsers by the following means:
>> 1) programmatically by direct calls to the parsers or their config objects
>> 2) sending in a config object through the ParseContext
>> 3) modifying .properties files for specific parsers (e.g. PDFParser)
>> Rather than scattering the landscape with .properties files for each parser, 
>> it would be great if we could specify parser parameters in the main config 
>> file, something along the lines of this:
>> {noformat}
>>    <parser class="org.apache.tika.parser.audio.AudioParser">
>>      <params>
>>        <int name="someparam1">2</int>
>>        <str name="someOtherParam2">something or other</str>
>>      </params>
>>      <mime>audio/basic</mime>
>>      <mime>audio/x-aiff</mime>
>>      <mime>audio/x-wav</mime>
>>    </parser>
>> {noformat}
> 
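
For what it's worth, reading a params block like the one above is not much code. Below is a rough sketch with the JDK's DOM parser, assuming only the int and str element types shown in the example; ParamsSketch and readParams are made-up names, and nothing here is existing Tika config code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; assumes the <params> layout proposed in the issue.
class ParamsSketch {
    // Read <int>/<str> children of the first <params> element into a typed map.
    static Map<String, Object> readParams(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
        Map<String, Object> params = new LinkedHashMap<>();
        NodeList paramsNodes = doc.getElementsByTagName("params");
        if (paramsNodes.getLength() == 0) {
            return params; // no <params> block: nothing to configure
        }
        NodeList children = paramsNodes.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes
            }
            Element e = (Element) node;
            String name = e.getAttribute("name");
            String text = e.getTextContent().trim();
            switch (e.getTagName()) {
                case "int":
                    params.put(name, Integer.parseInt(text));
                    break;
                case "str":
                    params.put(name, text);
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Unsupported param type: " + e.getTagName());
            }
        }
        return params;
    }
}
```

The resulting map could then be handed to the parser, or turned into a config object and placed on the context, depending on which way the design goes.
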
> 
> 
