Yeah, we should read the props from the Tika config and flow them into the ParseContext.
> On Jun 15, 2015, at 8:57 AM, Tim Allison (JIRA) <j...@apache.org> wrote:
>
>
>     [ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586235#comment-14586235 ]
>
> Tim Allison commented on TIKA-1508:
> -----------------------------------
>
> To respond to Nick's point about including concrete examples, I think that
> there are two parsers that use a .properties file to configure baseline
> configuration: TesseractOCR and PDFParser. The external parser configuration
> is far more complex and beyond the scope of this issue...I think.
>
> So for the PDFParser, there are these Boolean properties:
>
> {noformat}
> enableAutospace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> useNonSequentialParser false
> extractAcroFormContent true
> extractInlineImages false
> extractUniqueInlineImagesOnly true
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
> {noformat}
>
> These can be set via the .properties file, or through code by passing in a
> PDFParserConfig object via the ParseContext or via direct calls to setters in
> PDFParser.
>
> For this issue, I propose a simple unifying mechanism to specify default
> behavior for parser parameters so that a user can see and manipulate the
> options from one file.
>
> This proposal would not break (I don't think; but I've had failures of
> imagination before...) any dynamic code that clients are using via
> ParseContext.
>
>> Add uniformity to parser parameter configuration
>> ------------------------------------------------
>>
>>                 Key: TIKA-1508
>>                 URL: https://issues.apache.org/jira/browse/TIKA-1508
>>             Project: Tika
>>          Issue Type: Improvement
>>            Reporter: Tim Allison
>>             Fix For: 1.10
>>
>>
>> We can currently configure parsers by the following means:
>> 1) programmatically by direct calls to the parsers or their config objects
>> 2) sending in a config object through the ParseContext
>> 3) modifying .properties files for specific parsers (e.g. PDFParser)
>> Rather than scattering the landscape with .properties files for each parser,
>> it would be great if we could specify parser parameters in the main config
>> file, something along the lines of this:
>> {noformat}
>> <parser class="org.apache.tika.parser.audio.AudioParser">
>>   <params>
>>     <int name="someparam1">2</int>
>>     <str name="someOtherParam2">something or other</str>
>>   </params>
>>   <mime>audio/basic</mime>
>>   <mime>audio/x-aiff</mime>
>>   <mime>audio/x-wav</mime>
>> </parser>
>> {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
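
To make the idea concrete, here is a rough sketch of how the <params> block proposed in the issue might be read into a typed map, which could then be handed to a parser or placed in the ParseContext. This is only an illustration under assumptions: ParserParamsSketch and readParams are hypothetical names, not Tika API, and the <bool> element is my own guess at how the Boolean PDFParser properties would be expressed (the issue only shows <int> and <str>). It uses nothing beyond the JDK's DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of Tika: reads the <params> block
// sketched in TIKA-1508 into a name -> typed value map.
class ParserParamsSketch {

    static Map<String, Object> readParams(String parserXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(parserXml.getBytes(StandardCharsets.UTF_8)));

        // LinkedHashMap keeps the parameters in the order they appear in the file.
        Map<String, Object> params = new LinkedHashMap<>();
        NodeList paramsNodes = doc.getDocumentElement().getElementsByTagName("params");
        if (paramsNodes.getLength() == 0) {
            return params; // a parser without <params> simply has no overrides
        }

        NodeList children = paramsNodes.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between elements
            }
            Element param = (Element) node;
            String name = param.getAttribute("name");
            String text = param.getTextContent().trim();
            switch (param.getTagName()) {
                case "int":
                    params.put(name, Integer.valueOf(text));
                    break;
                case "str":
                    params.put(name, text);
                    break;
                case "bool": // assumed element name, not shown in the issue
                    params.put(name, Boolean.valueOf(text));
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Unsupported param type: " + param.getTagName());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<parser class=\"org.apache.tika.parser.audio.AudioParser\">"
              + "<params>"
              + "<int name=\"someparam1\">2</int>"
              + "<str name=\"someOtherParam2\">something or other</str>"
              + "</params>"
              + "<mime>audio/basic</mime>"
              + "</parser>";
        System.out.println(readParams(xml));
    }
}
```

A map like this would only supply defaults. Anything a client sets dynamically through the ParseContext would still take precedence, which is what keeps the proposal from breaking existing code.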