[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586235#comment-14586235 ]
Tim Allison commented on TIKA-1508:
-----------------------------------

To respond to Nick's point about including concrete examples, I think that there are two parsers that use a .properties file to configure baseline configuration: TesseractOCR and PDFParser. The external parser configuration is far more complex and beyond the scope of this issue...I think.

So for the PDFParser, there are these Boolean properties:
{noformat}
enableAutospace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
useNonSequentialParser false
extractAcroFormContent true
extractInlineImages false
extractUniqueInlineImagesOnly true
checkExtractAccessPermission false
allowExtractionForAccessibility true
{noformat}

These can be set via the .properties file, or through code by passing in a PDFParserConfig object via the ParseContext or via direct calls to setters in PDFParser.

For this issue, I propose a simple unifying mechanism to specify default behavior for parser parameters so that a user can see and manipulate the options from one file. This proposal would not break (I don't think; but I've had failures of imagination before...) any dynamic code that clients are using via ParseContext.

> Add uniformity to parser parameter configuration
> ------------------------------------------------
>
>                 Key: TIKA-1508
>                 URL: https://issues.apache.org/jira/browse/TIKA-1508
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>             Fix For: 1.10
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser,
> it would be great if we could specify parser parameters in the main config
> file, something along the lines of this:
> {noformat}
> <parser class="org.apache.tika.parser.audio.AudioParser">
>   <params>
>     <int name="someparam1">2</int>
>     <str name="someOtherParam2">something or other</str>
>   </params>
>   <mime>audio/basic</mime>
>   <mime>audio/x-aiff</mime>
>   <mime>audio/x-wav</mime>
> </parser>
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
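
For illustration only, here is a minimal sketch of how the {{<params>}} block proposed in the issue description might be read into typed default values. It uses only the JDK's DOM parser and is not Tika's implementation: the class {{ParserParamsSketch}}, the method {{readParams}}, and the {{<bool>}} element are assumptions made for this example (the issue itself shows only {{<int>}} and {{<str>}}; {{<bool>}} is added because the PDFParser properties above are all Boolean).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads the <params> block of a
// single <parser> element into a name -> typed value map.
class ParserParamsSketch {

    static Map<String, Object> readParams(String parserXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        parserXml.getBytes(StandardCharsets.UTF_8)));

        // LinkedHashMap keeps the parameters in the order they appear in the file.
        Map<String, Object> params = new LinkedHashMap<>();
        NodeList paramsNodes =
                doc.getDocumentElement().getElementsByTagName("params");
        if (paramsNodes.getLength() == 0) {
            return params; // no <params> block: nothing to override
        }

        NodeList children = paramsNodes.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments
            }
            Element e = (Element) node;
            String name = e.getAttribute("name");
            String text = e.getTextContent().trim();
            switch (e.getTagName()) {
                case "int":
                    params.put(name, Integer.valueOf(text));
                    break;
                case "bool": // assumed element name, not shown in the issue
                    params.put(name, Boolean.valueOf(text));
                    break;
                default:     // "str" and anything unrecognized stay as text
                    params.put(name, text);
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        String xml =
              "<parser class=\"org.apache.tika.parser.audio.AudioParser\">"
            + "<params>"
            + "<int name=\"someparam1\">2</int>"
            + "<str name=\"someOtherParam2\">something or other</str>"
            + "</params>"
            + "<mime>audio/basic</mime>"
            + "</parser>";
        System.out.println(readParams(xml));
    }
}
```

A map like this would only supply defaults. A PDFParserConfig passed through the ParseContext, or a direct setter call, could still take precedence at parse time, which is in line with the intent that existing dynamic client code keeps working.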