[jira] [Created] (TIKA-1897) Too many daemon threads when NamedEntityParser is enabled
Manali Shah created TIKA-1897:
-

Summary: Too many daemon threads when NamedEntityParser is enabled
Key: TIKA-1897
URL: https://issues.apache.org/jira/browse/TIKA-1897
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.12
Environment: MAC_OS_X 10.10.5; JDK 1.8.0_45; Tika Version 1.13-SNAPSHOT
Reporter: Manali Shah

Thread Dump:
{code}
"Apache Tika" #2410 daemon prio=5 os_prio=31 tid=0x7fa12cb22800 nid=0x101103 in Object.wait() [0x0001a78c8000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.io.PipedReader.receive(PipedReader.java:185)
    - eliminated <0x000797c69830> (a java.io.PipedReader)
    at java.io.PipedReader.receive(PipedReader.java:206)
    - locked <0x000797c69830> (a java.io.PipedReader)
    at java.io.PipedWriter.write(PipedWriter.java:150)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
    at org.apache.tika.parser.ner.NamedEntityParser.extractOutput(NamedEntityParser.java:172)
    at org.apache.tika.parser.ner.NamedEntityParser.parse(NamedEntityParser.java:154)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:235)
    at java.lang.Thread.run(Thread.java:745)

"Apache Tika" #2409 daemon prio=5 os_prio=31 tid=0x7fa12cb21800 nid=0x100f03 in Object.wait() [0x0001a77c5000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.io.PipedReader.receive(PipedReader.java:185)
    - eliminated <0x000797a477c8> (a java.io.PipedReader)
    at java.io.PipedReader.receive(PipedReader.java:206)
    - locked <0x000797a477c8> (a java.io.PipedReader)
    at java.io.PipedWriter.write(PipedWriter.java:150)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187816#comment-15187816 ]

Tim Allison commented on TIKA-1508:
---

sketch of ParamValue...
{code}
public class ParamValue {

    enum Type { BOOLEAN, INTEGER, LONG, FLOAT, DOUBLE, STRING }

    // the declared type and the raw string value as read from the config
    final Type type;
    final String val;

    public ParamValue(int intVal) {
        this.type = Type.INTEGER;
        this.val = Integer.toString(intVal);
    }

    //...

    public ParamValue(Type type, String value) {
        this.type = type;
        this.val = value;
    }

    public boolean getBoolean() {
        // refuse to reinterpret a value that was declared with another type
        if (type != Type.BOOLEAN) {
            throw new IllegalArgumentException("can't cast a " + type + " to a boolean");
        }
        if ("true".equals(val)) {
            return true;
        } else if ("false".equals(val)) {
            return false;
        }
        throw new IllegalArgumentException("Couldn't parse " + val
                + " as a boolean; must be 'true' or 'false'");
    }
}
{code}

> Add uniformity to parser parameter configuration
>
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser,
> it would be great if we could specify parser parameters in the main config
> file, something along the lines of this:
> {noformat}
>
>
> 2
> something or other
>
> audio/basic
> audio/x-aiff
> audio/x-wav
>
> {noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187805#comment-15187805 ]

Tim Allison commented on TIKA-1508:
---

[~gagravarr] and [~chrismattmann], I agree about the distinction, but if we can use the same mechanism for both, let's do it? What do you think of a compromise...

A user can put a Map into the ParseContext, keyed by the class of the parser that is supposed to use it.

Want to set the PDF config?
{code}
context.set(PDFParser.class, pdfParams);
{code}

How about parameters for the RTFParser with clashing parameter names? No problem:
{code}
context.set(RTFParser.class, rtfParams);
{code}

Each configurable parser would then be responsible for checking the context for a value keyed to its own class and setting its own parameters from it.
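To make the compromise above concrete, here is a minimal sketch of per-parser parameter maps keyed by class. Everything in it is an assumption for illustration: `ParamContext`, `PdfLikeParser`, `RtfLikeParser` and the `encoding` parameter are hypothetical stand-ins, not Tika's `ParseContext`, `PDFParser` or `RTFParser`, and the sketch says nothing about how the real classes would be changed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class PerParserParamsSketch {

    /** Hypothetical stand-in for a context holding one parameter map per parser class. */
    static final class ParamContext {
        private final Map<Class<?>, Map<String, String>> paramsByParser = new HashMap<>();

        void set(Class<?> parserClass, Map<String, String> params) {
            paramsByParser.put(parserClass, params);
        }

        /** Returns the map registered for the given class, or an empty map if none was set. */
        Map<String, String> get(Class<?> parserClass) {
            Map<String, String> params = paramsByParser.get(parserClass);
            return params == null ? Collections.<String, String>emptyMap() : params;
        }
    }

    /** Hypothetical parser that only reads the parameters keyed to its own class. */
    static final class PdfLikeParser {
        String encodingFor(ParamContext context) {
            String value = context.get(PdfLikeParser.class).get("encoding");
            return value == null ? "default" : value;
        }
    }

    /** Second hypothetical parser using the same parameter name without clashing. */
    static final class RtfLikeParser {
        String encodingFor(ParamContext context) {
            String value = context.get(RtfLikeParser.class).get("encoding");
            return value == null ? "default" : value;
        }
    }

    public static void main(String[] args) {
        ParamContext context = new ParamContext();
        context.set(PdfLikeParser.class, Collections.singletonMap("encoding", "UTF-8"));
        context.set(RtfLikeParser.class, Collections.singletonMap("encoding", "windows-1252"));
        // Each parser sees only the value registered under its own class.
        System.out.println(new PdfLikeParser().encodingFor(context));
        System.out.println(new RtfLikeParser().encodingFor(context));
    }
}
```

Keying by class rather than by a flat parameter name is what lets both parsers use a parameter called `encoding` in this sketch without interfering with each other.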
[jira] [Comment Edited] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187784#comment-15187784 ]

Tim Allison edited comment on TIKA-1508 at 3/9/16 8:04 PM:
---

bq. Maybe not too complex, but not as a start Just my 2c.

The reason I propose this to start is so that we don't have to worry about changing our config and backward compatibility ... :)

bq. I think the Solr way is complex to implement, considering that we don't gain much for the effort (as of now we can just do Integer.parseInt() or similar). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration.

I think we gain quite a bit. The reason I suggested it is tied to 3)...What we would gain is automatic type checking/verification on loading from the config file. If the configurator were something like this:
{code}
public static void configure(Configurable configurable, Map<String, ParamValue> params)
        throws TikaConfigParameterException {
    for (String k : params.keySet()) {
        // capitalize the first character to build the setter name
        String setterName = "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
        try {
            ParamValue v = params.get(k);
            // look up the setter whose parameter type matches the declared type
            Method method = configurable.getClass().getDeclaredMethod(setterName, v.getTypeClass());
            switch (v.getType()) {
                case BOOLEAN:
                    method.invoke(configurable, v.getBoolean());
                    break;
                case INTEGER:
                    method.invoke(configurable, v.getInteger());
                    break;
                //...
            }
        } catch (Exception e) {
            throw new TikaConfigParameterException("Exception with parameter: " + k
                    + " with class: " + configurable.getClass(), e);
        }
    }
}
{code}

Then each parser that had configurations wouldn't have to register its configurable parameters (strike that suggestion above :) ), but there would be an exception at creation time if the {{setN}} method with a correctly typed parameter didn't exist.

In short, a small bit of code at the outset, but each parser wouldn't then have to repeat the {{parseInt}} and handle NumberFormatExceptions, etc. Each configurable parser wouldn't have to worry about configuration at all, except to have appropriate setters.

was (Author: talli...@mitre.org):

bq. Maybe not too complex, but not as a start Just my 2c.

The reason I propose this to start is so that we don't have to worry about changing our config and backward compatibility ... :)

bq. I think the Solr way is complex to implement, considering that we don't gain much for the effort (as of now we can just do Integer.parseInt() or similar). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration.

I think we gain quite a bit. The reason I suggested it is tied to 3)...What we would gain is automatic type checking/verification on loading from the config file. If the configurator were something like this:
{code}
public static void configure(Configurable configurable, Map<String, ParamValue> params)
        throws TikaConfigParameterException {
    for (String k : params.keySet()) {
        //camel case the first character
        String setterName = "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
        try {
            Method method = configurable.getClass().getDeclaredMethod(setterName, Boolean.class);
            ParamValue v = params.get(k);
            switch (v.getType()) {
                case BOOLEAN:
                    method.invoke(configurable, v.getBoolean());
                    break;
                case INTEGER:
                    method.invoke(configurable, v.getInteger());
                    break;
                //...
            }
        } catch (Exception e) {
            throw new TikaConfigParameterException("Exception with parameter: " + k
                    + " with class: " + configurable.getClass(), e);
        }
    }
}
{code}

Then each parser that had configurations wouldn't have to register its configurable parameters (strike that suggestion above :) ), but there would be an exception at creation time if the {{setN}} method with a correctly typed parameter didn't exist.

In short, a small bit of code at the outset, but each parser wouldn't then have to repeat the {{parseInt}} and handle NumberFormatExceptions, etc. Each configurable parser wouldn't have to worry about configuration at all, except to have appropriate setters.
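The reflective configurator discussed above can be tried out in isolation. The following is a self-contained sketch under stated assumptions, not the code from the pull request: `ParamValue` is cut down to two types, `ConfigException` stands in for the proposed `TikaConfigParameterException`, and `DemoParser` with its `maxPages` and `sortByPosition` setters is a hypothetical target invented for the example.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ReflectiveConfigSketch {

    /** Cut-down typed value; only BOOLEAN and INTEGER are handled in this sketch. */
    static final class ParamValue {
        enum Type { BOOLEAN, INTEGER }

        final Type type;
        final String val;

        ParamValue(Type type, String val) {
            this.type = type;
            this.val = val;
        }

        /** Parameter class that the matching setter is expected to declare. */
        Class<?> getTypeClass() {
            return type == Type.BOOLEAN ? boolean.class : int.class;
        }

        /** Converts the raw string, throwing an unchecked exception on malformed input. */
        Object typedValue() {
            if (type == Type.BOOLEAN) {
                if ("true".equals(val)) return Boolean.TRUE;
                if ("false".equals(val)) return Boolean.FALSE;
                throw new IllegalArgumentException("Couldn't parse " + val + " as a boolean");
            }
            return Integer.valueOf(val); // NumberFormatException on bad input
        }
    }

    /** Hypothetical stand-in for the proposed TikaConfigParameterException. */
    static final class ConfigException extends Exception {
        ConfigException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Finds set<Name>(type) for each parameter and invokes it with the converted value. */
    static void configure(Object target, Map<String, ParamValue> params) throws ConfigException {
        for (Map.Entry<String, ParamValue> entry : params.entrySet()) {
            String k = entry.getKey();
            ParamValue v = entry.getValue();
            // capitalize the first character to build the setter name
            String setterName = "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
            try {
                Method method = target.getClass().getDeclaredMethod(setterName, v.getTypeClass());
                method.invoke(target, v.typedValue());
            } catch (Exception e) {
                // missing setter, wrong parameter type, or unparsable value
                throw new ConfigException("Exception with parameter: " + k
                        + " with class: " + target.getClass(), e);
            }
        }
    }

    /** Hypothetical configurable target with plain typed setters. */
    static final class DemoParser {
        private int maxPages = 10;
        private boolean sortByPosition;

        public void setMaxPages(int maxPages) { this.maxPages = maxPages; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        int getMaxPages() { return maxPages; }
        boolean isSortByPosition() { return sortByPosition; }
    }

    public static void main(String[] args) throws ConfigException {
        Map<String, ParamValue> params = new LinkedHashMap<>();
        params.put("maxPages", new ParamValue(ParamValue.Type.INTEGER, "2"));
        params.put("sortByPosition", new ParamValue(ParamValue.Type.BOOLEAN, "true"));
        DemoParser parser = new DemoParser();
        configure(parser, params);
        System.out.println(parser.getMaxPages() + " " + parser.isSortByPosition());
    }
}
```

In this sketch an unknown key, a setter with a different parameter type, and a value that cannot be converted all surface as the same exception at configuration time, which is the fail-early behaviour the comment argues for.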
[jira] [Commented] (TIKA-1657) Allow easier XML serialization of TikaConfig
[ https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187699#comment-15187699 ]

Thamme Gowda N commented on TIKA-1657:
--

[~talli...@mitre.org] [~gagravarr] [~chrismattmann]
I am wondering if you have considered the option of creating model classes for all the configuration elements, and then using JAXB to convert to and from XML for (de)serialization?

> Allow easier XML serialization of TikaConfig
>
>
> Key: TIKA-1657
> URL: https://issues.apache.org/jira/browse/TIKA-1657
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1558-blacklist-effective.xml, TIKA-1657v1.patch
>
>
> In TIKA-1418, we added an example for how to dump the config file so that
> users could easily modify it. I think we should go further and make this an
> option at the tika-core level with hooks for tika-app and tika-server. I
> propose adding a main() to TikaConfig that will print the xml config file
> that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by
> without having to download tika-app separately.
> There's every chance that I've not accounted for issues with dynamic loading
> etc. Also, I'd be ok with only having this available in tika-app and
> tika-server if there are good reasons.
> Feedback?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187683#comment-15187683 ]

Thamme Gowda N commented on TIKA-1508:
--

1. Please let me know the final verdict once all of you agree on one approach; I will make changes as per the recommendation.

2. +1. Agreed. I will update the code.

3. I really like the suggestion. That would allow us to validate parameters and fail early when they are wrong. But I think it requires a lot of rework on the side of the parsers as well. Parsers have to declare which params they expect from the configuration file; only after that will we be able to validate. Another simple/lazy approach is to assume all params are valid, pass all of them, and let the parser raise an exception when there are errors. The current PR takes the latter approach. Let me know what you think.

4. +1. Agreed. Will update the code.

5. Anything that extends AbstractParser is now an instance of Configurable. Anything that is an instance of Configurable will be checked and invoked with params while it is being instantiated. So ParserDecorator, DelegatingParser and ParserPostProcessor are all covered, yay!! If no params are found in the config file, a call is made with an empty Map. It is now up to the implementations of these parsers to make use of the params by overriding the configure() method.

A & B) I think the Solr way is complex to implement, considering that we don't gain much for the effort (as of now we can just do Integer.parseInt() or similar). Plus it introduces ambiguities with the type expected by parsers and the values supplied from configuration.

That being said, I am open to all suggestions.
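The "simple/lazy" approach in points 3 and 5 above can be sketched roughly as follows. This is an illustration under assumptions, not the pull request itself: the `Configurable` interface shape, `BaseParser`, `PagedParser`, the `instantiate` helper and the `maxPages` key are all hypothetical names chosen for the example.

```java
import java.util.Collections;
import java.util.Map;

final class LazyConfigureSketch {

    /** Hypothetical shape of the Configurable contract: one call with string params. */
    interface Configurable {
        void configure(Map<String, String> params);
    }

    /** Hypothetical base class: configurable by default, but ignores all params. */
    static abstract class BaseParser implements Configurable {
        @Override
        public void configure(Map<String, String> params) {
            // no-op; subclasses override this to pick up the params they care about
        }
    }

    /** Hypothetical parser that parses its own param and raises on bad input. */
    static final class PagedParser extends BaseParser {
        private int maxPages = 10;

        @Override
        public void configure(Map<String, String> params) {
            String raw = params.get("maxPages");
            if (raw == null) {
                return; // keep the default
            }
            try {
                maxPages = Integer.parseInt(raw.trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("maxPages must be an integer, got: " + raw, e);
            }
        }

        int getMaxPages() {
            return maxPages;
        }
    }

    /** Mimics the loader: always calls configure(), with an empty map when no params exist. */
    static <T extends Configurable> T instantiate(T parser, Map<String, String> paramsOrNull) {
        Map<String, String> params = paramsOrNull == null
                ? Collections.<String, String>emptyMap()
                : paramsOrNull;
        parser.configure(params);
        return parser;
    }

    public static void main(String[] args) {
        PagedParser defaults = instantiate(new PagedParser(), null);
        PagedParser tuned = instantiate(new PagedParser(), Collections.singletonMap("maxPages", "2"));
        System.out.println(defaults.getMaxPages() + " " + tuned.getMaxPages());
    }
}
```

The trade-off is visible in `PagedParser.configure()`: nothing checks the keys up front, so every parser carries its own parsing and error handling, and a misspelled key is silently ignored.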
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187290#comment-15187290 ] Tim Allison commented on TIKA-1663: --- [~gagravarr], am I right in that we cannot do this now: {code} {code} > Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata > --- > > Key: TIKA-1663 > URL: https://issues.apache.org/jira/browse/TIKA-1663 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Attachments: digesting_parser_v1.patch > > > It might be useful to integrate commons' DigestUtils and allow users to > easily add the MD5 or other supported hashes to the Metadata object. > Anyone else find this of use? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187205#comment-15187205 ]

Nick Burch commented on TIKA-1508:
--

> I think that's exactly what ParseContext should be for..it should be a
> vehicle for Param passing. We can delineate by property name (FQ) and/or by
> class.

I view {{ParseContext}} as somewhere you configure things on a per-document basis, not a per-parser basis.

So, need to set where Tesseract lives on your system? That applies to everything, so it goes on the parser. Need to tell Tesseract to use a German rather than an English dictionary for this particular JPEG? That applies to just the one document being parsed, so it goes on the {{ParseContext}}.
[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187196#comment-15187196 ]

Chris A. Mattmann commented on TIKA-1508:
-

Tim and Thamme:

bq. 1) Let's not use ParseContext as the vehicle for param passing, we will have collisions with different parsers if anyone uses configure() outside of the normal course of events...it is simpler to use Map. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. context.set(PDFParser.class, Map params). I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.

I think that's exactly what ParseContext should be for..it should be a vehicle for Param passing. We can delineate by fully qualified (FQ) property name and/or by class.

bq. 4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one.

+1

bq. A) Are we ok with Map parameters? Or should we follow, say, Solr's syntax for type checking?

Yes, I'm OK with Map.

bq. B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a configure(Configurable configurable, Map params) method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?

Maybe not too complex, but not as a start :) Just my 2c.
[jira] [Comment Edited] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117 ]

Tim Allison edited comment on TIKA-1508 at 3/9/16 1:56 PM:
---

[~thammegowda], this looks really good. I merged it on a local branch and made minimal modifications to the PDFParser to make this work...and it did...very straightforwardly.

Recommendations:

1) Let's not use ParseContext as the vehicle for param passing; we will have collisions between different parsers if anyone uses {{configure()}} outside of the normal course of events...it is simpler to use Map. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. {{context.set(PDFParser.class, Map params)}}. I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.

2) We need to add a {{Map getParams()}} to the {{Configurable}} interface so that when we serialize the config to XML, we can remember what the params were. We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking to {{AbstractParser}} or somewhere else. I think a configurable (parser? or all configurables?) should need to register valid configuration keys at initialization, and then we can check the validity of the keys passed in during {{configure()}} once in the base class so that each extending parser isn't required to do this on its own.

4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration information to the wrapped parser in parser wrappers: ParserDecorator, DelegatingParser, ParserPostProcessor...any others? Or, do we need to set the parameters in the wrapped parser before wrapping?

Questions for the broader dev community:

A) Are we ok with Map parameters? Or should we follow, say, Solr's syntax for type checking?
{noformat}
10
{noformat}

B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a {{configure(Configurable configurable, Map params)}} method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?

was (Author: talli...@mitre.org):

[~thammegowda], this looks really good. I merged it on a local branch and made minimal modifications to the PDFParser to make this work...and it did...very straightforwardly.

Recommendations:

1) Let's not use ParseContext as the vehicle for param passing; we will have collisions between different parsers if anyone uses {{configure()}} outside of the normal course of events...it is simpler to use Map. Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. {{context.set(PDFParser.class, Map params)}}. I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.

2) We need to add a {{Map getParams()}} to the {{Configurable}} interface so that when we serialize the config to XML, we can remember what the params were. We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking to {{AbstractParser}} or somewhere else. I think a configurable (parser? or all configurables?) should need to register valid configuration keys at initialization, and then we can check the validity of the keys passed in during {{configure()}} once in the base class so that each extending parser isn't required to do this on its own.

4) Let's subclass TikaException for TikaParameterConfigException? I don't feel strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration information to the wrapped parser in parser wrappers: ParserDecorator, DelegatingParser, ParserPostProcessor...any others? Or, do we need to set the parameters in the wrapped parser before wrapping?

Questions for the broader dev community:

A) Are we ok with Map parameters? Or should we follow, say, Solr's syntax for type checking?
{{noformat}}
10
{{noformat}}

B) We could use reflection to get around each parser having to add its own configuration code. We could create a static configurator that has a {{configure(Configurable configurable, Map params)}} method. That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines. Too complex?
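Recommendation 3 above (register the valid keys once, validate once in the base class) might look something like the sketch below. It is only one possible reading of the suggestion: `ValidatingConfigurable`, `applyParams`, `DemoParser` and the key names are hypothetical, and nothing here is taken from `AbstractParser` or the pull request.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class KeyValidationSketch {

    /** Hypothetical base class: subclasses register their valid keys at construction. */
    static abstract class ValidatingConfigurable {
        private final Set<String> validKeys;

        protected ValidatingConfigurable(String... validKeys) {
            this.validKeys = Collections.unmodifiableSet(
                    new LinkedHashSet<>(Arrays.asList(validKeys)));
        }

        /** Checks every key once here, then hands the params to the subclass. */
        public final void configure(Map<String, String> params) {
            for (String key : params.keySet()) {
                if (!validKeys.contains(key)) {
                    throw new IllegalArgumentException("Unknown parameter '" + key + "' for "
                            + getClass().getSimpleName() + "; valid keys: " + validKeys);
                }
            }
            applyParams(params);
        }

        /** Called only with params whose keys have already been validated. */
        protected abstract void applyParams(Map<String, String> params);
    }

    /** Hypothetical parser that declares two keys and applies one of them. */
    static final class DemoParser extends ValidatingConfigurable {
        private boolean sortByPosition;

        DemoParser() {
            super("sortByPosition", "maxPages");
        }

        @Override
        protected void applyParams(Map<String, String> params) {
            if (params.containsKey("sortByPosition")) {
                sortByPosition = Boolean.parseBoolean(params.get("sortByPosition"));
            }
        }

        boolean isSortByPosition() {
            return sortByPosition;
        }
    }

    public static void main(String[] args) {
        DemoParser parser = new DemoParser();
        parser.configure(Collections.singletonMap("sortByPosition", "true"));
        System.out.println(parser.isSortByPosition());
        try {
            // misspelled key: rejected by the base class before the parser sees it
            new DemoParser().configure(Collections.singletonMap("sortByPostion", "true"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Making `configure()` final in the base class is a design choice of this sketch: it guarantees the key check cannot be skipped, at the cost of forcing subclasses through a second hook method.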