[jira] [Created] (TIKA-1897) Too many daemon threads when NamedEntityParser is enabled

2016-03-09 Thread Manali Shah (JIRA)
Manali Shah created TIKA-1897:
-

 Summary: Too many daemon threads when NamedEntityParser is enabled
 Key: TIKA-1897
 URL: https://issues.apache.org/jira/browse/TIKA-1897
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.12
 Environment: MAC_OS_X 10.10.5
JDK 1.8.0_45
Tika Version  1.13-SNAPSHOT
Reporter: Manali Shah


Thread Dump:
{code}

"Apache Tika" #2410 daemon prio=5 os_prio=31 tid=0x7fa12cb22800 nid=0x101103 in Object.wait() [0x0001a78c8000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.io.PipedReader.receive(PipedReader.java:185)
- eliminated <0x000797c69830> (a java.io.PipedReader)
at java.io.PipedReader.receive(PipedReader.java:206)
- locked <0x000797c69830> (a java.io.PipedReader)
at java.io.PipedWriter.write(PipedWriter.java:150)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
at org.apache.tika.parser.ner.NamedEntityParser.extractOutput(NamedEntityParser.java:172)
at org.apache.tika.parser.ner.NamedEntityParser.parse(NamedEntityParser.java:154)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:235)
at java.lang.Thread.run(Thread.java:745)

"Apache Tika" #2409 daemon prio=5 os_prio=31 tid=0x7fa12cb21800 nid=0x100f03 in Object.wait() [0x0001a77c5000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.io.PipedReader.receive(PipedReader.java:185)
- eliminated <0x000797a477c8> (a java.io.PipedReader)
at java.io.PipedReader.receive(PipedReader.java:206)
- locked <0x000797a477c8> (a java.io.PipedReader)
at java.io.PipedWriter.write(PipedWriter.java:150)
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.tika

[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187816#comment-15187816
 ] 

Tim Allison commented on TIKA-1508:
---

sketch of ParamValue...

{code}
public class ParamValue {

    enum Type {
        BOOLEAN,
        INTEGER,
        LONG,
        FLOAT,
        DOUBLE,
        STRING
    }

    final Type type;
    final String val;

    public ParamValue(int intVal) {
        this.type = Type.INTEGER;
        this.val = Integer.toString(intVal);
    }

    //...

    public ParamValue(Type type, String value) {
        this.type = type;
        this.val = value;
    }

    public boolean getBoolean() {
        // refuse to reinterpret a value that was declared with another type
        if (type != Type.BOOLEAN) {
            throw new IllegalArgumentException("can't cast a " + type + " to a boolean");
        }
        if ("true".equals(val)) {
            return true;
        } else if ("false".equals(val)) {
            return false;
        }
        throw new IllegalArgumentException(
                "Couldn't parse " + val + " as a boolean; must be 'true' or 'false'");
    }

    //...
}
{code}
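
For illustration only, here is a self-contained variant of the sketch above that also shows how a typed element name from the config file could select the type, so that a mismatch fails at load time. The class name TypedParam, the method typeFromTag, and the tag spellings "bool", "int", and "str" are assumptions made for this example; none of them comes from the Tika code base.

```java
import java.util.Locale;

// Hypothetical, self-contained variant of the ParamValue sketch above.
final class TypedParam {

    enum Type { BOOLEAN, INTEGER, STRING }

    private final Type type;
    private final String val;

    TypedParam(Type type, String val) {
        this.type = type;
        this.val = val;
    }

    // Maps an element name from the config file to a Type.
    // The tag spellings are assumptions for this example.
    static Type typeFromTag(String tag) {
        switch (tag.toLowerCase(Locale.ROOT)) {
            case "bool": return Type.BOOLEAN;
            case "int":  return Type.INTEGER;
            case "str":  return Type.STRING;
            default:
                throw new IllegalArgumentException("unknown parameter type: " + tag);
        }
    }

    boolean getBoolean() {
        requireType(Type.BOOLEAN);
        if ("true".equals(val)) {
            return true;
        }
        if ("false".equals(val)) {
            return false;
        }
        throw new IllegalArgumentException(
                "Couldn't parse " + val + " as a boolean; must be 'true' or 'false'");
    }

    int getInteger() {
        requireType(Type.INTEGER);
        try {
            return Integer.parseInt(val);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Couldn't parse " + val + " as an integer", e);
        }
    }

    // Rejects reads that do not match the declared type.
    private void requireType(Type expected) {
        if (type != expected) {
            throw new IllegalArgumentException("can't cast a " + type + " to a " + expected);
        }
    }
}
```

With this, new TypedParam(TypedParam.typeFromTag("int"), "2").getInteger() yields 2, while calling getBoolean() on the same value throws an IllegalArgumentException.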

> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187805#comment-15187805
 ] 

Tim Allison commented on TIKA-1508:
---

[~gagravarr] and [~chrismattmann], I agree about the distinction, but if we can 
use the same mechanism for both, let's do it?

What do you think of a compromise...

A user can send a Map of parameters into the ParseContext, keyed by 
the class of the parser that is supposed to use them.

Want to set pdf config:

{code}
context.set(PDFParser.class, pdfParams)
{code}

How about parameters for the RTFParser with clashing parameter names? No 
problem:

{code}
context.set(RTFParser.class, rtfParams)
{code}

Each configurable parser would then be responsible for checking the context to 
see if there is a value for its own class, and then setting its own parameters.
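
A minimal sketch of that compromise, assuming string-valued parameters. The Context class below is a stand-in written for this example and is not Tika's ParseContext; PdfLikeParser, RtfLikeParser, and the parameter name "mode" are likewise hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: parameters are stored per parser class, so two parsers
// can use the same parameter name without colliding.
final class PerParserParamsSketch {

    // Stand-in for a context object; NOT Tika's ParseContext.
    static final class Context {
        private final Map<Class<?>, Map<String, String>> paramsByParser = new HashMap<>();

        void set(Class<?> parserClass, Map<String, String> params) {
            paramsByParser.put(parserClass, params);
        }

        // Returns an empty map when nothing was set for the given parser class.
        Map<String, String> get(Class<?> parserClass) {
            Map<String, String> params = paramsByParser.get(parserClass);
            return params == null ? Collections.<String, String>emptyMap() : params;
        }
    }

    static final class PdfLikeParser {
        // Each parser looks up only the entry keyed by its own class.
        String mode(Context context) {
            String mode = context.get(PdfLikeParser.class).get("mode");
            return mode == null ? "default" : mode;
        }
    }

    static final class RtfLikeParser {
        String mode(Context context) {
            String mode = context.get(RtfLikeParser.class).get("mode");
            return mode == null ? "default" : mode;
        }
    }
}
```

Both parsers read a parameter called "mode", but each sees only the value stored under its own class, and a parser with no entry falls back to its default.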




[jira] [Comment Edited] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187784#comment-15187784
 ] 

Tim Allison edited comment on TIKA-1508 at 3/9/16 8:04 PM:
---

bq. Maybe not too complex, but not as a start  Just my 2c.

The reason I propose this to start is so that we don't have to worry about 
changing our config and backward compatibility ... :)


bq. I think the Solr way is complex to implement, considering that we don't gain 
much for the effort (as of now we can just do Integer.parseInt() or similar). Plus, 
it introduces ambiguities between the type expected by parsers and the values 
supplied from configuration.

I think we gain quite a bit.  The reason I suggested it is tied to 3)...What we 
would gain is automatic type checking/verification on loading from the config 
file.

If the configurator were something like this:

{code}
public static void configure(Configurable configurable, Map<String, ParamValue> params)
        throws TikaConfigParameterException {
    for (String k : params.keySet()) {
        // upper-case the first character to build the setter name
        String setterName =
                "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
        try {
            ParamValue v = params.get(k);
            // look up the setter by the parameter's declared type
            Method method =
                    configurable.getClass().getDeclaredMethod(setterName, v.getTypeClass());
            switch (v.getType()) {
                case BOOLEAN:
                    method.invoke(configurable, v.getBoolean());
                    break;
                case INTEGER:
                    method.invoke(configurable, v.getInteger());
                    break;
                //...
            }
        } catch (Exception e) {
            throw new TikaConfigParameterException("Exception with parameter: " + k
                    + " with class: " + configurable.getClass(), e);
        }
    }
}
{code}

Then each parser that had configurations wouldn't have to register its 
configurable parameters (strike that suggestion above :) ), but there would be 
an exception at creation time if the {{setN}} method with a correctly typed 
parameter didn't exist.

In short, a small bit of code at the outset, but each parser wouldn't then have 
to repeat the {{parseInt}} call and handle NumberFormatExceptions, etc.  Each 
configurable parser wouldn't have to worry about configuration at all, except 
to have appropriate setters.
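
To make the idea concrete, here is a runnable sketch of a reflection-based configurator, under stated assumptions. ParamValue is reduced to two types, DemoParser and its setters are invented for the example, and IllegalArgumentException stands in for the proposed TikaConfigParameterException. It uses getMethod (public setters only) rather than getDeclaredMethod.

```java
import java.lang.reflect.Method;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of type-checked, reflection-based configuration.
final class ReflectiveConfiguratorSketch {

    enum Type {
        BOOLEAN(boolean.class),
        INTEGER(int.class);

        final Class<?> javaType; // setter parameter type to look up

        Type(Class<?> javaType) {
            this.javaType = javaType;
        }
    }

    static final class ParamValue {
        final Type type;
        final String val;

        ParamValue(Type type, String val) {
            this.type = type;
            this.val = val;
        }

        // Converts the raw string according to the declared type.
        Object typedValue() {
            switch (type) {
                case BOOLEAN:
                    if ("true".equals(val)) return Boolean.TRUE;
                    if ("false".equals(val)) return Boolean.FALSE;
                    throw new IllegalArgumentException("Couldn't parse " + val + " as a boolean");
                case INTEGER:
                    return Integer.valueOf(val); // NumberFormatException on bad input
                default:
                    throw new IllegalStateException("unhandled type: " + type);
            }
        }
    }

    // Invented target class: it only needs appropriately typed setters.
    static final class DemoParser {
        private int maxPages;
        private boolean sortByPosition;

        public void setMaxPages(int maxPages) { this.maxPages = maxPages; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getMaxPages() { return maxPages; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    static void configure(Object target, Map<String, ParamValue> params) {
        for (Map.Entry<String, ParamValue> entry : params.entrySet()) {
            String k = entry.getKey();
            ParamValue v = entry.getValue();
            // upper-case the first character to build the setter name
            String setterName = "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
            try {
                // fails if no setter with this name and parameter type exists
                Method setter = target.getClass().getMethod(setterName, v.type.javaType);
                setter.invoke(target, v.typedValue());
            } catch (ReflectiveOperationException | IllegalArgumentException e) {
                throw new IllegalArgumentException(
                        "Exception with parameter: " + k + " with class: " + target.getClass(), e);
            }
        }
    }
}
```

Passing a "maxPages" parameter declared as INTEGER calls setMaxPages(int); declaring the same parameter as BOOLEAN fails immediately because no setMaxPages(boolean) exists, which is the fail-early behaviour described above.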





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187784#comment-15187784
 ] 

Tim Allison commented on TIKA-1508:
---

bq. Maybe not too complex, but not as a start  Just my 2c.

The reason I propose this to start is so that we don't have to worry about 
changing our config and backward compatibility ... :)


bq. I think the Solr way is complex to implement, considering that we don't gain 
much for the effort (as of now we can just do Integer.parseInt() or similar). Plus, 
it introduces ambiguities between the type expected by parsers and the values 
supplied from configuration.

I think we gain quite a bit.  The reason I suggested it is tied to 3)...What we 
would gain is automatic type checking/verification on loading from the config 
file.

If the configurator were something like this:

{code}
public static void configure(Configurable configurable, Map<String, ParamValue> params)
        throws TikaConfigParameterException {
    for (String k : params.keySet()) {
        // upper-case the first character to build the setter name
        String setterName =
                "set" + k.substring(0, 1).toUpperCase(Locale.ENGLISH) + k.substring(1);
        try {
            Method method =
                    configurable.getClass().getDeclaredMethod(setterName, Boolean.class);
            ParamValue v = params.get(k);
            switch (v.getType()) {
                case BOOLEAN:
                    method.invoke(configurable, v.getBoolean());
                    break;
                case INTEGER:
                    method.invoke(configurable, v.getInteger());
                    break;
                //...
            }
        } catch (Exception e) {
            throw new TikaConfigParameterException("Exception with parameter: " + k
                    + " with class: " + configurable.getClass(), e);
        }
    }
}
{code}

Then each parser that had configurations wouldn't have to register its 
configurable parameters (strike that suggestion above :) ), but there would be 
an exception at creation time if the {{setN}} method with a correctly typed 
parameter didn't exist.

In short, a small bit of code at the outset, but each parser wouldn't then have 
to repeat the {{parseInt}} call and handle NumberFormatExceptions, etc.  Each 
configurable parser wouldn't have to worry about configuration at all, except 
to have appropriate setters.





[jira] [Commented] (TIKA-1657) Allow easier XML serialization of TikaConfig

2016-03-09 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187699#comment-15187699
 ] 

Thamme Gowda N commented on TIKA-1657:
--

[~talli...@mitre.org][~gagravarr][~chrismattmann]
I am wondering whether you have considered creating model classes for 
all the configuration elements, and then using JAXB to convert 
to and from XML for (de)serialization?


> Allow easier XML serialization of TikaConfig
> 
>
> Key: TIKA-1657
> URL: https://issues.apache.org/jira/browse/TIKA-1657
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1558-blacklist-effective.xml, TIKA-1657v1.patch
>
>
> In TIKA-1418, we added an example for how to dump the config file so that 
> users could easily modify it.  I think we should go further and make this an 
> option at the tika-core level with hooks for tika-app and tika-server.  I 
> propose adding a main() to TikaConfig that will print the xml config file 
> that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by 
> without having to download tika-app separately.  
> There's every chance that I've not accounted for issues with dynamic loading 
> etc.  Also, I'd be ok with only having this available in tika-app and 
> tika-server if there are good reasons.
> Feedback?





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187683#comment-15187683
 ] 

Thamme Gowda N commented on TIKA-1508:
--

1. Please let me know the final verdict once all of you agree on one approach; I 
will make changes as per the recommendation.

2. +1. Agreed. I will update the code.

3. I really like the suggestion. That would allow us to validate parameters 
and fail early when they are wrong. 
But I think it requires a lot of rework on the parser side as well: parsers 
have to declare what params they expect from the configuration file, and only 
then will we be able to validate. Another simple/lazy approach 
is to assume all params are valid, pass them all through, and let the 
parser raise an exception when there are errors. The current PR takes the latter 
approach. Let me know what you think.

4. +1. Agreed. Will update the code.

5. Anything that extends AbstractParser is now an instance of Configurable. 
Anything that is an instance of Configurable will be checked and invoked with 
params during instantiation. So ParserDecorator, DelegatingParser, and 
ParserPostProcessor are all covered, yay!! If no params are found in the config 
file, a call is made with an empty Map. It is now up to the 
implementation of these parsers to make use of the params by overriding the 
configure() method. 

A & B) I think the Solr way is complex to implement, considering that we don't 
gain much for the effort (as of now we can just do Integer.parseInt() or similar). 
Plus, it introduces ambiguities between the type expected by parsers and the values 
supplied from configuration.


That said, I am open to all suggestions.
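
As a rough illustration of the "lazy" approach from point 3 combined with the hook from point 5, here is a sketch under stated assumptions. The Configurable interface shape, BaseParser, PagedParser, the parameter name "maxPages", and the use of string-valued parameters are all guesses made for this example, not the contents of the PR.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch: every parser is configurable, params are passed through
// unvalidated, and each parser raises an error for values it cannot use.
final class ConfigurableSketch {

    interface Configurable {
        void configure(Map<String, String> params);
    }

    // Stand-in for a common base class; ignores params by default.
    abstract static class BaseParser implements Configurable {
        @Override
        public void configure(Map<String, String> params) {
            // parsers without parameters need no code at all
        }
    }

    static final class PagedParser extends BaseParser {
        int maxPages = 10; // default used when the config file has no value

        @Override
        public void configure(Map<String, String> params) {
            String raw = params.get("maxPages");
            if (raw == null) {
                return;
            }
            try {
                maxPages = Integer.parseInt(raw);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("maxPages must be an integer, got: " + raw, e);
            }
        }
    }

    // Called once per parser at instantiation; a missing params block becomes an empty map.
    static <T extends Configurable> T instantiate(T parser, Map<String, String> paramsOrNull) {
        Map<String, String> params =
                paramsOrNull == null ? Collections.<String, String>emptyMap() : paramsOrNull;
        parser.configure(params);
        return parser;
    }
}
```

The trade-off is visible here: the loader needs no knowledge of parameter names or types, but every parser that takes parameters repeats its own parsing and error handling.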






[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187290#comment-15187290
 ] 

Tim Allison commented on TIKA-1663:
---

[~gagravarr], am I right in that we cannot do this now:
{code}
 


 
{code}



> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> ---
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?
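
A minimal sketch of the idea, for illustration only. It uses the JDK's java.security.MessageDigest rather than commons-codec's DigestUtils, a plain Map stands in for Tika's Metadata object, and the "digest:" key prefix is an assumption made for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compute digests of a document's bytes and store them
// as metadata fields.
final class DigestSketch {

    // Returns the lower-case hex digest of the data for the given algorithm name.
    static String hexDigest(String algorithm, byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported digest algorithm: " + algorithm, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Builds one metadata entry per requested algorithm; the key format is assumed.
    static Map<String, String> digestMetadata(byte[] data, String... algorithms) {
        Map<String, String> metadata = new HashMap<>();
        for (String algorithm : algorithms) {
            metadata.put("digest:" + algorithm, hexDigest(algorithm, data));
        }
        return metadata;
    }
}
```

For example, digestMetadata(bytes, "MD5", "SHA-256") returns a map with one hex string per algorithm. A real DigestingParser would have to read the stream once for hashing and still hand it to the wrapped parser, which this sketch does not attempt.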





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187205#comment-15187205
 ] 

Nick Burch commented on TIKA-1508:
--

> I think that's exactly what ParseContext should be for..it should be a 
> vehicle for Param passing. We can delineate by property name (FQ) and/or by 
> class.

I view {{ParseContext}} as somewhere you configure things on a per-document 
basis, not a per-parser basis. 

So, need to set where Tesseract lives on your system? That applies to everything, 
so it goes on the parser. Need to tell Tesseract to use a German rather than an 
English dictionary on this particular JPEG? That applies to just the one document 
being parsed, so it goes on the {{ParseContext}}.
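
The distinction can be sketched as follows. All names here (OcrLikeParser, DocumentOptions, the fields, and the format of the returned string) are hypothetical and only illustrate the two scopes; this is not Tika's Tesseract integration.

```java
// Hypothetical sketch: parser-wide settings live on the parser, per-document
// overrides travel with each parse call.
final class ConfigScopeSketch {

    // Per-document options; stand-in for something passed via a ParseContext.
    static final class DocumentOptions {
        final String ocrLanguage; // null means "use the parser's default"

        DocumentOptions(String ocrLanguage) {
            this.ocrLanguage = ocrLanguage;
        }
    }

    static final class OcrLikeParser {
        private final String executablePath;  // same for every document
        private final String defaultLanguage; // used when a document has no override

        OcrLikeParser(String executablePath, String defaultLanguage) {
            this.executablePath = executablePath;
            this.defaultLanguage = defaultLanguage;
        }

        // Returns a description of the effective settings for one document.
        String describeParse(String fileName, DocumentOptions options) {
            String language = (options != null && options.ocrLanguage != null)
                    ? options.ocrLanguage
                    : defaultLanguage;
            return "path=" + executablePath + ", lang=" + language + ", file=" + fileName;
        }
    }
}
```

One parser instance configured with the executable path and an English default can then handle a German scan by passing new DocumentOptions("deu") for that call only, without affecting later documents.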


[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187196#comment-15187196
 ] 

Chris A. Mattmann commented on TIKA-1508:
-

Tim and Thamme:

bq. 1) Let's not use ParseContext as the vehicle for param passing, we will 
have collisions with different parsers if anyone uses configure() outside of 
the normal course of events...it is simpler to use Map. Or, if 
we do use the ParseContext, we should specify which parser the params are for, 
e.g. context.set(PDFParser.class, Map params). I do like the 
dual use of configure with ParseContext to achieve Nick's recommendation 
elegantly.

I think that's exactly what ParseContext should be for..it should be a vehicle 
for Param passing. We can delineate by property name (FQ) and/or by class.

bq. 4) Let's subclass TikaException for TikaParameterConfigException? I don't 
feel strongly about this one.

+1

bq. A) Are we ok with Map parameters? Or should we follow, say, 
Solr's syntax for type checking?

Yes I'm OK with Map

bq. B) We could use reflection to get around each parser having to add its own 
configuration code. We could create a static configurator that has a 
configure(Configurable configurable, Map params) method. That 
isn't quite right, because we'd have to know the type for each param (see 
above), but something along those lines. Too complex?

Maybe not too complex, but not as a start :) Just my 2c.



[jira] [Comment Edited] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117
 ] 

Tim Allison edited comment on TIKA-1508 at 3/9/16 1:56 PM:
---

[~thammegowda], this looks really good. I merged it on a local branch and made 
minimal modifications to the PDFParser to make this work...and it did...very 
straightforwardly.

Recommendations:
1) Let's not use ParseContext as the vehicle for param passing; we will have 
collisions with different parsers if anyone uses {{configure()}} outside of the 
normal course of events...it is simpler to use Map.  Or, if we 
do use the ParseContext, we should specify which parser the params are for, 
e.g. {{context.set(PDFParser.class, Map params)}}.  I do like 
the dual use of configure with ParseContext to achieve Nick's recommendation 
elegantly.


2) We need to add a {{Map getParams()}} to the {{Configurable}} 
interface so that when we serialize the config to XML, we can remember what the 
params were.  We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking into the {{AbstractParser}} or 
somewhere else?  I think a configurable (parser? or all configurables?) should 
need to register valid configuration keys at initialization, and then we can 
check the validity of the keys passed in during {{configure()}} once in the 
base class so that each extending parser isn't required to do this on its own.

4) Let's subclass TikaException for TikaParameterConfigException?  I don't feel 
strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration 
information to the wrapped parser in parser wrappers: ParserDecorator, 
DelegatingParser, ParserPostProcessor...any others?  Or, do we need to set the 
parameters in the wrapped parser before wrapping?

Questions for the broader dev community:

A) Are we ok with Map parameters? Or should we follow, say, 
Solr's syntax for type checking?
{noformat}
10
{noformat}

B) We could use reflection to get around each parser having to add its own 
configuration code.  We could create a static configurator that has a 
{{configure(Configurable configurable, Map params)}} method.  
That isn't quite right, because we'd have to know the type for each param (see 
above), but something along those lines.  Too complex?



[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117
 ] 

Tim Allison commented on TIKA-1508:
---

[~thammegowda], this looks really good. I merged it on a local branch and made 
minimal modifications to the PDFParser to make this work...and it did...very 
straightforwardly.

Recommendations:
1) Let's not use ParseContext as the vehicle for param passing; we will have 
collisions with different parsers if anyone uses {{configure()}} outside of the 
normal course of events...it is simpler to use Map.  Or, if we 
do use the ParseContext, we should specify which parser the params are for, 
e.g. {{context.set(PDFParser.class, Map params)}}.  I do like 
the dual use of configure with ParseContext to achieve Nick's recommendation 
elegantly.


2) We need to add a {{Map getParams()}} to the {{Configurable}} 
interface so that when we serialize the config to XML, we can remember what the 
params were.  We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking into the {{AbstractParser}} or 
somewhere else?  I think a configurable (parser? or all configurables?) should 
need to register valid configuration keys at initialization, and then we can 
check the validity of the keys passed in during {{configure()}} once in the 
base class so that each extending parser isn't required to do this on its own.

4) Let's subclass TikaException for TikaParameterConfigException?  I don't feel 
strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration 
information to the wrapped parser in parser wrappers: ParserDecorator, 
DelegatingParser, ParserPostProcessor...any others?  Or, do we need to set the 
parameters in the wrapped parser before wrapping?

Questions for the broader dev community:

A) Are we ok with Map parameters? Or should we follow, say, 
Solr's syntax for type checking?
{noformat}
10
{noformat}

B) We could use reflection to get around each parser having to add its own 
configuration code.  We could create a static configurator that has a 
{{configure(Configurable configurable, Map params)}} method.  
That isn't quite right, because we'd have to know the type for each param (see 
above), but something along those lines.  Too complex?
