[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903340#comment-14903340
 ] 

Nick Burch commented on TIKA-1739:
--

I'm not sure that the cTAKES parser should be creating an AutoDetectParser 
internally. Shouldn't the users be wrapping the ctakes one in auto-detection if 
that's what they want to have happen? (That's how it normally works with other 
parsers)

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903252#comment-14903252
 ] 

Chris A. Mattmann commented on TIKA-1739:
-

OK [~totaro] I implemented your solution (see attached patch). I am still 
getting the same results:

# Server Side
{noformat}
[chipotle:~/src/tika-server] mattmann% sh start-ctakes-tika.sh
log4j: reset attribute= "false".
log4j: Threshold ="null".
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressAppender] additivity to [false].
log4j: Level value for ProgressAppender is  [INFO].
log4j: ProgressAppender level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m].
log4j: Adding appender named [noEolAppender] to category [ProgressAppender].
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressDone] additivity to [false].
log4j: Level value for ProgressDone is  [INFO].
log4j: ProgressDone level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m%n].
log4j: Adding appender named [eolAppender] to category [ProgressDone].
log4j: Level value for root is  [INFO].
log4j: root level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%d{dd MMM  HH:mm:ss} %5p 
%c{1} - %m%n].
log4j: Adding appender named [consoleAppender] to category [root].
22 Sep 2015 12:09:29  INFO TikaServerCli - Starting Apache Tika 1.11-SNAPSHOT 
server
22 Sep 2015 12:09:29  INFO TikaServerCli - Using custom config: 
/Users/mattmann/git/ctakesparser-utils/config/tika-config.xml
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/mattmann/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/local/apache-ctakes-3.2.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
Sep 22, 2015 12:09:30 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:/
22 Sep 2015 12:09:30  INFO Server - jetty-8.y.z-SNAPSHOT
22 Sep 2015 12:09:30  INFO AbstractConnector - Started 
SelectChannelConnector@localhost:
22 Sep 2015 12:09:30  INFO TikaServerCli - Started
22 Sep 2015 12:09:35  INFO RecursiveMetadataResource - rmeta/text 
(application/pdf)
22 Sep 2015 12:09:36  INFO ClearNLPDependencyParserAE - using Morphy analysis? 
true
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:

22 Sep 2015 12:09:50  INFO TokenizerAnnotatorPTB - Initializing 
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
22 Sep 2015 12:09:50  INFO ContextDependentTokenizerAnnotator - Finite state 
machines loaded.
22 Sep 2015 12:09:50  INFO ConstituencyParser - Initializing parser...
22 Sep 2015 12:09:53  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
22 Sep 2015 12:09:53  INFO StatusContextAnalyzer - initBoundaryData() called 
for ContextInitializer
22 Sep 2015 12:09:53  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
22 Sep 2015 12:09:53  INFO NegationContextAnalyzer - initBoundaryData() called 
for ContextInitializer
22 Sep 2015 12:09:54  INFO SentenceDetector - Sentence detector model file: 
org/apache/ctakes/core/sentdetect/sd-med-model.zip
22 Sep 2015 12:09:56  INFO POSTagger - POS tagger model file: 
org/apache/ctakes/postagger/models/mayo-pos.zip
Loading configuration.
Loading feature templates.
Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
..
Loading model:
.
Loading model:
...
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:

Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:

Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:

Loading model:
.
Loading model:

Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.

[jira] [Comment Edited] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903252#comment-14903252
 ] 

Chris A. Mattmann edited comment on TIKA-1739 at 9/22/15 7:13 PM:
--

OK [~totaro] I implemented your solution (see attached patch). I am still 
getting the same results:

h2. Server Side
{noformat}
[chipotle:~/src/tika-server] mattmann% sh start-ctakes-tika.sh
log4j: reset attribute= "false".
log4j: Threshold ="null".
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressAppender] additivity to [false].
log4j: Level value for ProgressAppender is  [INFO].
log4j: ProgressAppender level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m].
log4j: Adding appender named [noEolAppender] to category [ProgressAppender].
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [ProgressDone] additivity to [false].
log4j: Level value for ProgressDone is  [INFO].
log4j: ProgressDone level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%m%n].
log4j: Adding appender named [eolAppender] to category [ProgressDone].
log4j: Level value for root is  [INFO].
log4j: root level set to INFO
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%d{dd MMM  HH:mm:ss} %5p 
%c{1} - %m%n].
log4j: Adding appender named [consoleAppender] to category [root].
22 Sep 2015 12:09:29  INFO TikaServerCli - Starting Apache Tika 1.11-SNAPSHOT 
server
22 Sep 2015 12:09:29  INFO TikaServerCli - Using custom config: 
/Users/mattmann/git/ctakesparser-utils/config/tika-config.xml
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/mattmann/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/local/apache-ctakes-3.2.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
Sep 22, 2015 12:09:30 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:/
22 Sep 2015 12:09:30  INFO Server - jetty-8.y.z-SNAPSHOT
22 Sep 2015 12:09:30  INFO AbstractConnector - Started 
SelectChannelConnector@localhost:
22 Sep 2015 12:09:30  INFO TikaServerCli - Started
22 Sep 2015 12:09:35  INFO RecursiveMetadataResource - rmeta/text 
(application/pdf)
22 Sep 2015 12:09:36  INFO ClearNLPDependencyParserAE - using Morphy analysis? 
true
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:

22 Sep 2015 12:09:50  INFO TokenizerAnnotatorPTB - Initializing 
org.apache.ctakes.core.ae.TokenizerAnnotatorPTB
22 Sep 2015 12:09:50  INFO ContextDependentTokenizerAnnotator - Finite state 
machines loaded.
22 Sep 2015 12:09:50  INFO ConstituencyParser - Initializing parser...
22 Sep 2015 12:09:53  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
22 Sep 2015 12:09:53  INFO StatusContextAnalyzer - initBoundaryData() called 
for ContextInitializer
22 Sep 2015 12:09:53  INFO ContextAnnotator - SCOPE ORDER: [1, 3]
22 Sep 2015 12:09:53  INFO NegationContextAnalyzer - initBoundaryData() called 
for ContextInitializer
22 Sep 2015 12:09:54  INFO SentenceDetector - Sentence detector model file: 
org/apache/ctakes/core/sentdetect/sd-med-model.zip
22 Sep 2015 12:09:56  INFO POSTagger - POS tagger model file: 
org/apache/ctakes/postagger/models/mayo-pos.zip
Loading configuration.
Loading feature templates.
Loading model:
.
Loading configuration.
Loading feature templates.
Loading lexica.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
..
Loading model:
.
Loading model:
...
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:

Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:
...
Loading model:

Loading model:
...
Loading model:
.
Loading model:
.
Loading model:
.
Loading model:

Loading model:
.
Loading model:

Loading model:
.
Loading model:
.

[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903188#comment-14903188
 ] 

Chris A. Mattmann commented on TIKA-1739:
-

Thanks Giuseppe - I will try this fix now and update the constructor to the 
old way. If it fixes it, I will go ahead and commit. Thank you.

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1741) Include CTAKESConfig.properties within tika-parsers resources by default

2015-09-22 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-1741:
--

 Summary: Include CTAKESConfig.properties within tika-parsers 
resources by default
 Key: TIKA-1741
 URL: https://issues.apache.org/jira/browse/TIKA-1741
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.11


It is a PITA to have to grab the example CTAKESConfig.properties file 
graciously provided by [~gostep] as the TIKA-1645 
[attachment|https://issues.apache.org/jira/secure/attachment/12737116/CTAKESConfig.properties].
I propose to have it added to 
{code}tika-parsers/src/main/resources/.../ctakes/CTAKESConfig.properties{code}
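
Once bundled, the file could be read straight off the classpath. A minimal sketch follows; the package path under resources is a hypothetical stand-in for the location proposed above (the exact segment is elided there), and the class name is illustrative:
{code:java}
import java.io.InputStream;
import java.util.Properties;

public class CTAKESConfigLoadSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical resource path mirroring the proposal above
        try (InputStream in = CTAKESConfigLoadSketch.class.getResourceAsStream(
                "/org/apache/tika/parser/ctakes/CTAKESConfig.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        System.out.println(props);
    }
}
{code}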



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902528#comment-14902528
 ] 

Tim Allison edited comment on TIKA-1737 at 9/22/15 4:16 PM:


bq.  there were many more that just had a single line of error

Try adding this to your jvm invocation 
{{-XX:-OmitStackTraceInFastThrow}}...this might be a Java optimization.
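
For example (jar name illustrative, not from this thread):
{noformat}
java -XX:-OmitStackTraceInFastThrow -jar my-indexer.jar
{noformat}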


bq. the real issue are the horrendous memory leaks caused whenever a PDFBox 
exception is thrown, that's definitely got worse

Have you done the profiling to determine the memory leaks are caused by 
exceptions being thrown?  That's interesting...


was (Author: talli...@mitre.org):
bq.  there were many more that just had a single line of error

Try adding this to your jvm invocation 
{{-JXX:-OmitStackTraceInFastThrow}}...this might be a Java optimization.


bq. the real issue are the horrendous memory leaks caused whenever a PDFBox 
exception is thrown, that's definitely got worse

Have you done the profiling to determine the memory leaks are caused by 
exceptions being thrown?  That's interesting...

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903123#comment-14903123
 ] 

Giuseppe Totaro commented on TIKA-1739:
---

Hi [~chrismattmann], Hi [~gagravarr],
I looked at the latest code of {{CTAKESParser.java}} and ran some experiments 
on my laptop.
Basically, the problem is due to the default constructor of 
{{CTAKESParser.java}}:
{code:java}
/**
 * Wraps the default Parser
 */
public CTAKESParser() {
this(TikaConfig.getDefaultConfig());
}
{code}

To use CTAKESParser, we need to create a specific configuration for it (unless 
we aim to use the parser programmatically), as described in the 
[ctakesparser-utils|https://github.com/chrismattmann/ctakesparser-utils] 
repository.
While parsing, Tika uses the default constructor of CTAKESParser, which 
overrides the given configuration at runtime. As a result, CTAKESParser is only 
"visited" by Tika, which instead falls back to the EmptyParser.

For instance, if we restore the previous default constructor (which does not 
override the given configuration), then cTAKES works properly and we obtain 
the right metadata:
{code:java}
public CTAKESParser() {
super(new AutoDetectParser());
}
{code}
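
For completeness, a minimal usage sketch against that restored constructor (the file name and handler choice are illustrative, not from the ticket):
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ctakes.CTAKESParser;
import org.apache.tika.sax.BodyContentHandler;

public class CTAKESDirectUseSketch {
    public static void main(String[] args) throws Exception {
        // With the restored constructor, this would wrap an AutoDetectParser internally
        Parser parser = new CTAKESParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("test.txt"))) {
            parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        }
        // The cTAKES metadata (e.g. ctakes:schema) should now be populated
        System.out.println(metadata);
    }
}
{code}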

[~chrismattmann] and [~gagravarr], I will be really glad to hear your feedback.
Thanks a lot,
Giuseppe

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903389#comment-14903389
 ] 

Nick Burch commented on TIKA-1739:
--

We explicitly don't let you set an {{AutoDetectParser}} in the config, it's 
something you have to choose to use, giving it the parser(s) you want used 
post-detection

In the non-cTAKES case, you get a Composite Parser that'll handle your formats 
(directly/explicitly/via Tika Config xml/via default Tika Config), then give 
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}} 
identifies the type of the document, then picks the right parser based on the 
type

In the cTAKES case, you get your chosen Composite Parser again, and give that 
to cTAKES (possibly via Tika Config xml, eg in the case above). You now create 
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}} 
identifies the type, then gives the document *with the type* to cTAKES, as 
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser 
to have the real parsing done, based on the type that {{AutoDetectParser}} 
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to 
the child Composite Parser for parsing, then decorates the result, with no 
{{AutoDetectParser}} needed
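
Expressed as rough Java, a sketch of the two arrangements described above (the {{CTAKESParser(Parser)}} decorator constructor is an assumption based on this thread, and {{args\[0\]}} stands in for the document path):
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.DefaultParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ctakes.CTAKESParser;
import org.apache.tika.sax.BodyContentHandler;

public class CTAKESWiringSketch {
    public static void main(String[] args) throws Exception {
        // The composite parser handling the real formats, decorated by cTAKES
        Parser ctakes = new CTAKESParser(new DefaultParser());
        ParseContext context = new ParseContext();

        // 1) Unknown type: AutoDetectParser detects, then hands the typed document to cTAKES
        Metadata metadata = new Metadata();
        Parser auto = new AutoDetectParser(ctakes);
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            auto.parse(is, new BodyContentHandler(), metadata, context);
        }

        // 2) Known type: call cTAKES directly, no AutoDetectParser needed
        metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "text/plain");
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            ctakes.parse(is, new BodyContentHandler(), metadata, context);
        }
    }
}
{code}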

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903422#comment-14903422
 ] 

Alan Burlison commented on TIKA-1737:
-

bq. Re the ArrayIndexOutOfBoundsException - are you using multithreading? I 
wonder if it is possibly related to PDFBOX-2824. That was fixed in the 2.0 
version only.

Yes, the app is multithreaded and it does indeed look very much like PDFBOX-2824.

bq. Re the NPE in PDFStreamEngine.java:355 - this is possibly solved in 1.8.11.

OK, thanks - PDFBOX-2987, correct?


> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903391#comment-14903391
 ] 

Nick Burch commented on TIKA-1739:
--

We explicitly don't let you set an {{AutoDetectParser}} in the config, it's 
something you have to choose to use, giving it the parser(s) you want used 
post-detection

In the non-cTAKES case, you get a Composite Parser that'll handle your formats 
(directly/explicitly/via Tika Config xml/via default Tika Config), then give 
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}} 
identifies the type of the document, then picks the right parser based on the 
type

In the cTAKES case, you get your chosen Composite Parser again, and give that 
to cTAKES (possibly via Tika Config xml, eg in the case above). You now create 
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}} 
identifies the type, then gives the document *with the type* to cTAKES, as 
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser 
to have the real parsing done, based on the type that {{AutoDetectParser}} 
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to 
the child Composite Parser for parsing, then decorates the result, with no 
{{AutoDetectParser}} needed

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903390#comment-14903390
 ] 

Nick Burch commented on TIKA-1739:
--

We explicitly don't let you set an {{AutoDetectParser}} in the config, it's 
something you have to choose to use, giving it the parser(s) you want used 
post-detection

In the non-cTAKES case, you get a Composite Parser that'll handle your formats 
(directly/explicitly/via Tika Config xml/via default Tika Config), then give 
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}} 
identifies the type of the document, then picks the right parser based on the 
type

In the cTAKES case, you get your chosen Composite Parser again, and give that 
to cTAKES (possibly via Tika Config xml, eg in the case above). You now create 
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}} 
identifies the type, then gives the document *with the type* to cTAKES, as 
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser 
to have the real parsing done, based on the type that {{AutoDetectParser}} 
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to 
the child Composite Parser for parsing, then decorates the result, with no 
{{AutoDetectParser}} needed

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1742) StackOverflowError parsing a PDF with ExtractInlineImages=true

2015-09-22 Thread Nathan Dire (JIRA)
Nathan Dire created TIKA-1742:
-

 Summary: StackOverflowError parsing a PDF with 
ExtractInlineImages=true
 Key: TIKA-1742
 URL: https://issues.apache.org/jira/browse/TIKA-1742
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.10
Reporter: Nathan Dire


Here's the file:
http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

Code to repro (ExtractInlineImages must be true):
{noformat}
Parser parser = new PDFParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
ContentHandler handler = new DefaultHandler();

config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);

context.set(PDFParserConfig.class, config);
context.set(Parser.class, parser);

InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
try {
  parser.parse(is, handler, metadata, context);
} finally {
  is.close();
}
{noformat}

Error:
{noformat}
Exception in thread "main" java.lang.StackOverflowError
at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
at java.util.HashMap.addEntry(HashMap.java:888)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
at java.util.HashMap.put(HashMap.java:509)
at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
at 
org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
at 
org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
at 
org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1744:
--
Attachment: TIKA-1744.patch

> Use java.nio.file.Path in TikaInputStream
> -
>
> Key: TIKA-1744
> URL: https://issues.apache.org/jira/browse/TIKA-1744
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1744.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if file cannot 
> be read.
> - used Path and methods in java.nio.file.Files internally 
> - add getPath() method as the counterpart to getFile()
> - modified test to use 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-1739:
-
Comment: was deleted

(was: We explicitly don't let you set an {{AutoDetectParser}} in the config, 
it's something you have to choose to use, giving it the parser(s) you want used 
post-detection

In the non-cTAKES case, you get a Composite Parser that'll handle your formats 
(directly/explicitly/via Tika Config xml/via default Tika Config), then give 
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}} 
identifies the type of the document, then picks the right parser based on the 
type

In the cTAKES case, you get your chosen Composite Parser again, and give that 
to cTAKES (possibly via Tika Config xml, eg in the case above). You now create 
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}} 
identifies the type, then gives the document *with the type* to cTAKES, as 
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser 
to have the real parsing done, based on the type that {{AutoDetectParser}} 
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to 
the child Composite Parser for parsing, then decorates the result, with no 
{{AutoDetectParser}} needed)

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1743:


 Summary: NetworkParser can create Unbounded Number of Threads
 Key: TIKA-1743
 URL: https://issues.apache.org/jira/browse/TIKA-1743
 Project: Tika
  Issue Type: Bug
Reporter: Bob Paulin


The current NetworkParser class creates a new instance of the Thread class with 
each call to parse, so it can create an unbounded number of threads. I'd suggest 
replacing this logic with a ThreadPoolExecutor and a configurable number of 
threads. This would prevent unbounded thread creation and allow the user to tune 
performance to the hardware.
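
As a rough illustration of the suggestion (not the actual NetworkParser code; the class name and pool size are placeholders for the configurable value described above):
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedParseExecutorSketch {
    // Bounded pool instead of one new Thread per parse() call;
    // the size (4 here) stands in for a value read from configuration
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());

    public void submit(Runnable parseTask) {
        pool.execute(parseTask);
    }

    public void shutdown() {
        pool.shutdown();
    }
}
{code}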



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903537#comment-14903537
 ] 

Tilman Hausherr commented on TIKA-1737:
---

No, PDFBOX-2987 is another one I fixed for you. The NPE in 
PDFStreamEngine.java:355 was (hopefully) fixed in PDFBOX-2935. To test this, 
you'd need to use a 1.8.11 snapshot version.

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch updated TIKA-1739:
-
Comment: was deleted

(was: We explicitly don't let you set an {{AutoDetectParser}} in the config, 
it's something you have to choose to use, giving it the parser(s) you want used 
post-detection

In the non-cTAKES case, you get a Composite Parser that'll handle your formats 
(directly/explicitly/via Tika Config xml/via default Tika Config), then give 
that (perhaps implicitly) to {{AutoDetectParser}}. {{AutoDetectParser}} 
identifies the type of the document, then picks the right parser based on the 
type

In the cTAKES case, you get your chosen Composite Parser again, and give that 
to cTAKES (possibly via Tika Config xml, eg in the case above). You now create 
an {{AutoDetectParser}} as before, and give it cTAKES. {{AutoDetectParser}} 
identifies the type, then gives the document *with the type* to cTAKES, as 
cTAKES claims all the mime types. cTAKES then uses its child Composite Parser 
to have the real parsing done, based on the type that {{AutoDetectParser}} 
supplied to it. When that's done, cTAKES then decorates the output.

Or, if you know the type yourself, you give that to cTAKES, which gives it to 
the child Composite Parser for parsing, then decorates the result, with no 
{{AutoDetectParser}} needed)

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1742) StackOverflowError parsing a PDF with ExtractInlineImages=true

2015-09-22 Thread Nathan Dire (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Dire updated TIKA-1742:
--
Description: 
Here's the file:
http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

Code to repro ({{ExtractInlineImages}} must be true):
{noformat}
Parser parser = new PDFParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
ContentHandler handler = new DefaultHandler();

config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);

context.set(PDFParserConfig.class, config);
context.set(Parser.class, parser);

InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
try {
  parser.parse(is, handler, metadata, context);
} finally {
  is.close();
}
{noformat}

Error (infinite recursion in {{extractImages}}):
{noformat}
Exception in thread "main" java.lang.StackOverflowError
at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
at java.util.HashMap.addEntry(HashMap.java:888)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
at java.util.HashMap.put(HashMap.java:509)
at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
at 
org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
at 
org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
at 
org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
{noformat}

  was:
Here's the file:
http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

Code to repro (ExtractInlineImages must be true):
{noformat}
Parser parser = new PDFParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
ContentHandler handler = new DefaultHandler();

config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);

context.set(PDFParserConfig.class, config);
context.set(Parser.class, parser);

InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
try {
  parser.parse(is, handler, metadata, context);
} finally {
  is.close();
}
{noformat}

Error:
{noformat}
Exception in thread "main" java.lang.StackOverflowError
at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
at java.util.HashMap.addEntry(HashMap.java:888)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
at java.util.HashMap.put(HashMap.java:509)
at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
at 
org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
at 
org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
at 
org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
at 
org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
{noformat}


> StackOverflowError parsing a PDF with ExtractInlineImages=true
> --
>
> Key: TIKA-1742
> URL: https://issues.apache.org/jira/browse/TIKA-1742
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
>Reporter: Nathan Dire
>
> Here's the file:
> http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
> Code to repro ({{ExtractInlineImages}} must be true):
> {noformat}
> Parser parser = new PDFParser();
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> PDFParserConfig config = new PDFParserConfig();
> ContentHandler handler = new DefaultHandler();
> 

[jira] [Commented] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903878#comment-14903878
 ] 

Tyler Palsulich commented on TIKA-1743:
---

[Copied from the list]

This sounds like a great idea! We should make the size of the pool configurable 
with TikaConfig.

> NetworkParser can create Unbounded Number of Threads
> 
>
> Key: TIKA-1743
> URL: https://issues.apache.org/jira/browse/TIKA-1743
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>
> The current NetworkParser class creates new instances of the Thread class 
> which each call to parse.  This could create an unbounded number of threads 
> created by this class.  I'd suggest replacing this logic with a 
> ThreadPoolExecutor and a configurable number of threads.  This will help 
> prevent creating an unbounded number of threads and allow the user to tune 
> performance to the hardware.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Created] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Tyler Palsulich
This sounds like a great idea! We should make the size of the pool
configurable with TikaConfig.

On Tue, Sep 22, 2015, 3:04 PM Bob Paulin (JIRA)  wrote:

> Bob Paulin created TIKA-1743:
> 
>
>  Summary: NetworkParser can create Unbounded Number of Threads
>  Key: TIKA-1743
>  URL: https://issues.apache.org/jira/browse/TIKA-1743
>  Project: Tika
>   Issue Type: Bug
> Reporter: Bob Paulin
>
>
> The current NetworkParser class creates new instances of the Thread class
> which each call to parse.  This could create an unbounded number of threads
> created by this class.  I'd suggest replacing this logic with a
> ThreadPoolExecutor and a configurable number of threads.  This will help
> prevent creating an unbounded number of threads and allow the user to tune
> performance to the hardware.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


[jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1745:
--
Attachment: TIKA-1745.patch

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903670#comment-14903670
 ] 

Chris A. Mattmann commented on TIKA-1739:
-

So, I'm going to take this to the list, but here is the use case:

cTAKESParser should be a parser decorator that decorates the AutoDetectParser. 
In other words, it lets the AutoDetectParser do its job, and then, once 
decorator.parse() has run, it does its own thing by adding the biomedical 
metadata knowledge on top.

That is why we originally set it up as a decorator. I haven't looked at the code 
to figure out why I now need to put DefaultParser as a sub-parser of cTAKESParser 
in the config; this is a change in behavior from the way we implemented it 
before. Anyway, thanks Nick - with your suggested update it's working now, so I 
am going to close this one off. I also updated the docs here:

https://wiki.apache.org/tika/cTAKESParser

And also updated:

https://github.com/chrismattmann/ctakesparser-utils/

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1739.
-
Resolution: Won't Fix

Nick suggested a work-around, which works fine.

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
> Attachments: TIKA-1739.patch
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1746:
--
Attachment: TIKA-1746.patch

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1740) RecursiveParserWrapper returning ContentHandler-s

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902592#comment-14902592
 ] 

Tim Allison commented on TIKA-1740:
---

Oops.  Nick beat me to it.  That was plan B.

[~gagravarr], do you think my proposal would add too much complexity?

> RecursiveParserWrapper returning ContentHandler-s
> -
>
> Key: TIKA-1740
> URL: https://issues.apache.org/jira/browse/TIKA-1740
> Project: Tika
>  Issue Type: Wish
>  Components: core, parser
>Reporter: Andrea
>
> I would like to build a mechanism that allows a custom object to be built 
> from a parsing result. This can be done easily by working with a custom 
> ContentHandler "transformer", but how can I achieve this using a 
> RecursiveParserWrapper? In that case I can only set a ContentHandlerFactory, 
> and the parser will just call the toString method and store the result as 
> metadata. Can you imagine something that returns the entire ContentHandler 
> object for each subfile instead of the result of the toString method? Of 
> course, a flag to disable the TIKA_CONTENT metadata production would also be 
> needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902657#comment-14902657
 ] 

Alan Burlison edited comment on TIKA-1737 at 9/22/15 1:51 PM:
--

bq. Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an 
exception. If you want to 'fix' (?) this at the Tika level I think you'd have 
to do something similar to what I'm doing and create a new PDFBox instance each 
time there's a PDFBox exception.

bq. Does the heap usage jump for every type of exception...that is, if I find 
any old PDF that triggers an exception, do you think I'll see this with Tika 
1.10?

Pretty much. I'm going to try to get a heap dump to work on but that means 
undoing all the workaround code I've added, so it will take a bit for me to do 
that.

bq. Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat 
instance that contains both Lucene indexer and search, where Tika is being used 
for text extraction for the Lucene indexer.



was (Author: alanbur):
.bq Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an 
exception. If you want to 'fix' (?) this at the Tika level I think you'd have 
to do something similar to what I'm doing and create a new PDFBox instance each 
time there's a PDFBox exception.

.bq Does the heap usage jump for every type of exception...that is, if I find 
any old PDF that triggers an exception, do you think I'll see this with Tika 
1.10?

Pretty much. I'm going to try to get a heap dump to work on but that means 
undoing all the workaround code I've added, so it will take a bit for me to do 
that.

.bq Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat 
instance that contains both Lucene indexer and search, where Tika is being used 
for text extraction for the Lucene indexer.


> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902668#comment-14902668
 ] 

Hudson commented on TIKA-1734:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #852 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/852/])
TIKA-1734 via Yaniv Kunda -- use java.nio.file.Path in TemporaryResources 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev=1704620)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java
* 
/tika/trunk/tika-core/src/test/java/org/apache/tika/io/TemporaryResourcesTest.java


> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for uses who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files
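For illustration, a minimal usage sketch of the Path-based additions listed above 
(method names come from the description; exact signatures are assumed rather than 
checked against the committed patch):

{noformat}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.io.TemporaryResources;

public class TempPathExample {
    public static void main(String[] args) throws IOException {
        TemporaryResources tmp = new TemporaryResources();
        try {
            // Assumed: new Path-based directory setter and Path-returning factory.
            tmp.setTemporaryFileDirectory(Paths.get("/tmp"));
            Path scratch = tmp.createTempFile();
            Files.write(scratch, "scratch data".getBytes(StandardCharsets.UTF_8));
            System.out.println("wrote " + scratch);
        } finally {
            // Deletes tracked temp files; per the issue, deletion failures now
            // surface as detailed exceptions instead of being silently ignored.
            tmp.close();
        }
    }
}
{noformat}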



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902601#comment-14902601
 ] 

Nick Burch commented on TIKA-1739:
--

I can't actually use the cTAKES parser on my machine - I tried following the 
setup instructions on the wiki but got stuck on the UMLS part, as NIH ignored 
my request. (I signed up, applied, followed the link to approve the request, 
but then never heard any more back from them)

If you have a VM with cTAKES setup on it that I can log into, I'd be happy to 
take a look. Otherwise, if you can get someone at NIH to look at my request 
from June and approve it, I can try locally!

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1734.
---
Resolution: Fixed

r1704620.

Thank you, [~kunda]!

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for uses who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902613#comment-14902613
 ] 

Tim Allison commented on TIKA-1726:
---

Thank you, [~kkrugler].  [~kunda], is there enough consensus on this to move 
forward?

> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
>
> In light of Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> -- createTemporaryPath
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, referencing the new method from the old one 
> (using the @see tag) until java.io.File itself is deprecated or otherwise 
> becomes obsolete.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> _tika-parsers:_
> - {{org.apache.tika.parser.ParsingReader}} constructor
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902622#comment-14902622
 ] 

Tim Allison commented on TIKA-1737:
---

Could we have done something at the Tika level to cause this...I wonder?

Does the heap usage jump for every type of exception...that is, if I find any 
old PDF that triggers an exception, do you think I'll see this with Tika 1.10?


Out of curiosity, are you using Tika in the same jvm as Lucene?

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902659#comment-14902659
 ] 

Tim Allison commented on TIKA-1737:
---

bq. dating back as far as 1992

Yes, I just confirmed that I can't find any overlapping stack traces from our 
govdocs1 + Common Crawl corpus.  Thank you for sharing.

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902657#comment-14902657
 ] 

Alan Burlison commented on TIKA-1737:
-

.bq Could we have done something at the Tika level to cause this...I wonder?

I don't believe so. I think PDFBox is just not cleaning up properly after an 
exception. If you want to 'fix' (?) this at the Tika level I think you'd have 
to do something similar to what I'm doing and create a new PDFBox instance each 
time there's a PDFBox exception.

.bq Does the heap usage jump for every type of exception...that is, if I find 
any old PDF that triggers an exception, do you think I'll see this with Tika 
1.10?

Pretty much. I'm going to try to get a heap dump to work on but that means 
undoing all the workaround code I've added, so it will take a bit for me to do 
that.

.bq Out of curiosity, are you using Tika in the same jvm as Lucene?

Yes, the app is the same as described in TIKA-1471. It's actually a Tomcat 
instance that contains both Lucene indexer and search, where Tika is being used 
for text extraction for the Lucene indexer.


> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1740) RecursiveParserWrapper returning ContentHandler-s

2015-09-22 Thread Andrea (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902694#comment-14902694
 ] 

Andrea commented on TIKA-1740:
--

Thanks for your reply. Of course I can create my own recursive parser, but it 
would be almost identical to the RecursiveParserWrapper, just storing the 
ContentHandler instances too. 

> RecursiveParserWrapper returning ContentHandler-s
> -
>
> Key: TIKA-1740
> URL: https://issues.apache.org/jira/browse/TIKA-1740
> Project: Tika
>  Issue Type: Wish
>  Components: core, parser
>Reporter: Andrea
>
> I would like to build a mechanism to allow a custom object being built 
> starting from a parsing result. This can be done easily by working with a 
> custom ContentHandler "transformer", but how can I achieve this result using 
> a RecursiveParserWrapper? In this case I can only set a ContentHandlerFactory 
> and the parser will just call the toString method and set it as a metadata. 
> Can you imagine something to get the entire ContentHandler object for each 
> subfile instead of the result of the toString method? Of course, it would 
> also be needed to have a flag to disable the TIKA_CONTENT metadata production.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-22 Thread Yaniv Kunda
Yes, using getPath() for the getFile() counterpart.
I'll prepare patches in a few hours.
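For illustration only, the naming convention being agreed on might look like 
this sketch (hypothetical class; not the actual TikaInputStream source):

{noformat}
import java.io.File;
import java.nio.file.Path;

// Sketch of the convention: keep the File-based accessor for compatibility
// and add a Path-based counterpart with a new name, cross-referenced via @see.
public abstract class PathAccessorSketch {

    /**
     * Existing File-returning accessor, kept as-is for current users.
     * @see #getPath()
     */
    public File getFile() {
        return getPath().toFile();
    }

    /** New counterpart returning a java.nio.file.Path, as discussed above. */
    public abstract Path getPath();
}
{noformat}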
On Sep 22, 2015 4:35 PM, "Tim Allison (JIRA)"  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902613#comment-14902613
> ]
>
> Tim Allison commented on TIKA-1726:
> ---
>
> Thank you, [~kkrugler].  [~kunda], is there enough consensus on this to
> move forward?
>
> > Augment public methods that use a java.io.File with methods that use a
> java.nio.file.Path
> >
> -
> >
> > Key: TIKA-1726
> > URL: https://issues.apache.org/jira/browse/TIKA-1726
> > Project: Tika
> >  Issue Type: Improvement
> >  Components: batch, core, gui, parser, translation
> >Reporter: Yaniv Kunda
> >Priority: Minor
> > Fix For: 1.11
> >
> >
> > In light of Java 7 already EOL, it's high time we add support for the
> new java.nio.file.Path class introduced with it, which, together with
> support methods in java.nio.file.Files and others, provide a better file
> I/O framework than java.io.File.
> > In just two cases, we have public methods in tika that only return a
> File object, and cannot be overloaded, so a different name for the new
> method must be created:
> > - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> > _Suggestions:_
> > -- addTemporaryFile
> > -- addTempFile
> > -- createTempFile
> > -- createTemporaryPath
> > - {{org.apache.tika.io.TikaInputStream#getFile()}}
> > _Suggestions:_
> > -- asFile
> > -- toPath
> > -- getPath
> > In other cases, the methods accept a File as an argument, and should
> remain as tika users might be using them - so an overloaded method that
> accepts a Path instead should be added, referencing the new method from the
> old one (using the @see tag) until java.io.File itself is deprecated or
> otherwise becomes obsolete.
> > Here is the full list of other methods:
> > _tika-app:_
> > - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> > _tika-batch:_
> > - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String,
> HANDLE_EXISTING, String)}}
> > - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> > - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> > -
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> > - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> > - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> > - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> > -
> {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File,
> File)}}
> > - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File,
> File)}}
> > - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> > _tika-core:_
> > - {{org.apache.tika.Tika#detect(File)}}
> > - {{org.apache.tika.Tika#parse(File)}}
> > - {{org.apache.tika.Tika#parseToString(File)}}
> > - {{org.apache.tika.config.TikaConfig}} constructors
> > - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> > - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> > -
> {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> > - {{org.apache.tika.io.TikaInputStream#get(File)}}
> > - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> > _tika-parsers:_
> > - {{org.apache.tika.parser.ParsingReader}} constructor
> > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> > - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> > _tika-translate:_
> > -
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
> String[], File)}}
> > Due to lack of evidence, all public methods in public non-test classes
> (and not in tika-example) are deemed part of a public API - although
> there's no formal definition of such.
> > If anyone knows of a public method which isn't accessed publicly and can
> be defined as package-private, or for another reason, please comment.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


RE: [DISCUSS] Release Tika 1.11?

2015-09-22 Thread Allison, Timothy B.
Thank _you_ for all of your work in modernizing us.  With your efforts, we'll 
be able to deprecate TikaInputStream#get(PunchCard pc) soon. :)

>>Regarding FilenameUtils.getName() - I believe that its functionality can be 
>>replaced by Path.getFileName() - and in a platform-aware manner, as each JVM 
>>distribution comes with a specific provider implementation for the OS it's 
>>for.

I agree that we should use that anytime we're interacting with the file system. 
 

However, that's actually the problem for paths that are stored within the 
document (say, an embedded resource).  Say a user creates a file on Windows: 
the file path information for the embedded file (depending on the parser and 
the file format) may be in Windows-ese, which is a problem if you try to use 
Path.getFileName() (I think... I haven't actually tested this) on a Linux 
machine.  I have actually tested this with the old File getName(), and it did 
not work cross-platform IIRC.

In short, Tika needs to have the ability to extract the file name from a path 
that was created on any platform (including old Mac and its ":" separator) 
while Tika is running on any platform.
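To make that concrete, a small sketch (assuming a Linux/Unix JVM and a made-up 
embedded path) of why Path.getFileName() alone may not be enough for paths 
stored inside documents:

{noformat}
import java.nio.file.Path;
import java.nio.file.Paths;

public class EmbeddedPathDemo {
    public static void main(String[] args) {
        // A Windows-style path as it might be stored inside a document.
        String embedded = "C:\\Users\\alice\\report.docx";

        // On a Linux JVM the default filesystem separator is '/', so the whole
        // string is one path element and getFileName() returns it unchanged.
        Path p = Paths.get(embedded);
        System.out.println(p.getFileName());   // C:\Users\alice\report.docx

        // A separator-agnostic split (roughly what a FilenameUtils-style helper
        // must do, including the old Mac ':' separator mentioned below):
        int cut = Math.max(embedded.lastIndexOf('/'),
                  Math.max(embedded.lastIndexOf('\\'), embedded.lastIndexOf(':')));
        System.out.println(embedded.substring(cut + 1));  // report.docx
    }
}
{noformat}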

-Original Message-
From: Yaniv Kunda [mailto:yaniv.ku...@answers.com] 
Sent: Monday, September 21, 2015 11:31 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Release Tika 1.11?

Thanks for the positive spirit!

Regarding FilenameUtils.getName() - I believe that its functionality can be 
replaced by Path.getFileName() - and in a platform-aware manner, as each JVM 
distribution comes with a specific provider implementation for the OS it's for.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, September 21, 2015 14:27
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Release Tika 1.11?

+1, it would be great to move a bit more into EOL'd Java 7 asap.

I'll take TIKA-1734 by tomorrow EDT.

As for the other 2, I'm personally ok waiting for 1.12, but I defer to the dev 
community.

Chris, Nick, Ray, Ken, Konstantin, if you have a chance to chime in on 
TIKA-1726, that might help move things forward.

On TIKA-1706, I share Nick's and Jukka's caution, and I also share Yaniv's 
point about duplication of code, bloat within Tika, and missing out on 
updates.  Aside from one small bit of code I'd like to keep or perhaps try 
to move into commons-io (?)[1], I think I'm now +1 to going forward with
TIKA-1706 in core...unless there is a -1 from the community.

Best,

 Tim


[1] I added some customizations for old Mac OS behavior (treat ":" as the file
separator) in FilenameUtils.getName() that I don't want to lose.


-Original Message-
From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
Sent: Sunday, September 20, 2015 7:15 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Release Tika 1.11?

I would really like to push the following:

https://issues.apache.org/jira/browse/TIKA-1706 - Bring back commons-io to 
tika-core This requires a decision to re-include commons-io as a dependency of 
tika-core.
All the pros and cons have been already debated, but no decision has been made.

https://issues.apache.org/jira/browse/TIKA-1726 - Augment public methods that 
use a java.io.File with methods that use a java.nio.file.Path Since this adds 
new methods to the public API, I requested the group to make a decision about 
the new names - but have not received something definite.
However, I did create a subtask -
https://issues.apache.org/jira/browse/TIKA-1734 Use java.nio.file.Path in 
TemporaryResources - using [~tallison]'s suggestion, which has not been 
committed yet.

If decisions are made on the above issues, I can quickly create patches for 
them.

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Saturday, September 19, 2015 08:10
To: dev@tika.apache.org
Subject: [DISCUSS] Release Tika 1.11?

Hey Guys and Gals,

I’d like to roll a 1.11 release. There is TIKA-1716 which in particular allows 
some neat functionality in tika-python:
https://github.com/chrismattmann/tika-python/pull/67


Anything else to try and get into the release?

If not, I’ll produce an RC #1 by end of weekend.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902528#comment-14902528
 ] 

Tim Allison commented on TIKA-1737:
---

bq.  there were many more that just had a single line of error

Try adding {{-XX:-OmitStackTraceInFastThrow}} to your JVM invocation...the 
single-line errors might be due to a JIT optimization that omits stack traces 
for frequently thrown exceptions.


bq. the real issue are the horrendous memory leaks caused whenever a PDFBox 
exception is thrown, that's definitely got worse

Have you done the profiling to determine the memory leaks are caused by 
exceptions being thrown?  That's interesting...

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902580#comment-14902580
 ] 

Alan Burlison commented on TIKA-1737:
-

The heap dump is huge and the profiler struggles to cope so I haven't managed 
to do any detailed analysis yet. There is a pool of Tika parser threads that 
handle the corpus; each thread is reused to extract text from multiple 
documents, which is then fed into Lucene. With Tika 1.10, every time a 
Tika instance sees an exception from PDFBox the heap usage jumps up and doesn't 
recover, leading to OOM when the index is just a short way through. That 
doesn't happen with Tika 1.5. I've modified the indexer so that rather than 
just logging the Tika exceptions it destroys the relevant Tika instance, does a 
forced GC and then creates a new Tika instance. With Tika 1.10 that keeps the 
heap size within reasonable bounds. To me that seems like pretty conclusive 
proof that PDFBox is leaking when it throws exceptions.
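Roughly, the workaround described above looks like the following sketch 
(illustrative only; everything except Tika's own API is made up):

{noformat}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

public class ResilientExtractor {

    private Tika tika = new Tika();

    /** Extract text; on any parse failure, discard the Tika instance and start fresh. */
    public String extract(Path doc) {
        try (InputStream in = Files.newInputStream(doc)) {
            return tika.parseToString(in);
        } catch (Exception e) {      // TikaException, IOException, or a runtime error out of PDFBox
            tika = new Tika();       // drop the (possibly leaking) instance
            System.gc();             // forced GC, as described in the comment above
            return "";               // index the document without body text
        }
    }
}
{noformat}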

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902522#comment-14902522
 ] 

Tim Allison commented on TIKA-1737:
---

Thank you, [~tilman]!

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902567#comment-14902567
 ] 

Tim Allison commented on TIKA-1734:
---

About to commit, unless you'd like to. :)

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for uses who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1740) RecursiveParserWrapper returning ContentHandler-s

2015-09-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902585#comment-14902585
 ] 

Nick Burch commented on TIKA-1740:
--

You might be better off writing your own Recursion handler. Take a look at how 
things like RecursiveParserWrapper and the Tika App embedded resources 
extractor work, and then do something specialised for your use-case. 
{{RecursiveParserWrapper}} is designed to make things easy for many common use 
cases, but isn't expected to work for everyone!

> RecursiveParserWrapper returning ContentHandler-s
> -
>
> Key: TIKA-1740
> URL: https://issues.apache.org/jira/browse/TIKA-1740
> Project: Tika
>  Issue Type: Wish
>  Components: core, parser
>Reporter: Andrea
>
> I would like to build a mechanism to allow a custom object being built 
> starting from a parsing result. This can be done easily by working with a 
> custom ContentHandler "transformer", but how can I achieve this result using 
> a RecursiveParserWrapper? In this case I can only set a ContentHandlerFactory 
> and the parser will just call the toString method and set it as a metadata. 
> Can you imagine something to get the entire ContentHandler object for each 
> subfile instead of the result of the toString method? Of course, it would 
> also be needed to have a flag to disable the TIKA_CONTENT metadata production.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1740) RecursiveParserWrapper returning ContentHandler-s

2015-09-22 Thread Andrea (JIRA)
Andrea created TIKA-1740:


 Summary: RecursiveParserWrapper returning ContentHandler-s
 Key: TIKA-1740
 URL: https://issues.apache.org/jira/browse/TIKA-1740
 Project: Tika
  Issue Type: Wish
  Components: core, parser
Reporter: Andrea


I would like to build a mechanism to allow a custom object being built starting 
from a parsing result. This can be done easily by working with a custom 
ContentHandler "transformer", but how can I achieve this result using a 
RecursiveParserWrapper? In this case I can only set a ContentHandlerFactory and 
the parser will just call the toString method and set it as a metadata. Can you 
imagine something to get the entire ContentHandler object for each subfile 
instead of the result of the toString method? Of course, it would also be 
needed to have a flag to disable the TIKA_CONTENT metadata production.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-22 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902553#comment-14902553
 ] 

Bob Paulin commented on TIKA-1734:
--

+1 from me on this [~kunda]

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for uses who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1740) RecursiveParserWrapper returning ContentHandler-s

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902591#comment-14902591
 ] 

Tim Allison commented on TIKA-1740:
---

How about we store a list of <Metadata, ContentHandler> pairs instead of Metadata 
objects.  The current {{getMetadata()}} would behave as it currently does.

We'll add {{getMetadataAndHandlers()}}, which would return the list of 
<Metadata, ContentHandler> pairs.  This would not include TIKA_CONTENT.

The current {{getMetadata}} will call {{getMetadataAndHandlers}} under the hood 
and add TIKA_CONTENT.  An initial concern is that this will double memory at 
the time that {{getMetadata}} is called, but as I think about the way the 
recursion is working, we're pretty much doing that now.

How does this sound?
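A rough sketch of what that could look like (hypothetical class, field, and key 
names; not the actual RecursiveParserWrapper code):

{noformat}
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.tika.metadata.Metadata;
import org.xml.sax.ContentHandler;

public class RecursiveResultsSketch {

    // One entry per (embedded) document, in parse order.
    private final List<Map.Entry<Metadata, ContentHandler>> results =
            new ArrayList<Map.Entry<Metadata, ContentHandler>>();

    void record(Metadata metadata, ContentHandler handler) {
        results.add(new SimpleEntry<Metadata, ContentHandler>(metadata, handler));
    }

    /** Proposed accessor: metadata plus the live handler, no TIKA_CONTENT serialization. */
    public List<Map.Entry<Metadata, ContentHandler>> getMetadataAndHandlers() {
        return results;
    }

    /** Existing behaviour, layered on top: serialize each handler into the metadata. */
    public List<Metadata> getMetadata() {
        List<Metadata> out = new ArrayList<Metadata>();
        for (Map.Entry<Metadata, ContentHandler> e : results) {
            Metadata m = e.getKey();
            // "X-TIKA:content" stands in here for the TIKA_CONTENT property key.
            m.set("X-TIKA:content", e.getValue().toString());
            out.add(m);
        }
        return out;
    }
}
{noformat}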
  

> RecursiveParserWrapper returning ContentHandler-s
> -
>
> Key: TIKA-1740
> URL: https://issues.apache.org/jira/browse/TIKA-1740
> Project: Tika
>  Issue Type: Wish
>  Components: core, parser
>Reporter: Andrea
>
> I would like to build a mechanism to allow a custom object being built 
> starting from a parsing result. This can be done easily by working with a 
> custom ContentHandler "transformer", but how can I achieve this result using 
> a RecursiveParserWrapper? In this case I can only set a ContentHandlerFactory 
> and the parser will just call the toString method and set it as a metadata. 
> Can you imagine something to get the entire ContentHandler object for each 
> subfile instead of the result of the toString method? Of course, it would 
> also be needed to have a flag to disable the TIKA_CONTENT metadata production.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1739) cTAKESParser doesn't work in 1.11

2015-09-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902823#comment-14902823
 ] 

Chris A. Mattmann commented on TIKA-1739:
-

Nick, I wonder if the approval got lost in email or filtered as bulk mail. Can 
you contact the NIH here:

https://uts.nlm.nih.gov//license.html

and try signing up again? My account was approved in < 1 hour when I initially 
did it, so I wonder if their mail servers were messed up or if the mail got 
lost in SPAM.

Also, I do have a VM that you can log into and try the cTAKES setup on. I'll 
contact you offline about that.

> cTAKESParser doesn't work in 1.11
> -
>
> Key: TIKA-1739
> URL: https://issues.apache.org/jira/browse/TIKA-1739
> Project: Tika
>  Issue Type: Bug
>  Components: parser, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> Tika cTAKESParser integration doesn't work in 1.11. The parser is called, but 
> blank metadata comes back:
> {noformat}
> curl -T test.txt -H "Content-Type: text/plain" 
> http://localhost:/rmeta/text
> [{"Content-Type":"text/plain","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ctakes.CTAKESParser","org.apache.tika.parser.EmptyParser"],"X-TIKA:parse_time_millis":"20371","ctakes:schema":"coveredText:start:end:ontologyConceptArr"}
> {noformat}
> [~gagravarr] I wonder if something that happened in TIKA-1653 broke it?
> http://svn.apache.org/viewvc?view=revision=1684199
> [~gostep] can you help me look here?
> I'm working on 
> https://github.com/chrismattmann/shangridocs/tree/convert-wicket which is 
> where I first saw this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902835#comment-14902835
 ] 

Tim Allison commented on TIKA-1737:
---

See PDFBOX-2986 for a resource leak discovered through testing against a file 
in Common Crawl that triggered a ttfparser exception that was close to some of 
yours.  I think this didn't affect you because your ttf exceptions are 
triggered within a PDFFile, and the MemoryTTFDataStream would have been used.

bq. It's actually a Tomcat instance that contains both Lucene indexer and 
search, where Tika is being used for text extraction for the Lucene indexer.

Ah, ok, that's right.  Apologies for the repetition with my soapbox in 
TIKA-1471...I realize this is the easiest way to build an app, but Tika can run 
into serious problems, and I'd strongly encourage trying to keep Tika out of 
the same JVM as Lucene if at all possible.  This is not to say we shouldn't fix 
Tika and its dependencies when problems are found!
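For what it's worth, one hedged sketch of keeping Tika out of the indexer's JVM 
is to shell out to tika-app and read the extracted text from its stdout (the jar 
name below is an assumption; adjust to the local build):

{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class OutOfProcessExtraction {

    public static String extract(Path doc) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "java", "-jar", "tika-app-1.10.jar", "--text", doc.toString())
                .redirectErrorStream(true)    // fold stderr into stdout
                .start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        int exit = p.waitFor();               // a crash or OOM only kills the child JVM
        if (exit != 0) {
            throw new IOException("tika-app exited with " + exit);
        }
        return sb.toString();
    }
}
{noformat}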

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)