[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-02-01 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127061#comment-15127061
 ] 

Thamme Gowda N commented on TIKA-1816:
--

[~talli...@mitre.org] Sure, I will have a look.

Correct me if I am wrong (I have been somewhat away from the 2.x discussions):
NER is now provided by *tika-parser-advanced-module*, so the tests should 
be set up over there, am I correct?

> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Tim Allison
>  Labels: memex
> Fix For: 1.12
>
> Attachments: TIKA-1816-proxy-fix.patch
>
>
> NamedEntityParser has a hard setup requirement: NER models must be downloaded 
> from remote servers and added to the classpath.
> These model files are huge and hence are not added to source control,
> so the tests are likely to fail in various environments.
> Make a best effort to set up the tests, but in the worst case skip the tests 
> instead of failing the whole build.
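
The "best effort, otherwise skip" idea in the description above might look 
roughly like the sketch below. It is not the code from the attached patch; the 
class name, method names and model path are hypothetical and exist only for 
illustration. It uses nothing beyond the JDK: it checks whether a model is on 
the classpath and reports a skip instead of a failure when it is not.

```java
import java.io.IOException;
import java.io.InputStream;

public class LenientModelCheck {

    /** Returns true when the named resource can be opened from the classpath. */
    public static boolean resourceAvailable(String resourcePath) {
        ClassLoader loader = LenientModelCheck.class.getClassLoader();
        // A null resource is allowed in try-with-resources; it is simply not closed.
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            return in != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** Runs the test body only if the model is present; otherwise reports a skip. */
    public static String runOrSkip(String modelPath, Runnable testBody) {
        if (!resourceAvailable(modelPath)) {
            return "SKIPPED: " + modelPath + " is not on the classpath";
        }
        testBody.run();
        return "RAN";
    }

    public static void main(String[] args) {
        // Hypothetical model path, used only to demonstrate the skip branch.
        System.out.println(runOrSkip("hypothetical/ner-model.bin", () -> { }));
    }
}
```

In a JUnit 4 test the same check would normally be passed to 
org.junit.Assume.assumeTrue(...), so that the runner marks the test as skipped 
rather than failed.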



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-02-01 Thread Sam H (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126248#comment-15126248
 ] 

Sam H commented on TIKA-1841:
-

Hi [~gagravarr],

There has been no reaction to this issue in the past 6 days. Can I assume my 
proposed structure is ok?

I have already started implementing this:
https://github.com/zetisam/tika/tree/TIKA-1841

The PPT code allows you to get the slide-notes-footer and slide-notes-header 
separately, but the POI code seems to add these fields to the output anyway, so 
I don't know if this is of much use. 

I couldn't find how to do this in PPTX, so maybe this part can be dropped (in 
order not to have duplicate content).

The same for slide footers in general. They seem to be added to the content, so 
having them as a separate div would be duplicating this content.

Any thoughts?

> Different XML output structure for PPT and PPTX
> ---
>
> Key: TIKA-1841
> URL: https://issues.apache.org/jira/browse/TIKA-1841
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>
> Issue is slightly related to TIKA-1840
> I've noticed that the XML structure of PowerPoint (PPT) and PPTX files is 
> different. 
> The structure for PPTX seems as follows:
> {code}
> 
> 
>  //optional
>  //optional
> ...
> 
> 
>  //optional
>  //optional
> {code}
> Note that there's no parent slide element to indicate the start and end of 
> each slide.
> For PPT the structure is as follows:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> 
> {code}
> In my application, I'm using XPath to get the desired information. As the 
> XML structure is different, I have to vary my XPath queries depending on 
> whether the file is PPT (old) or PPTX (new). It would be nice for Tika to 
> return the same XML for both.
> I would propose changing the XML structure to this:
> {code}
> 
>   
> 
> 
>  //added in TIKA-1840
>  
>   
>   ...
>   
> 
> 
>  //added in TIKA-1840
> 
>   
> 
> {code}
> So, essentially, like the current PPT output, but without the list of notes 
> at the end (as this is also omitted for PPTX).
> On the one hand this generalizes PPT(X) handling, on the other it can break 
> existing (external) functionality relying on a specific XML output format.
> I don't know if this is something the project wants fixed or not. If so, I'm 
> willing to donate my time.
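
To illustrate why a parent element per slide helps XPath users, here is a small 
sketch using only the JDK's XPath API. The XML string is a made-up, simplified 
stand-in: the element and class names ("slide", "slide-content", "slide-notes") 
are assumptions for illustration and are not taken from Tika's actual output.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SlideXPathSketch {

    // Hypothetical unified output: every slide is wrapped in its own parent div.
    public static final String UNIFIED =
        "<body>"
      + "<div class='slide'><div class='slide-content'>First</div>"
      + "<div class='slide-notes'>Note A</div></div>"
      + "<div class='slide'><div class='slide-content'>Second</div></div>"
      + "</body>";

    /** Counts the nodes selected by an XPath expression in the given XML text. */
    public static int count(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // One query per question, regardless of whether the source was PPT or PPTX.
        System.out.println(count(UNIFIED, "//div[@class='slide']"));
        System.out.println(count(UNIFIED, "//div[@class='slide'][div[@class='slide-notes']]"));
    }
}
```

With a parent element, "which slides have notes" is a single predicate on the 
slide div. Without it, the caller has to infer slide boundaries from sibling 
order, which is what forces format-specific queries.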





[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-02-01 Thread Giovanni Usai (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126272#comment-15126272
 ] 

Giovanni Usai commented on TIKA-1843:
-

Hi Nick,
The Sigrun owner has merged my modifications, so we can go on with the integration.

Do I have to perform the steps in the guide 
(http://central.sonatype.org/pages/ossrh-guide.html), or will they be done by 
you?

Thanks,
Giovanni

> Tika parser for SEG-Y files and new MIME type application/segy
> --
>
> Key: TIKA-1843
> URL: https://issues.apache.org/jira/browse/TIKA-1843
> Project: Tika
>  Issue Type: New Feature
>  Components: mime, parser
>Reporter: Giovanni Usai
>Priority: Minor
>
> This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and 
> .sgy). 
> The SEG-Y format is used to store seismic data; you can find more information 
> here: http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.
> I have:
> - added a new MIME type application/segy matching the file name extensions 
> .segy, .seg and .sgy.
> - created a new SEGYParser, matching that MIME type.
> In order to parse the SEG-Y files, I am using a modified version of the 
> sigrun code (available under the Apache license, here: 
> https://github.com/mikhail-aksenov/sigrun). Notably, I have made a fix and 
> changed some method signatures so that it can read from a ReadableByteChannel 
> instead of a FileChannel.
> For the moment I have put it directly into Tika's new segy package. Is 
> this the right thing to do, or should I reference it as an external library 
> and modify the pom.xml accordingly?
> Thanks and best regards,
> Giovanni
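
The switch from FileChannel to ReadableByteChannel mentioned above matters 
because a parser is usually handed a stream rather than a file. The sketch 
below is not the sigrun or SEGYParser code; the class and method names are 
hypothetical. It shows a fixed-length read from any ReadableByteChannel, using 
the 3200-byte textual header and 400-byte binary header sizes of the SEG-Y 
layout as sample lengths, with zero-filled stand-in data.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChannelReadSketch {

    /** Reads exactly length bytes from the channel, or throws EOFException. */
    public static ByteBuffer readFully(ReadableByteChannel channel, int length)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        // A single read() may return fewer bytes than requested, so loop until full.
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new EOFException("stream ended after " + buffer.position() + " bytes");
            }
        }
        buffer.flip(); // switch the buffer from writing to reading
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        byte[] standIn = new byte[3200 + 400]; // zero-filled placeholder, not real SEG-Y data
        // Channels.newChannel wraps any InputStream, so no file on disk is needed.
        try (ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(standIn))) {
            ByteBuffer textHeader = readFully(channel, 3200);
            ByteBuffer binaryHeader = readFully(channel, 400);
            System.out.println(textHeader.remaining() + " " + binaryHeader.remaining());
        }
    }
}
```

Because FileChannel itself implements ReadableByteChannel, code written 
against the interface still accepts files.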





[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126477#comment-15126477
 ] 

Ian Williams commented on TIKA-1845:


Tim - thanks for confirming what's in the attachment and for the heads up about 
the metadata.  I've attached a new cut-down example that fails with the same 
error.  Please use this sample for unit tests etc.

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
> Attachments: example-that-fails.rtf
>
>
> I have some patient letters that are RTF documents.  When I extract the text 
> from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
> 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing. 
> I wondered whether the error might be related to the following change that 
> was introduced in 1.6?:
>   * Made RTFParser's list handling slightly more robust against corrupt
> list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are 
> real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> 
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
>     at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
>     at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>     at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at 
> 
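
For anyone who would rather reproduce the request from Java than from curl, 
the PUT in step 1 can be approximated with the JDK's java.net.http client 
(Java 11 or later). This is only a sketch: the class and method names are 
hypothetical, the RTF bytes are a tiny placeholder rather than the failing 
attachment, and the request is built but not sent, so no running tika-server 
is needed to try it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TikaPutRequestSketch {

    /** Builds the same PUT as the curl command: RTF body in, plain text requested back. */
    public static HttpRequest buildRequest(byte[] rtfBytes) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
            .header("Content-Type", "application/rtf")
            .header("Accept", "text/plain")
            .PUT(HttpRequest.BodyPublishers.ofByteArray(rtfBytes))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder document; substitute the bytes of the real file to reproduce the issue.
        byte[] body = "{\\rtf1 hello}".getBytes(StandardCharsets.US_ASCII);
        HttpRequest request = buildRequest(body);
        System.out.println(request.method() + " " + request.uri());
        // To actually send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```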

[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126532#comment-15126532
 ] 

Nick Burch commented on TIKA-1841:
--

Ideally we would break out the header and footer into separate divs/paragraphs 
within the slide's contents. If you can tweak the code to do that, please do! 
If only one format makes it easy, do it "right" for that one, and add a TODO 
for the other.

Otherwise, assuming no last-minute objections (e.g. from [~talli...@mitre.org]), 
go ahead with your plan, and submit a pull request once it's all ready and 
unit tested!



[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126479#comment-15126479
 ] 

Tim Allison commented on TIKA-1845:
---

There are two problems that this file reveals.

1) The RTFEmbObjHandler is incorrectly setting the {{CONTENT_TYPE}} to the 
subtype:
{noformat}
 metadata.set(Metadata.CONTENT_TYPE, mediaType.getSubtype());
{noformat}
2) In tika-server's TikaResource, this line 
{noformat}
return MediaType.parse(ct);
{noformat}
can return null.  If we're not able to parse the {{String ct = 
metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);}}, we should 
not return null; we should fall back to detection.
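
The second point, falling back to detection when the declared content type 
cannot be parsed, could be sketched as follows. This is not the TikaResource 
code: parseOrNull is a deliberately simplified stand-in for a media type 
parser, and the detector is a hypothetical callback, so only the control flow 
is meant to carry over.

```java
import java.util.Locale;
import java.util.function.Supplier;

public class ContentTypeFallbackSketch {

    /** Stand-in parser: returns null unless the value has the form "type/subtype". */
    public static String parseOrNull(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        int slash = trimmed.indexOf('/');
        boolean valid = slash > 0
            && slash < trimmed.length() - 1
            && trimmed.indexOf('/', slash + 1) < 0;
        return valid ? trimmed.toLowerCase(Locale.ROOT) : null;
    }

    /** Uses the declared type when it parses; otherwise asks the detector. */
    public static String resolve(String declared, Supplier<String> detector) {
        String parsed = parseOrNull(declared);
        return parsed != null ? parsed : detector.get();
    }

    public static void main(String[] args) {
        // A bare subtype such as "rtf" (the first problem above) does not parse,
        // so the detector result is used instead of a null type.
        System.out.println(resolve("rtf", () -> "application/rtf"));
        System.out.println(resolve("text/plain", () -> "application/octet-stream"));
    }
}
```

The first problem would presumably be addressed at the source, by storing the 
full type string rather than only the subtype, but the fallback keeps the 
server robust against any other malformed value.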


[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126506#comment-15126506
 ] 

Tim Allison commented on TIKA-1845:
---

My failure on TIKA-1010 to set the MIME type correctly.


[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Attachment: example-that-fails.rtf


[jira] [Assigned] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-1845:
-

Assignee: Tim Allison


[jira] [Resolved] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-02-01 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1830.
---
Resolution: Fixed
  Assignee: Tim Allison

[~thetaphi], I'm sorry I didn't get this into 1.12.  I'd like to blame 
[snowzilla|https://en.wikipedia.org/wiki/January_2016_United_States_blizzard] 
for keeping me from my dev environment.

[~tilman] and other PDFBox colleagues, thank you for all of your work on this!

Fellow tika-devs, this was my first push to the public git repo.  Let me know 
if I botched anything.

I realize I still have to make this change to the 2.x branch.

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.13
>
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-02-01 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1830:
--
Fix Version/s: 1.13

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.13
>
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127009#comment-15127009
 ] 

Tim Allison commented on TIKA-1816:
---

[~thammegowda], if you have a chance, would you be willing to try your hand at 
a patch for the 2x branch?  I'm not having luck.

> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Tim Allison
>  Labels: memex
> Fix For: 1.12
>
> Attachments: TIKA-1816-proxy-fix.patch
>
>
> NamedEntityParser has a hard setup requirement like downloading of NER models 
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests 
> instead of failing the whole build process.
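
The "skip instead of fail" behaviour described in the issue can be sketched as 
follows.  This is only a Python analogue of the idea, not Tika's Java test 
code: the model_available helper, the model_path attribute, the 
NER_MODEL_PATH environment variable and the ner-model.bin file name are all 
hypothetical names made up for illustration.

```python
import os
import unittest


def model_available(path):
    """Best-effort check: the model file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


class NamedEntityParserTest(unittest.TestCase):
    # Hypothetical location of a downloaded NER model (not a real Tika path).
    model_path = os.environ.get("NER_MODEL_PATH", "ner-model.bin")

    def test_recognises_entities(self):
        if not model_available(self.model_path):
            # Report the test as skipped rather than failing the whole build.
            self.skipTest("NER model not available: " + self.model_path)
        # Assertions against the real parser would go here.
        self.assertGreater(os.path.getsize(self.model_path), 0)
```

When the model file is missing, the runner records the test as skipped and the 
overall run still counts as successful, which is the outcome the issue asks 
for.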





[jira] [Created] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)
Ian Williams created TIKA-1845:
--

 Summary: Unable to extract content from certain RTFs using 
tika-server versions since 1.5 
 Key: TIKA-1845
 URL: https://issues.apache.org/jira/browse/TIKA-1845
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.11, 1.9, 1.6
 Environment: Windows
Reporter: Ian Williams


I have some patient letters that are RTF documents.  When I extract the text 
from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing.

I wondered whether the error might be related to the following change that was 
introduced in 1.6?:
  * Made RTFParser's list handling slightly more robust against corrupt
list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are 
real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian


Steps to reproduce issue


1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf 
http://localhost:9998/tika --header "Content-Type: application/rtf" --header 
"Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with 
tika-server-1.6.jar


2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
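
For anyone without curl to hand, the PUT request from step 1 above can be 
sketched in Python as follows.  This is a minimal sketch that mirrors the URL 
and the two headers from the curl command; the function names and the 
TIKA_URL constant are made up for illustration, and nothing here is part of 
tika-server itself.

```python
from urllib.request import Request, urlopen

# URL taken from the curl command in the report; adjust host/port as needed.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_put_request(rtf_bytes, url=TIKA_URL):
    """Build (but do not send) the same PUT request that the curl command issues."""
    return Request(
        url,
        data=rtf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/rtf",
            "Accept": "text/plain",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send the file at `path` to a running tika-server and return the response text.

    urlopen raises urllib.error.HTTPError if the server answers with an
    error status.
    """
    with open(path, "rb") as handle:
        request = build_tika_put_request(handle.read(), url)
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("test-anonymised-letter.rtf") against each server version 
in turn should show the same contrast as the curl command, assuming a 
tika-server is listening on localhost:9998.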

[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126320#comment-15126320
 ] 

Tim Allison commented on TIKA-1845:
---

From the stacktrace, this looks to be related to TIKA-1010.  Will take a look 
shortly.

To attach files to JIRA, {{More->Attach Files}}.

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
>

[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Description: 
I have some patient letters that are RTF documents.  When I extract the text 
from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing.  I'm not sure how to attach files 
to this issue so here is a link to an Evernote note containing an example RTF 
that fails:
https://www.evernote.com/shard/s66/sh/4a003611-2400-4959-a1cc-2be5b3efe2cf/284a6f2dd3e0a290

I wondered whether the error might be related to the following change that was 
introduced in 1.6?:
  * Made RTFParser's list handling slightly more robust against corrupt
list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are 
real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian


Steps to reproduce issue


1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf 
http://localhost:9998/tika --header "Content-Type: application/rtf" --header 
"Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with 
tika-server-1.6.jar


2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at 

[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Description: 
I have some patient letters that are RTF documents.  When I extract the text 
from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing.  I'm not sure how to attach files 
to this issue so here is a link to an Evernote note containing an example RTF 
that fails:
http://www.evernote.com/l/AEJKADYRJABJWaHMK-Wz7-LPKEpvLdPgopA/

I wondered whether the error might be related to the following change that was 
introduced in 1.6?:
  * Made RTFParser's list handling slightly more robust against corrupt
list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are 
real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian


Steps to reproduce issue


1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf 
http://localhost:9998/tika --header "Content-Type: application/rtf" --header 
"Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with 
tika-server-1.6.jar


2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at 

[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Attachment: (was: test-anonymised-letter.rtf)

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
>

[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126317#comment-15126317
 ] 

Nick Burch commented on TIKA-1845:
--

Near the top of the jira page are some buttons, please hit "More" then "Attach 
Files", and then upload the smallest file you have which triggers the issue. We 
can then use that for investigating, testing and (hopefully!) later unit 
testing of fixes.

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
>

[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126354#comment-15126354
 ] 

Ian Williams commented on TIKA-1845:


I've deleted the attachment for the time being - sorry.   Please contact me 
directly for a sample.  The reason is that I don't know what's in the embedded 
file within the RTF.

> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> -
>
> Key: TIKA-1845
> URL: https://issues.apache.org/jira/browse/TIKA-1845
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6, 1.9, 1.11
> Environment: Windows
>Reporter: Ian Williams
>
> I have some patient letters that are RTF documents.  When I extract the text 
> from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
> 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing.  I'm not sure how to attach files 
> to this issue so here is a link to an Evernote note containing an example RTF 
> that fails:
> http://www.evernote.com/l/AEJKADYRJABJWaHMK-Wz7-LPKEpvLdPgopA/
> I wondered whether the error might be related to the following change that 
> was introduced in 1.6?:
>   * Made RTFParser's list handling slightly more robust against corrupt
> list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are 
> real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
> Steps to reproduce issue
> 
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary 
> @test-anonymised-letter.rtf http://localhost:9998/tika --header 
> "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with 
> tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@32a6dc
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
> at 
> org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
> at 
> org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
> at 
> org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at 
> 

[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126375#comment-15126375
 ] 

Ian Williams commented on TIKA-1845:


I'm just being cautious because I don't want to share anything in a public forum 
that isn't 100% anonymised, and I don't know what's in that embedded file 
within the RTF (it doesn't show up within the document in Word).  Possibly a logo 
or something like that.


[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Description: 
I have some patient letters that are RTF documents.  When I extract the text 
from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing. 

I wondered whether the error might be related to the following change that was 
introduced in 1.6?:
  * Made RTFParser's list handling slightly more robust against corrupt
list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are 
real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian


Steps to reproduce issue


1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf 
http://localhost:9998/tika --header "Content-Type: application/rtf" --header 
"Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with 
tika-server-1.6.jar


2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.rtf.RTFParser@32a6dc
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at 
org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
at 
org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
at 
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at 
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
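
For anyone who wants to script step 1 rather than type the curl command, here is a minimal Python sketch of the same request. The URL, the PUT method and the two headers are taken from the curl command and the server log above; the function names `build_extract_request` and `extract_text` are made up for this illustration, and the sketch makes no claim about what any particular tika-server version will return.

```python
import urllib.error
import urllib.request

# Address taken from the server log in the report ("publish address").
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(rtf_bytes, url=TIKA_URL):
    """Build (but do not send) the PUT request that the curl command issues.

    Hypothetical helper name; only the method, URL and headers come from
    the report.
    """
    return urllib.request.Request(
        url,
        data=rtf_bytes,
        method="PUT",
        headers={"Content-Type": "application/rtf", "Accept": "text/plain"},
    )


def extract_text(rtf_bytes, url=TIKA_URL):
    """Send the request and return (status, body).

    Needs a running tika-server. HTTP error responses are returned to the
    caller instead of being raised, so a failing parse can be inspected.
    """
    request = build_extract_request(rtf_bytes, url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")
```

To use it, read test-anonymised-letter.rtf in binary mode and pass the bytes to `extract_text`; running the same call against a 1.5 server and a 1.6 server should show the difference described above.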

[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
---
Attachment: test-anonymised-letter.rtf


[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Ian Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126340#comment-15126340
 ] 

Ian Williams commented on TIKA-1845:


OK - thanks.  I've attached the file now.


[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126366#comment-15126366
 ] 

Tim Allison commented on TIKA-1845:
---

Scooped it from Evernote.  Let me know if I should srm it.


[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5

2016-02-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126429#comment-15126429
 ] 

Tim Allison commented on TIKA-1845:
---

It looks like there is no trouble with tika-app, either with straight extraction or 
with recursive JSON (-J).  If you use the -z option, you'll see that the attachment is 
a WMF file that looks like a blank, black banner.  There may be items in the 
metadata, though, that are sensitive (name of original author and organization).

Looking more closely at the stack trace, the detector is returning "null" in 
the AutoDetectParser with tika-server, but not with tika-app.  I can trigger 
this stack trace in our unit tests.  Yay!  So, I think TIKA-1010 made this 
problem visible, but the underlying fault appears to be in the detector used by 
tika-server's AutoDetectParser.
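
For reference, the tika-app checks mentioned above could be scripted roughly as follows. This is only a sketch under assumptions: the jar file name `tika-app-1.11.jar` is a guess at a local file name, `-t` (plain text) stands in for "straight extraction", and the helper names are invented for this example. `-J` and `-z` are the options named in the comment.

```python
import subprocess


def tika_app_commands(rtf_path, jar="tika-app-1.11.jar"):
    """Return the three tika-app command lines as argument lists.

    The jar name is an assumed local file name, not something stated in
    the issue.
    """
    base = ["java", "-jar", jar]
    return {
        "text": base + ["-t", rtf_path],              # plain-text extraction
        "recursive_json": base + ["-J", rtf_path],    # recursive JSON output
        "extract_embedded": base + ["-z", rtf_path],  # write out attachments
    }


def run_check(command):
    """Run one command and return (exit code, stdout, stderr)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr
```

Running each entry of `tika_app_commands("test-anonymised-letter.rtf")` through `run_check` gives a quick side-by-side with the tika-server result for the same file.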


[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy

2016-02-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126450#comment-15126450
 ] 

Nick Burch commented on TIKA-1843:
--

Ideally you'd work with the Sigrun owner to have them do it - it's best if the 
people who "own" the code and "do" the releases are also the ones who push the 
files to Maven Central. (It doesn't have to be; there is the third-party 
process, but it's certainly preferred.)

If I were you, I'd review the docs, then suggest any POM fixes to them. Once 
those are in, work with the Sigrun team to get them to request their access 
and get things uploaded.

If you need an example project to crib from for the pom, my own 
https://github.com/Gagravarr/VorbisJava/blob/master/parent/pom.xml is one place 
you could start (amongst others!)

> Tika parser for SEG-Y files and new MIME type application/segy
> --
>
> Key: TIKA-1843
> URL: https://issues.apache.org/jira/browse/TIKA-1843
> Project: Tika
>  Issue Type: New Feature
>  Components: mime, parser
>Reporter: Giovanni Usai
>Priority: Minor
>
> This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and 
> .sgy). 
> The SEG-Y format is used to store seismic data; you can find more information 
> here: http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.
> I have:
> - added a new MIME type application/segy matching the file name extensions 
> .segy, .seg and .sgy.
> - created a new SEGYParser, matching that MIME type.
> In order to parse the SEG-Y files, I am using a modified version of the 
> sigrun code (available under the Apache license, here: 
> https://github.com/mikhail-aksenov/sigrun). Notably, I have made a fix and 
> changed some method signatures to be able to read from a ReadableByteChannel 
> instead of a FileChannel.
> For the moment I have put it directly into Tika's new segy package. Is 
> this the right thing to do, or should I reference it as an external library, 
> thus modifying the pom.xml?
> Thanks and best regards,
> Giovanni
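
For readers wondering why the description above switches from FileChannel to
ReadableByteChannel: a Tika parser is handed an InputStream rather than a
file, and a stream can be wrapped as a ReadableByteChannel but not as a
FileChannel. The following is only a minimal sketch of that idea, not the
actual sigrun or SEGYParser code. The class name ChannelReadSketch and the
helper readFully are hypothetical; the 3200-byte textual header and 400-byte
binary header sizes are the standard SEG-Y file header sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical illustration only; not taken from sigrun or Tika.
public class ChannelReadSketch {

    // Reads exactly `size` bytes from the channel, or throws EOFException
    // if the channel ends first. Unlike FileChannel, a ReadableByteChannel
    // has no position or size, so data must be consumed sequentially.
    static ByteBuffer readFully(ReadableByteChannel channel, int size) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(size);
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new EOFException("Stream ended after " + buffer.position()
                        + " of " + size + " bytes");
            }
        }
        buffer.flip(); // switch the buffer from writing to reading
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the InputStream a parser would receive.
        byte[] data = new byte[3600];
        data[3200] = 42; // first byte of the binary header

        try (ReadableByteChannel channel =
                     Channels.newChannel(new ByteArrayInputStream(data))) {
            ByteBuffer textHeader = readFully(channel, 3200);  // textual file header
            ByteBuffer binaryHeader = readFully(channel, 400); // binary file header
            System.out.println(textHeader.remaining() + " "
                    + binaryHeader.remaining() + " " + binaryHeader.get(0));
        }
    }
}
```

The trade-off in this approach is that code relying on random access (seeking
to a trace by offset) has to be rewritten to read sequentially, which may be
why some method signatures needed changing.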


