[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127061#comment-15127061 ] Thamme Gowda N commented on TIKA-1816: -- [~talli...@mitre.org] Sure, I will have a look. Correct me if I am wrong (I have been somewhat away from the 2.x discussions): NER is now provided by *tika-parser-advanced-module*, so the tests should be set up over there, am I correct? > Lenient testing for NamedEntityParser > - > > Key: TIKA-1816 > URL: https://issues.apache.org/jira/browse/TIKA-1816 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Tim Allison > Labels: memex > Fix For: 1.12 > > Attachments: TIKA-1816-proxy-fix.patch > > > NamedEntityParser has hard setup requirements, such as downloading NER models > from remote servers and adding them to the classpath. > These model files are huge and hence are not added to source control. > So, the tests are most likely to fail in various environments. > Make the best effort to set up the tests, but in the worst case skip the tests > instead of failing the whole build process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
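The "best effort, otherwise skip" policy described in TIKA-1816 can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name, the enum and the resource path are hypothetical and are not taken from the Tika code base or from the attached patch. In a JUnit 4 test, the SKIP branch would typically be expressed with `Assume.assumeTrue(...)`, which reports the test as skipped rather than failed.

```java
// Hypothetical sketch: decide whether an NER test should run or be skipped,
// depending on whether a model file can be found on the classpath.
final class LenientSetupSketch {

    /** Possible decisions for a test that needs an external model. */
    enum Outcome { RUN, SKIP }

    /** Returns true if the given resource path is visible on the classpath. */
    static boolean modelAvailable(String resourcePath) {
        ClassLoader cl = LenientSetupSketch.class.getClassLoader();
        return cl != null && cl.getResource(resourcePath) != null;
    }

    /** A missing model leads to SKIP, never to a build failure. */
    static Outcome decide(boolean available) {
        return available ? Outcome.RUN : Outcome.SKIP;
    }
}
```

A test setup method could call `decide(modelAvailable(path))` once and skip every test in the class when the result is `SKIP`, so that a missing download does not break the whole build.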
[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX
[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126248#comment-15126248 ] Sam H commented on TIKA-1841: - Hi [~gagravarr], There has been no reaction to this issue in the past 6 days. Can I assume my proposed structure is ok? I have already started implementing this: https://github.com/zetisam/tika/tree/TIKA-1841 The PPT code allows you to get the slide-notes-footer and slide-notes-header separately, but the POI code seems to add these fields to the output anyway, so I don't know if this is of much use. I couldn't find how to do this in PPTX, so maybe this part can be dropped (in order not to have duplicate content). The same goes for slide footers in general. They seem to be added to the content, so having them as a separate div would duplicate this content. Any thoughts? > Different XML output structure for PPT and PPTX > --- > > Key: TIKA-1841 > URL: https://issues.apache.org/jira/browse/TIKA-1841 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sam H > > Issue is slightly related to TIKA-1840 > I've noticed that the XML structure of Powerpoint (PPT) and PPTX files is > different. > The structure for PPTX seems as follows: > {code} > > > //optional > //optional > ... > > > //optional > //optional > {code} > Note that there's no parent slide element to indicate the start and end of > each slide. > For Powerpoint the structure is as follows: > {code} > > > > > //added in TIKA-1840 > > > ... > > > > //added in TIKA-1840 > > > > > {code} > In my application, I'm using XPath to get the desired information. As the > XML structure is different, I have to use different XPath queries depending on whether > the file is PPT (old) or PPTX (new). It would be nice for Tika to return the > same XML for both. > I would propose changing the XML structure to this: > {code} > > > > > //added in TIKA-1840 > > > ... 
> > > > //added in TIKA-1840 > > > > {code} > So, essentially, like the current PPT output, but without the list of notes > at the end (as this is also omitted for PPTX). > On the one hand this generalizes PPT(X) handling, on the other it can break > existing (external) functionality relying on a specific XML output format. > I don't know if this is something the project wants fixed or not. If so, I'm > willing to donate my time.
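To illustrate why a shared structure matters for the XPath use case described in TIKA-1841, here is a minimal sketch using the JDK's built-in XPath support. The sample markup and the class names `slide`, `slide-content` and `slide-notes` are assumptions made for illustration only; they are not confirmed Tika output. Real Tika XHTML also carries the XHTML namespace, which this simplified sample leaves out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: one XPath query that works for any format, provided
// every slide is wrapped in its own parent element.
final class SlideXPathSketch {

    // Assumed unified structure (illustrative, not actual Tika output).
    static final String SAMPLE =
        "<body>"
        + "<div class=\"slide\"><div class=\"slide-content\">First</div>"
        + "<div class=\"slide-notes\">Note A</div></div>"
        + "<div class=\"slide\"><div class=\"slide-content\">Second</div></div>"
        + "</body>";

    /** Returns the text of each slide's content div, in document order. */
    static List<String> slideContents(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@class='slide']/div[@class='slide-content']",
                doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Parsing or XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

With a parent element per slide, the same query serves both PPT and PPTX input, so the caller no longer needs to branch on the file type.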
[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126272#comment-15126272 ] Giovanni Usai commented on TIKA-1843: - Hi Nick, The Sigrun owner has merged my modifications, so we can go on with the integration. Do I have to perform the steps in the guide (http://central.sonatype.org/pages/ossrh-guide.html), or will they be done by you? Thanks, Giovanni > Tika parser for SEG-Y files and new MIME type application/segy > -- > > Key: TIKA-1843 > URL: https://issues.apache.org/jira/browse/TIKA-1843 > Project: Tika > Issue Type: New Feature > Components: mime, parser >Reporter: Giovanni Usai >Priority: Minor > > This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and > .sgy). > The SEG-Y format is used to store seismic data; you can find more information > here: http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM. > I have: > - added a new MIME type application/segy matching the file name extensions > .segy, .seg and .sgy. > - created a new SEGYParser, matching that MIME type. > In order to parse the SEG-Y files, I am using a modified version of the > sigrun code (available under the Apache license, here: > https://github.com/mikhail-aksenov/sigrun). Notably, I have made a fix and > changed some method signatures to be able to read from a ReadableByteChannel > instead of a FileChannel. > For the moment I have put it directly into Tika's new segy package. Is > this the right thing to do, or should I reference it as an external library, thus > modifying the pom.xml? > Thanks and best regards, > Giovanni
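The switch from FileChannel to ReadableByteChannel mentioned in TIKA-1843 matters because a parser usually receives a stream, not a file. A ReadableByteChannel only supports sequential reads and may return fewer bytes than requested, so callers need a read-fully loop. The sketch below is not sigrun's or Tika's code; the class and method names are hypothetical. The 3200-byte size is that of the SEG-Y textual file header, which is followed by a 400-byte binary header.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical sketch: read a fixed-size header from any ReadableByteChannel.
final class ChannelReadSketch {

    /** Size in bytes of the SEG-Y textual file header. */
    static final int TEXT_HEADER_BYTES = 3200;

    /**
     * Reads exactly n bytes, looping because a single read() call on a
     * channel may deliver only part of the requested data.
     */
    static ByteBuffer readFully(ReadableByteChannel channel, int n) {
        ByteBuffer buffer = ByteBuffer.allocate(n);
        try {
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) {
                    throw new EOFException("stream ended after "
                        + buffer.position() + " of " + n + " bytes");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.flip(); // switch the buffer from writing to reading
        return buffer;
    }
}
```

An InputStream can be adapted with `java.nio.channels.Channels.newChannel(inputStream)`, after which `readFully(channel, TEXT_HEADER_BYTES)` returns the textual header regardless of whether the data comes from a file, a network stream or a byte array.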
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126477#comment-15126477 ] Ian Williams commented on TIKA-1845: Tim - thanks for confirming what's in the attachment and for the heads up about the metadata. I've attached a new cutdown example that fails with the same error. Please use this sample for unit tests etc. > Unable to extract content from certain RTFs using tika-server versions since > 1.5 > - > > Key: TIKA-1845 > URL: https://issues.apache.org/jira/browse/TIKA-1845 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6, 1.9, 1.11 > Environment: Windows >Reporter: Ian Williams > Attachments: example-that-fails.rtf > > > I have some patient letters that are RTF documents. When I extract the text > from these documents using tika-server-1.5.jar, it works fine. > However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and > 1.11), it fails with the stack trace and error shown below. > I can provide a sample RTF that is failing. > I wondered whether the error might be related to the following change that > was introduced in 1.6?: > * Made RTFParser's list handling slightly more robust against corrupt > list metadata (TIKA-1305) > It's possible that there is some issue with the RTF documents, but they are > real patient letters and they open in Microsoft Word without any problems. > Many thanks > Ian > Steps to reproduce issue > > 1. HTTP PUT to Tika server using curl: > C:\Downloads\Apache Tika>curl -X PUT --data-binary > @test-anonymised-letter.rtf http://localhost:9998/tika --header > "Content-Type: application/rtf" --header "Accept: text/plain" > --> this works fine when running tika-server-1.5.jar, but fails with > tika-server-1.6.jar > 2. 
Screen capture from the server: > INFO: Starting Apache Tika 1.9 server > Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination > INFO: Setting the server's publish address to be http://localhost:9998/ > Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info > INFO: jetty-8.y.z-SNAPSHOT > Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info > INFO: Started SelectChannelConnector@localhost:9998 > Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main > INFO: Started > Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource > logRequest > INFO: tika (application/rtf) > Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse > WARNING: tika: Text extraction failed > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.rtf.RTFParser@32a6dc > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244) > at > org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321) > at > org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164) > at > org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363) > at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244) > at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117) > at > org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > at > 
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) > at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) > at > org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) > at >
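For anyone reproducing TIKA-1845 from Java rather than curl, the request from step 1 above can be built roughly as follows. This is a sketch that assumes Java 11 or later (for `java.net.http`) and the same local server address as in the report; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: the curl PUT from the bug report as a Java request.
final class TikaPutRequestSketch {

    /** Builds the PUT request; sending it is left to the caller. */
    static HttpRequest buildRequest(byte[] rtfBytes) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
            .header("Content-Type", "application/rtf") // as sent by curl
            .header("Accept", "text/plain")            // request plain text
            .PUT(HttpRequest.BodyPublishers.ofByteArray(rtfBytes))
            .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, where the response body should contain the extracted text when the server succeeds.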
[jira] [Commented] (TIKA-1841) Different XML output structure for PPT and PPTX
[ https://issues.apache.org/jira/browse/TIKA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126532#comment-15126532 ] Nick Burch commented on TIKA-1841: -- Ideally we would break out the header and footer into separate divs/paragraphs within the slide's contents. If you can tweak the code to do that, please do! If only one format makes it easy, do it "right" for that one, and add a TODO for the other. Otherwise, assuming no last-minute objections (e.g. from [~talli...@mitre.org]), go ahead with your plan, and submit a pull request once it's all ready and unit tested!
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126479#comment-15126479 ] Tim Allison commented on TIKA-1845: --- There are two problems that this file reveals.
1) The RTFEmbObjHandler is incorrectly setting the {{CONTENT_TYPE}} to the subtype:
{noformat}
metadata.set(Metadata.CONTENT_TYPE, mediaType.getSubtype());
{noformat}
2) In tika-server's TikaResource, this line
{noformat}
return MediaType.parse(ct);
{noformat}
can return null. If we're not able to parse the {{String ct = metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);}}, we should not return null, we should go through detection.
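The second point in Tim Allison's TIKA-1845 comment, falling back to detection when the declared content type cannot be parsed, can be sketched without any Tika dependency. Everything below is a stand-in written for illustration: `parseOrNull` only mimics the "returns null on an unparseable string" behaviour described for `MediaType.parse`, and the detector is a plain `Supplier`. It is not the actual fix that went into tika-server.

```java
import java.util.Locale;
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical sketch: never hand back null for an unparseable content type;
// fall through to detection instead.
final class ContentTypeFallbackSketch {

    // Rough "type/subtype" shape check, standing in for a real parser.
    private static final Pattern TYPE_SLASH_SUBTYPE =
        Pattern.compile("[\\w.+-]+/[\\w.+-]+");

    /** Returns the normalized base type, or null if it cannot be parsed. */
    static String parseOrNull(String contentType) {
        if (contentType == null) {
            return null;
        }
        String base = contentType.split(";", 2)[0].trim(); // drop parameters
        return TYPE_SLASH_SUBTYPE.matcher(base).matches()
            ? base.toLowerCase(Locale.ROOT)
            : null;
    }

    /** Uses the declared type when parseable, otherwise asks the detector. */
    static String resolve(String declared, Supplier<String> detector) {
        String parsed = parseOrNull(declared);
        return parsed != null ? parsed : detector.get();
    }
}
```

A bare subtype such as `rtf`, which is what the first problem in the comment would write into the metadata, fails the shape check here and therefore triggers detection instead of propagating a null.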
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126506#comment-15126506 ] Tim Allison commented on TIKA-1845: --- my failure on TIKA-1010 to set mime correctly.
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Williams updated TIKA-1845: --- Attachment: example-that-fails.rtf
[jira] [Assigned] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1845: - Assignee: Tim Allison
[jira] [Resolved] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1830. --- Resolution: Fixed Assignee: Tim Allison [~thetaphi], I'm sorry I didn't get this into 1.12. I'd like to blame [snowzilla|https://en.wikipedia.org/wiki/January_2016_United_States_blizzard] for keeping me from my dev environment. [~tilman] and other PDFBox colleagues, thank you for all of your work on this! Fellow tika-devs, this was my first push to the public git repo. Let me know if I botched anything. I realize I still have to make this change to the 2.x branch. > Upgrade to PDFBox 1.8.11 when available > --- > > Key: TIKA-1830 > URL: https://issues.apache.org/jira/browse/TIKA-1830 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.13 > > Attachments: reports_pdfbox_1_8_11-rc1.zip > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1830: -- Fix Version/s: 1.13 > Upgrade to PDFBox 1.8.11 when available > --- > > Key: TIKA-1830 > URL: https://issues.apache.org/jira/browse/TIKA-1830 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > Fix For: 1.13 > > Attachments: reports_pdfbox_1_8_11-rc1.zip > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127009#comment-15127009 ] Tim Allison commented on TIKA-1816: --- [~thammegowda], if you have a chance, would you be willing to try your hand at a patch for the 2.x branch? I'm not having luck. > Lenient testing for NamedEntityParser > - > > Key: TIKA-1816 > URL: https://issues.apache.org/jira/browse/TIKA-1816 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Tim Allison > Labels: memex > Fix For: 1.12 > > Attachments: TIKA-1816-proxy-fix.patch > > > NamedEntityParser has a hard setup requirement like downloading of NER models > from remote servers and adding them to classpath. > These model files are huge and hence are not added to source control. > So, the tests are most likely to fail in various environments. > Make the best effort to set up the tests, but in the worst case skip tests > instead of failing the whole build process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
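The "skip instead of fail" behaviour requested in TIKA-1816 can be illustrated with a small sketch. This is not the patch attached to the issue and not Tika code: it is a Python illustration of the general pattern, and the model path and the helper names `make_ner_test` and `run` are hypothetical. Tika's own tests are written in Java, where JUnit's `Assume` serves a similar purpose; whether the actual fix uses it is not stated in this thread.

```python
import os
import unittest


def make_ner_test(model_path: str) -> type:
    """Create a test case bound to a (hypothetical) NER model file path."""

    class NerModelTest(unittest.TestCase):
        def setUp(self):
            # Best effort: only run the test when the model file is present.
            # A missing model marks the test as skipped rather than failed.
            if not os.path.isfile(model_path):
                self.skipTest(f"NER model not found at {model_path}")

        def test_model_is_not_empty(self):
            self.assertGreater(os.path.getsize(model_path), 0)

    return NerModelTest


def run(model_path: str) -> unittest.TestResult:
    """Run the test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        make_ner_test(model_path)
    )
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With a missing model file the run still counts as successful and the test is reported under `result.skipped`, so the build as a whole is not broken by an environment that cannot download the models.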
[jira] [Created] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
Ian Williams created TIKA-1845:
-------------------------------

Summary: Unable to extract content from certain RTFs using tika-server versions since 1.5
Key: TIKA-1845
URL: https://issues.apache.org/jira/browse/TIKA-1845
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.11, 1.9, 1.6
Environment: Windows
Reporter: Ian Williams

I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing.

I wondered whether the error might be related to the following change that was introduced in 1.6?:
* Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian

Steps to reproduce issue

1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar

2. Screen capture from the server:

INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
    at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
    at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
    at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at
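For anyone who wants to script the reproduction rather than run curl by hand, the PUT request from step 1 of the report can be sketched in Python using only the standard library. The URL and the two headers are taken from the curl command in the report; the function names `build_tika_put` and `extract_text` are illustrative, and decoding the response as UTF-8 is an assumption rather than something the report confirms.

```python
import urllib.request

# Address used in the report: tika-server listening on localhost:9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_put(rtf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) the PUT request issued by the curl command."""
    return urllib.request.Request(
        url,
        data=rtf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/rtf",
            "Accept": "text/plain",
        },
    )


def extract_text(rtf_bytes: bytes, url: str = TIKA_URL) -> str:
    """Send the request to a running server and return the response body.

    urllib raises urllib.error.HTTPError for an error status. Decoding as
    UTF-8 is an assumption about the server's text/plain output.
    """
    with urllib.request.urlopen(build_tika_put(rtf_bytes, url)) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(open("test-anonymised-letter.rtf", "rb").read())` against the 1.5 and 1.6 server jars in turn would be one way to compare the two behaviours described in the report; it requires the server to be running locally.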
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126320#comment-15126320 ] Tim Allison commented on TIKA-1845: --- From the stacktrace, this looks to be related to TIKA-1010. Will take a look shortly. To attach files to JIRA, {{More->Attach Files}}. > Unable to extract content from certain RTFs using tika-server versions since > 1.5 > - > > Key: TIKA-1845 > URL: https://issues.apache.org/jira/browse/TIKA-1845 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6, 1.9, 1.11 > Environment: Windows >Reporter: Ian Williams
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Williams updated TIKA-1845: --- Description: I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine. However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below. I can provide a sample RTF that is failing. I'm not sure how to attach files to this issue so here is a link to an Evernote note containing an example RTF that fails: https://www.evernote.com/shard/s66/sh/4a003611-2400-4959-a1cc-2be5b3efe2cf/284a6f2dd3e0a290 I wondered whether the error might be related to the following change that was introduced in 1.6?: * Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305) It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems. Many thanks Ian Steps to reproduce issue 1. HTTP PUT to Tika server using curl: C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain" --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar 2. 
Screen capture from the server: INFO: Starting Apache Tika 1.9 server Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination INFO: Setting the server's publish address to be http://localhost:9998/ Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info INFO: jetty-8.y.z-SNAPSHOT Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Started SelectChannelConnector@localhost:9998 Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main INFO: Started Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest INFO: tika (application/rtf) Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse WARNING: tika: Text extraction failed org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244) at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321) at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164) at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80) at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83) at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Williams updated TIKA-1845: --- Description: I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine. However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below. I can provide a sample RTF that is failing. I'm not sure how to attach files to this issue so here is a link to an Evernote note containing an example RTF that fails: http://www.evernote.com/l/AEJKADYRJABJWaHMK-Wz7-LPKEpvLdPgopA/ I wondered whether the error might be related to the following change that was introduced in 1.6?: * Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305) It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems. Many thanks Ian Steps to reproduce issue 1. HTTP PUT to Tika server using curl: C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain" --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar 2. 
Screen capture from the server: INFO: Starting Apache Tika 1.9 server Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination INFO: Setting the server's publish address to be http://localhost:9998/ Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info INFO: jetty-8.y.z-SNAPSHOT Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Started SelectChannelConnector@localhost:9998 Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main INFO: Started Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest INFO: tika (application/rtf) Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse WARNING: tika: Text extraction failed org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244) at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321) at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164) at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117) at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80) at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83) at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Williams updated TIKA-1845: --- Attachment: (was: test-anonymised-letter.rtf) > Unable to extract content from certain RTFs using tika-server versions since > 1.5 > - > > Key: TIKA-1845 > URL: https://issues.apache.org/jira/browse/TIKA-1845 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6, 1.9, 1.11 > Environment: Windows >Reporter: Ian Williams
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126317#comment-15126317 ] Nick Burch commented on TIKA-1845: -- Near the top of the jira page are some buttons, please hit "More" then "Attach Files", and then upload the smallest file you have which triggers the issue. We can then use that for investigating, testing and (hopefully!) later unit testing of fixes. > Unable to extract content from certain RTFs using tika-server versions since > 1.5 > - > > Key: TIKA-1845 > URL: https://issues.apache.org/jira/browse/TIKA-1845 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6, 1.9, 1.11 > Environment: Windows >Reporter: Ian Williams
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126354#comment-15126354 ]

Ian Williams commented on TIKA-1845:

I've deleted the attachment for the time being - sorry. Please contact me directly for a sample. The reason is that I don't know what's in the embedded file within the RTF.

> Unable to extract content from certain RTFs using tika-server versions since 1.5
>
>                 Key: TIKA-1845
>                 URL: https://issues.apache.org/jira/browse/TIKA-1845
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.6, 1.9, 1.11
>         Environment: Windows
>            Reporter: Ian Williams
>
> I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine.
> However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.
> I can provide a sample RTF that is failing. I'm not sure how to attach files to this issue so here is a link to an Evernote note containing an example RTF that fails:
> http://www.evernote.com/l/AEJKADYRJABJWaHMK-Wz7-LPKEpvLdPgopA/
> I wondered whether the error might be related to the following change that was introduced in 1.6?:
> * Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305)
> It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.
> Many thanks
> Ian
>
> Steps to reproduce issue
>
> 1. HTTP PUT to Tika server using curl:
> C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
> --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
> 2. Screen capture from the server:
> INFO: Starting Apache Tika 1.9 server
> Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
> INFO: Setting the server's publish address to be http://localhost:9998/
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: jetty-8.y.z-SNAPSHOT
> Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Started SelectChannelConnector@localhost:9998
> Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
> INFO: Started
> Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (application/rtf)
> Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
>     at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
>     at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
>     at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
>     at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>     at
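For anyone who wants to reproduce this without curl, the PUT request from step 1 can be approximated in plain Java. This is only a minimal sketch using JDK classes: the class name TikaPutClient, the Result holder and the method putRtf are hypothetical names made up for illustration and are not part of Tika, and decoding the response as UTF-8 is an assumption. The URL and the two headers are the ones given in the report.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
public class TikaPutClient {

    // Simple holder for the HTTP status code and the response text.
    public static final class Result {
        public final int status;
        public final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Sends the RTF bytes with the same method and headers as the curl command.
    public static Result putRtf(String endpoint, byte[] rtf) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/rtf");
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(rtf);
        }
        int status = conn.getResponseCode();
        // On 4xx/5xx the body, if any, is on the error stream, which may be null.
        InputStream stream = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        String body = "";
        if (stream != null) {
            try (InputStream in = stream) {
                // Assumption: the server answers in UTF-8.
                body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        conn.disconnect();
        return new Result(status, body);
    }
}
```

Calling TikaPutClient.putRtf("http://localhost:9998/tika", Files.readAllBytes(Paths.get("test-anonymised-letter.rtf"))) against each server version and comparing the returned status would show whether the failure depends on the client at all.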
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126375#comment-15126375 ]

Ian Williams commented on TIKA-1845:

Just being cautious because I don't want to share anything in a public forum that isn't 100% anonymised, and I don't know what's in that embedded file within the RTF (doesn't show up within the document in Word). Possibly a logo or something like that.
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Williams updated TIKA-1845:

    Description:
I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine.
However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.
I can provide a sample RTF that is failing.
I wondered whether the error might be related to the following change that was introduced in 1.6?:
* Made RTFParser's list handling slightly more robust against corrupt list metadata (TIKA-1305)
It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.
Many thanks
Ian

Steps to reproduce issue

1. HTTP PUT to Tika server using curl:
C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"
--> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
    at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
    at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
    at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
    at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at
[jira] [Updated] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Williams updated TIKA-1845:

    Attachment: test-anonymised-letter.rtf
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126340#comment-15126340 ]

Ian Williams commented on TIKA-1845:

OK - thanks. I've attached the file now.
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126366#comment-15126366 ]

Tim Allison commented on TIKA-1845:

Scooped it from evernote. Let me know if I should srm it.
[jira] [Commented] (TIKA-1845) Unable to extract content from certain RTFs using tika-server versions since 1.5
[ https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126429#comment-15126429 ]

Tim Allison commented on TIKA-1845:

Looks like there is no trouble with the tika-app with straight extraction or with recursive json -J. If you use the -z option, you'll see the attachment is a wmf file, looks like a blank, black banner. There may be items in the metadata, though, that are sensitive (name of original author and organization).

As I look more closely at the stacktrace, the detector is returning "null" in the AutoDetectParser with tika-server, but not with tika-app. I can trigger this stacktrace in our unit tests. Yay!

So, I think TIKA-1010 made this problem visible, but this looks like something is going wrong with the detector in tika-server's AutoDetectParser's detector.
[jira] [Commented] (TIKA-1843) Tika parser for SEG-Y files and new MIME type application/segy
[ https://issues.apache.org/jira/browse/TIKA-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126450#comment-15126450 ]

Nick Burch commented on TIKA-1843:

Ideally you'd work with the Sigrun owner to have them do it - it's best if the people who "own" the code and "do" the releases are also the ones who push the files to Maven central. (Doesn't have to be, there is the third party process, but it's certainly preferred)

If I were you, I'd review the docs, then suggest any POM fixes to them. Once those are in, work with the Sigrun team to get them to request their access + get things uploaded

If you need an example project to crib from for the pom, my own https://github.com/Gagravarr/VorbisJava/blob/master/parent/pom.xml is one place you could start (amongst others!)

> Tika parser for SEG-Y files and new MIME type application/segy
>
>                 Key: TIKA-1843
>                 URL: https://issues.apache.org/jira/browse/TIKA-1843
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime, parser
>            Reporter: Giovanni Usai
>            Priority: Minor
>
> This ticket refers to the parsing of SEG-Y files (extensions .seg, .segy and .sgy).
> The SEG-Y format is used to store seismic data, you can find more information here http://pubs.usgs.gov/of/2001/of01-326/HTML/FILEFORM.HTM.
> I have:
> - added a new MIME type application/segy matching the file name extensions .segy, .seg and .sgy.
> - created a new SEGYParser, matching that MIME type.
> In order to parse the SEG-Y files, I am using a modified version of the sigrun code (available under Apache license, here https://github.com/mikhail-aksenov/sigrun). Notably I have done a fix and changed some method signatures to be able to read from a ReadableByteChannel instead of FileChannel.
> For the moment I have put it directly into the new Tika's segy package. Is this the right thing to do or should I reference it as external library thus modifying the pom.xml?
> Thanks and best regards,
> Giovanni

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
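The signature change mentioned in the ticket (reading from a ReadableByteChannel instead of a FileChannel) can be illustrated with a small sketch. This is not the sigrun or Tika code: the class ChannelHeaderReader and its methods are hypothetical names, and only JDK classes are used. The one format fact relied on is that a SEG-Y file starts with a 3200-byte textual header. The point is that any InputStream can be wrapped in a channel, so the reader no longer needs a file on disk.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper for illustration only; not taken from sigrun or Tika.
public class ChannelHeaderReader {

    // A SEG-Y file begins with a 3200-byte textual header.
    public static final int TEXTUAL_HEADER_BYTES = 3200;

    // Reads exactly `length` bytes. A single read() on a generic channel may
    // return fewer bytes than requested, so loop until the buffer is full.
    // Assumes a blocking channel (a non-blocking one could return 0 repeatedly).
    public static ByteBuffer readFully(ReadableByteChannel channel, int length)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new EOFException(
                        "Expected " + length + " bytes, got " + buffer.position());
            }
        }
        buffer.flip();
        return buffer;
    }

    // Wraps a plain InputStream in a channel, so no FileChannel is required.
    public static ByteBuffer readTextualHeader(InputStream in) throws IOException {
        ReadableByteChannel channel = Channels.newChannel(in);
        return readFully(channel, TEXTUAL_HEADER_BYTES);
    }
}
```

A caller that only has a stream could use ChannelHeaderReader.readTextualHeader(stream) and then decode the returned buffer; the decoding step is left out here because it depends on the library doing the actual SEG-Y parsing.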