[jira] [Commented] (TIKA-1126) text/html producer for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667707#comment-13667707 ]

Ali Mosavian commented on TIKA-1126:

Thanks Dave!

text/html producer for tika-server
Key: TIKA-1126
URL: https://issues.apache.org/jira/browse/TIKA-1126
Project: Tika
Issue Type: Improvement
Components: server
Affects Versions: 1.4
Reporter: Ali Mosavian
Priority: Trivial
Fix For: 1.4
Attachments: tika_server_html_output.patch

The /tika resource handler of tika-server can only produce text/plain. This patch adds support for producing text/html.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
tika pull request: Similar to TIKA-1126, this commit adds the ability to pr...
GitHub user stdexcept opened a pull request:

https://github.com/apache/tika/pull/3

Similar to TIKA-1126, this commit adds the ability to produce text/xml t...

...o tika-server.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stdexcept/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/3.patch

commit 59db45256f9c2c791894e0b9d11e7c6d4f7ce78d
Author: alim ali.mosav...@euroling.se
Date: 2013-05-27T14:22:49Z

Similar to TIKA-1126, this commit adds the ability to produce text/xml to tika-server.
[jira] [Updated] (TIKA-1046) Get java.util.zip.ZipException: unknown compression method when indexing ppf97-file containing wmf-image
[ https://issues.apache.org/jira/browse/TIKA-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1046: Component/s: parser Get java.util.zip.ZipException: unknown compression method when indexing ppf97-file containing wmf-image -- Key: TIKA-1046 URL: https://issues.apache.org/jira/browse/TIKA-1046 Project: Tika Issue Type: Bug Components: parser Reporter: Olof Jonasson Attachments: ppt2000_working.ppt, ppt2010_working.ppt, ppt97_failing.ppt With solr4.0 and tika1.2 we get an exeption when trying to index a powerpoint file that contains a specific .wmf-image. As it seems, the powerpoint file must be created in Office97 (or older?) to generate the error, since re-saving the file in Office2000 or Office2010 makes the problem go away. Full stacktrace from the solr-server below: 2012-dec-19 14:39:46 org.apache.solr.common.SolrException log ALLVARLIG: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@12f195 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@12f195 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 19 more Caused by: org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: unknown compression method at org.apache.poi.hslf.blip.WMF.getData(WMF.java:65) at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:204) at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:162) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 
23 more Caused by: java.util.zip.ZipException: unknown compression method at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:147) at java.io.FilterInputStream.read(FilterInputStream.java:90) at org.apache.poi.hslf.blip.WMF.getData(WMF.java:59) ... 28 more 2012-dec-19 14:39:46 org.apache.solr.common.SolrException log ALLVARLIG: null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from
[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1079:
Component/s: parser

Word document hits AIOOBE in SummaryExtractor.parseSummaries
Key: TIKA-1079
URL: https://issues.apache.org/jira/browse/TIKA-1079
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Fix For: 1.4
Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc

I'm not yet sure if this is a corrupted document (though, MS Word opens it just fine) or a bug in POI ... but I hit this exc when running it through TikaCLI:

{noformat}
java.lang.ArrayIndexOutOfBoundsException: -1
 at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161)
 at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
 at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
 at org.apache.poi.hpsf.Property.init(Property.java:164)
 at org.apache.poi.hpsf.Section.init(Section.java:277)
 at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
 at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246)
 at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
 at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
{noformat}
[jira] [Updated] (TIKA-1067) Tika extracts non-existent asterisks (*) from .ppt files
[ https://issues.apache.org/jira/browse/TIKA-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1067:
Component/s: parser

Tika extracts non-existent asterisks (*) from .ppt files
Key: TIKA-1067
URL: https://issues.apache.org/jira/browse/TIKA-1067
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless

I created a new blank presentation, put in title + subtitle, saved it as .ppt, and then ran TikaCLI -t:

{noformat}
<body><div class="slideShow"><div class="slide"><p class="slide-master-content">*<br/>
*<br/>
</p>
<p class="slide-content">Testing<br/>
testing<br/>
</p>
</div>
</div>
<div class="slideNotes"/>
{noformat}

The two extra *'s seem to be coming from the master slide, but I'm not sure which text runs they are and how to stop them ...
[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1078:
Component/s: parser, cli

TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
Key: TIKA-1078
URL: https://issues.apache.org/jira/browse/TIKA-1078
Project: Tika
Issue Type: Bug
Components: cli, parser
Reporter: Michael McCandless
Fix For: 1.4
Attachments: T-DS_Excel2003-PPT2003_1.xls

Attached document hits this on Windows:

{noformat}
C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
 at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
 at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
 at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
 at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
{noformat}

TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine.

I think somehow ... we have to sanitize the embedded file name ...
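The sanitizing suggested in TIKA-1078 could look roughly like the sketch below. It is only an illustration, not the fix that went into TikaCLI: the class and method names (`EmbeddedNameSanitizer`, `sanitizeSegment`, `sanitizePath`) are hypothetical, and the character set is an assumption based on the characters Windows commonly rejects in file names. Each path segment of the embedded name is cleaned separately so that a sub-directory such as `MBD0016BDE4` is preserved.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: replaces characters that Windows
// rejects in file names so an embedded document can be written to disk.
final class EmbeddedNameSanitizer {
    // Assumed set of problem characters: < > : " | ? * and ASCII control characters.
    private static final Pattern ILLEGAL = Pattern.compile("[<>:\"|?*\\x00-\\x1F]");

    // Cleans a single path segment; never returns an empty string.
    static String sanitizeSegment(String segment) {
        String cleaned = ILLEGAL.matcher(segment).replaceAll("_");
        return cleaned.isEmpty() ? "_" : cleaned;
    }

    // Splits the embedded name on '/' or '\', drops empty, "." and ".." segments
    // (so the result cannot escape the output directory) and cleans the rest.
    static String sanitizePath(String embeddedName) {
        StringBuilder out = new StringBuilder();
        for (String segment : embeddedName.split("[/\\\\]+")) {
            if (segment.isEmpty() || segment.equals(".") || segment.equals("..")) {
                continue;
            }
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(sanitizeSegment(segment));
        }
        return out.length() == 0 ? "_" : out.toString();
    }

    public static void main(String[] args) {
        // The embedded name from the report, written with Unicode escapes.
        System.out.println(sanitizePath("MBD0016BDE4/?\u00A3\u263A.bin"));
    }
}
```

The sketch does not handle every Windows restriction (reserved device names such as CON, or trailing dots and spaces, are left alone), so a real fix would likely need more cases.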
[jira] [Updated] (TIKA-1111) Class loading issues when running in OSGi environment
[ https://issues.apache.org/jira/browse/TIKA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1111:
Component/s: packaging

Class loading issues when running in OSGi environment
Key: TIKA-1111
URL: https://issues.apache.org/jira/browse/TIKA-1111
Project: Tika
Issue Type: Bug
Components: packaging
Affects Versions: 1.3
Environment: Tika 1.3 (tika-core and tika-bundle OSGi bundles), Felix 2.0.5
Reporter: Niels Beekman

When dom4j is on the system classpath, a class loading error occurs during detection of Office Open XML files:

java.lang.ExceptionInInitializerError
 at org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.<clinit>(PackagePropertiesUnmarshaller.java:49)
 at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:154)
 at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:141)
 at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
 at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:99)
 at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:207)
 at org.apache.tika.parser.pkg.ZipContainerDetector.detectOfficeOpenXML(ZipContainerDetector.java:194)
 at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:134)
 at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:77)
 at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
 at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
 at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassCastException: org.dom4j.DocumentFactory cannot be cast to org.dom4j.DocumentFactory
 at org.dom4j.DocumentFactory.getInstance(DocumentFactory.java:97)
 at org.dom4j.tree.AbstractNode.<clinit>(AbstractNode.java:39)
 ... 14 more

As a workaround (maybe a solution), I modified the context classloader when running the detection (wrapped the detector and parser). This appears to be the common fix for dom4j, as it uses the context classloader during initialization.

Ideally, the detectors and parsers would be running with their original loader (from ServiceLoader) as context class loader.
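The workaround described in TIKA-1111 (switching the thread context class loader around the detection and parsing calls) can be sketched with plain JDK classes. This is an assumption about the general shape of such a wrapper, not the reporter's actual code; `ContextClassLoaderSwap` and `callWith` are hypothetical names, and in a real OSGi setup the loader passed in would presumably be the one that loaded the Tika bundle classes.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: runs a task with a specific thread context class
// loader and always restores the previous loader afterwards.
final class ContextClassLoaderSwap {
    static <T> T callWith(ClassLoader loader, Callable<T> task) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return task.call();
        } finally {
            // Restore even if the task throws, so other code on this thread is unaffected.
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for "the loader that loaded the parser classes".
        ClassLoader loader = ContextClassLoaderSwap.class.getClassLoader();
        ClassLoader seen = callWith(loader, () -> Thread.currentThread().getContextClassLoader());
        System.out.println(seen == loader);
    }
}
```

A wrapped detector or parser would place its delegate call inside the task, so that libraries which consult the context class loader during initialization, as the report says dom4j does, resolve classes from the intended loader.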
[jira] [Updated] (TIKA-1045) Unsupported AutoCAD drawing version: AC1014
[ https://issues.apache.org/jira/browse/TIKA-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1045:
Component/s: parser

Unsupported AutoCAD drawing version: AC1014
Key: TIKA-1045
URL: https://issues.apache.org/jira/browse/TIKA-1045
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Reporter: Szasz Tamas
Attachments: autocad_example

[#|2012-12-19T15:35:24.297+0100|SEVERE|glassfish3.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=Thread-1;|org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: AC1014
 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
 at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
 at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
 at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:98)
 at com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:91)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:162)
 at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:326)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:227)
 at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:170)
 at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:822)
 at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:719)
 at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1013)
 at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:225)
 at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
 at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
 at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
 at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
 at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
 at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
 at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
 at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
 at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
 at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: AC1014
 at org.apache.tika.parser.dwg.DWGParser.parse(DWGParser.java:126)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 ... 32 more
[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-988:
Component/s: parser

We don't extract a placeholder for a Word document embedded in an Excel document
Key: TIKA-988
URL: https://issues.apache.org/jira/browse/TIKA-988
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Michael McCandless
Fix For: 1.4
Attachments: bug31373.xls

In TIKA-956 we fixed the Word parser so that at the point where an embedded document appears, we output a <div class="embedded" id="_XXX"/> tag. It would be nice to do this for documents embedded in Excel too.
[jira] [Updated] (TIKA-1102) Can we add div to the list of heuristics for bad html fragments?
[ https://issues.apache.org/jira/browse/TIKA-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1102:
Component/s: parser

Can we add div to the list of heuristics for bad html fragments?
Key: TIKA-1102
URL: https://issues.apache.org/jira/browse/TIKA-1102
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.2, 1.3
Environment: I'm using Solr 4.0 final with tika v1.2 and ManifoldCF v1.2dev all on tomcat 7.0.37
Reporter: David Morana

Good morning,

Crawling legacy sites with poorly written html fragments causes severe Solr Xml parse errors and in turn causes ManifoldCF to abort. Can we add div to the list of heuristics so the html parser is used instead of the xml parser?

See this ticket for further information: [TIKA-1101|https://issues.apache.org/jira/browse/TIKA-1101]

Thank you,
[jira] [Updated] (TIKA-1108) Represent individual slides in pptx
[ https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1108:
Component/s: parser

Represent individual slides in pptx
Key: TIKA-1108
URL: https://issues.apache.org/jira/browse/TIKA-1108
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Daniel Bonniot de Ruisselet
Fix For: 1.4

When parsing ppt, tika produces for each slide: <div class="slide">

However for pptx these seem to be missing, all the text is directly under body.
[jira] [Updated] (TIKA-1107) Can't parse velocity file
[ https://issues.apache.org/jira/browse/TIKA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1107: Component/s: parser Can't parse velocity file - Key: TIKA-1107 URL: https://issues.apache.org/jira/browse/TIKA-1107 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: openjdk-1.7.0_17 Reporter: Jorge Urdaneta Attachments: events-detail.vtl When I parse some VTL (velocity) files I get an error 2013-04-15 22:39:56,488 ERROR com.dotcms.tika.TikaUtils - Could not parse file metadata for file : /home/jorgeu/dotcms/dotcms/tomcat/webapps/../../dotCMS/assets/5/a/5a533adc-818f-4f55-a448-622bb90b576c/fileAsset/events-detail.vtl org.apache.tika.exception.TikaException: XML parse error at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at com.dotcms.tika.TikaUtils.getMetaDataMap(TikaUtils.java:41) at com.dotcms.tika.TikaUtils.getMetaDataMap(TikaUtils.java:85) at com.dotmarketing.portlets.fileassets.business.FileAssetAPIImpl.getMetaDataMap(FileAssetAPIImpl.java:202) at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.checkin(ESContentletAPIImpl.java:2409) at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.checkin(ESContentletAPIImpl.java:1946) at com.dotmarketing.portlets.contentlet.business.ContentletAPIInterceptor.checkin(ContentletAPIInterceptor.java:169) at com.dotmarketing.portlets.contentlet.business.web.ContentletWebAPIImpl._saveWebAsset(ContentletWebAPIImpl.java:495) at com.dotmarketing.portlets.contentlet.business.web.ContentletWebAPIImpl.saveContent(ContentletWebAPIImpl.java:129) at com.dotmarketing.portlets.contentlet.ajax.ContentletAjax.saveContent(ContentletAjax.java:1321) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.directwebremoting.impl.CreatorModule$1.doFilter(CreatorModule.java:229) at org.directwebremoting.impl.CreatorModule.executeMethod(CreatorModule.java:241) at org.directwebremoting.impl.DefaultRemoter.execute(DefaultRemoter.java:379) at org.directwebremoting.impl.DefaultRemoter.execute(DefaultRemoter.java:332) at org.directwebremoting.dwrp.BaseCallHandler.handle(BaseCallHandler.java:104) at org.directwebremoting.servlet.UrlProcessor.handle(UrlProcessor.java:120) at org.directwebremoting.servlet.DwrServlet.doPost(DwrServlet.java:141) at javax.servlet.http.HttpServlet.service(HttpServlet.java:637) at javax.servlet.http.HttpServlet.service(HttpServlet.java:717) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.tuckey.web.filters.urlrewrite.UrlRewriteFilter.doFilter(UrlRewriteFilter.java:404) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at com.dotmarketing.filters.CMSFilter.doFilter(CMSFilter.java:122) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at com.dotmarketing.filters.AutoLoginFilter.doFilter(AutoLoginFilter.java:61) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at 
com.dotmarketing.filters.CacheImagesFilter.doFilter(CacheImagesFilter.java:47) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at com.dotmarketing.cms.urlmap.filters.URLMapFilter.doFilter(URLMapFilter.java:87) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at
[jira] [Updated] (TIKA-1057) document content property Status is not extracted for *.doc files
[ https://issues.apache.org/jira/browse/TIKA-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1057:
Component/s: parser

document content property Status is not extracted for *.doc files
Key: TIKA-1057
URL: https://issues.apache.org/jira/browse/TIKA-1057
Project: Tika
Issue Type: Bug
Components: parser
Environment: java 1.5/1.6 / Windows 7
Reporter: Thomas Stroeter
Priority: Minor

I would like to use Tika to extract the document property Status from a Word 97-2003 *.doc file. Tika dumps the document status property correctly from the XML-based *.docx files as Content-Status and cp:contentStatus, but I cannot extract this metadata from *.doc Word documents using Tika. Word 2010, however, has no problem setting and reading that document metadata in a *.doc file.

Is there a way to extract this information with Tika for *.doc files, too?
[jira] [Updated] (TIKA-978) OSGi bundle build fails if space exists in build path
[ https://issues.apache.org/jira/browse/TIKA-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-978:
Component/s: packaging

OSGi bundle build fails if space exists in build path
Key: TIKA-978
URL: https://issues.apache.org/jira/browse/TIKA-978
Project: Tika
Issue Type: Bug
Components: packaging
Reporter: Ken Krugler
Priority: Minor

While trying to replicate TIKA-997, I copied the Tika 1.2 source release to /Volumes/Ken Backup/. Tika parent/core/parsers/XMP/application built fine, but the OSGi bundle build failed a test - something doesn't like a space in the path to the tika-core.jar file:

Running org.apache.tika.bundle.BundleIT
35 [main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System (Version: 2.2.0) created.
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.152 sec FAILURE!

Results :

Tests in error:
 initializationError(org.apache.tika.bundle.BundleIT): Illegal character in path at index 17: file:/Volumes/Ken Backup/tika-1.2/tika-bundle/target/test-bundles/tika-core.jar

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0

[INFO]
[INFO] --- maven-failsafe-plugin:2.10:verify (default) @ tika-bundle ---
[INFO] Failsafe report directory: /Volumes/Ken Backup/tika-1.2/tika-bundle/target/failsafe-reports
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ........ SUCCESS [2.218s]
[INFO] Apache Tika core .......... SUCCESS [19.498s]
[INFO] Apache Tika parsers ....... SUCCESS [1:00.914s]
[INFO] Apache Tika XMP ........... SUCCESS [1.895s]
[INFO] Apache Tika application ... SUCCESS [13.102s]
[INFO] Apache Tika OSGi bundle ... FAILURE [18.073s]
[INFO] Apache Tika server ........ SKIPPED
[INFO] Apache Tika ............... SKIPPED
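The error text in TIKA-978 has the same shape as the message of a java.net.URISyntaxException, which suggests (this is an inference, not something the report states) that somewhere in the test setup a file path was concatenated into a URI string without escaping. The sketch below reproduces that behaviour with plain JDK classes; `SpaceInPathUri` and `failingIndex` are hypothetical names used only for illustration.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo: shows why an unescaped space breaks URI parsing and
// how File.toURI() avoids the problem by percent-encoding the path.
final class SpaceInPathUri {
    // Returns the index of the offending character, or -1 if the string parses.
    static int failingIndex(String rawUri) {
        try {
            new URI(rawUri);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String path = "/Volumes/Ken Backup/tika-1.2/tika-bundle/target/test-bundles/tika-core.jar";
        // Naive concatenation keeps the raw space, which java.net.URI rejects.
        System.out.println(failingIndex("file:" + path));
        // File.toURI() escapes the space as %20.
        System.out.println(new File(path).toURI());
    }
}
```

For the path from the report, the rejected character is the space after "Ken", which sits at index 17 of the "file:" string, matching the index in the test failure.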
[jira] [Updated] (TIKA-993) Language Detection Fault
[ https://issues.apache.org/jira/browse/TIKA-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-993:
Component/s: languageidentifier

Language Detection Fault
Key: TIKA-993
URL: https://issues.apache.org/jira/browse/TIKA-993
Project: Tika
Issue Type: Bug
Components: languageidentifier
Reporter: Iman Reihanian
Attachments: DetectorImpl.java

This text is in English, but it is detected as Italian.
[jira] [Updated] (TIKA-1076) Upgrade to Apache POI 3.9
[ https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1076:
Fix Version/s: (was: 1.4) 1.5

- push to 1.5, get ready for 1.4 RC #1.

Upgrade to Apache POI 3.9
Key: TIKA-1076
URL: https://issues.apache.org/jira/browse/TIKA-1076
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Nick Burch
Fix For: 1.5

We should upgrade to Apache POI 3.9, which is the latest version
[jira] [Updated] (TIKA-961) No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)
[ https://issues.apache.org/jira/browse/TIKA-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-961:
Fix Version/s: (was: 1.4) 1.5

- push to 1.5, get ready for 1.4 RC #1.

No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)
Key: TIKA-961
URL: https://issues.apache.org/jira/browse/TIKA-961
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
Assignee: Ken Krugler
Fix For: 1.5
Attachments: TIKA-961-1.3-1.patch, TIKA-961-1.3-2.patch, TIKA-961-1.3-3.patch

ignorableWhitespace is not properly added when using the BoilerpipeContentHandler and markup is included.
[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-539: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Fix For: 1.5 Attachments: TIKA-539_2.patch, TIKA-539.patch If the encoding in the meta tag is wrong, that encoding is still detected, even when the correct encoding was set in the metadata beforehand (for example, from the HTTP response header). Test code to reproduce:
static String content = "<html><head>\n" + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />" + "</head><body>Über den Wolken\n</body></html>";
/** * @param args * @throws IOException * @throws TikaException * @throws SAXException */
public static void main(String[] args) throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    // The correct encoding is declared up front, as an HTTP response header would do.
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(1);
    parser.parse(in, h, metadata, new ParseContext());
    System.out.print(h.toString());
    // Prints the encoding Tika settled on after parsing.
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}
[jira] [Updated] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)
[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1109: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Metadata not extracted before the context in OOXML (pptx) - Key: TIKA-1109 URL: https://issues.apache.org/jira/browse/TIKA-1109 Project: Tika Issue Type: Bug Components: parser Reporter: Daniel Bonniot de Ruisselet Priority: Critical Fix For: 1.5 It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first. As a symptom: java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx outputs only the following metadata: <meta name="Content-Length" content="36518"/> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/> <meta name="resourceName" content="testPPT.pptx"/> while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-715: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Some parsers produce non-well-formed XHTML SAX events - Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Michael McCandless Fix For: 1.5 Attachments: TIKA-715.patch With TIKA-683 I committed simple, commented-out code to SafeContentHandler to verify that the SAX events produced by the parser have valid (matched) tags, i.e. that each startElement(foo) is matched by the closing endElement(foo). I only did a basic nesting test, plus checking that <p> is never embedded inside another <p>; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be that these are relatively minor offenses (e.g. closing a table without closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures: {noformat} testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR! 
java.lang.AssertionError: end tag=body with no startElement at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158) testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR! java.lang.AssertionError: mismatched elements open=tr close=table at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287) at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136) at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) 
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) at javax.xml.parsers.SAXParser.parse(SAXParser.java:395) at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190) at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49) testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
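The kind of check described in TIKA-715 can be sketched with the JDK's SAX classes alone. This is only an illustration under stated assumptions, not the code in SafeContentHandler: the class name MatchedTagChecker and its check helper are hypothetical, and the error messages simply mirror the assertion text quoted in the failures above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: verifies that SAX start/end events are properly nested.
public class MatchedTagChecker extends DefaultHandler {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // A p element must not appear anywhere inside another p element.
        if ("p".equals(qName) && open.contains("p")) {
            throw new SAXException("p nested inside another p");
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed elements: " + open);
        }
    }

    // Replays events written as "+name" (start) or "-name" (end).
    // Returns null when the sequence is well formed, otherwise the error message.
    public static String check(String... events) {
        MatchedTagChecker checker = new MatchedTagChecker();
        try {
            checker.startDocument();
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, null);
                } else {
                    checker.endElement("", name, name);
                }
            }
            checker.endDocument();
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }
}
```

Calling MatchedTagChecker.check("+table", "+tr", "-table") reproduces the Keynote-style failure, where a table is closed while a tr is still open. Wrapping a real parser's output handler in such a checker during tests would be one way to find the offending parsers.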
[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-980: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. MicrodataContentHandler for Apache Tika --- Key: TIKA-980 URL: https://issues.apache.org/jira/browse/TIKA-980 Project: Tika Issue Type: New Feature Components: parser Reporter: Markus Jelsma Assignee: Ken Krugler Fix For: 1.5 Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch ContentHandler for Apache Tika capable of building a data structure containing Microdata item scopes and item properties. The Item* classes are borrowed from the Apache Any23 project and are slightly modified to accommodate this SAX-based extractor vs. the original DOM-based extractor. The provided unit test outputs two item scopes about the Europe and NA ApacheCon events, and each has a nested property.
[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-817: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. (PPT/PPTX) Missing date/time in text content. - Key: TIKA-817 URL: https://issues.apache.org/jira/browse/TIKA-817 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.0 Environment: Win7-64 + java version 1.6.0_26 Reporter: Albert L. Fix For: 1.5 Missing date/time text in text content for PPT and PPTX files. The date and time are missing from the text content. This occurs when one chooses the following with MS-PowerPoint 2010: 1) Insert 2) Date Time 3) Update automatically 4) save to PPT or PPTX -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-820) Locator is unset for HTML parser
[ https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-820: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Locator is unset for HTML parser Key: TIKA-820 URL: https://issues.apache.org/jira/browse/TIKA-820 Project: Tika Issue Type: Bug Components: general, parser Affects Versions: 1.0 Reporter: Daniel Bonniot de Ruisselet Assignee: Ken Krugler Labels: patch Fix For: 1.5 Attachments: text-locator.patch The HtmlParser does not call setDocumentLocator(Locator locator) on the user's content handler. Patch and unit test attached. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
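The behaviour requested in TIKA-820 follows the SAX contract: a parser that supplies a locator hands it to the content handler before any other event. The sketch below uses only the JDK's org.xml.sax classes; LocatorSketch, RecordingHandler and emitDocument are hypothetical names for illustration and are not part of HtmlParser or the attached patch.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical sketch of a parser passing a Locator to the user's handler.
public class LocatorSketch {

    // Stand-in for a user's content handler that wants position information.
    static class RecordingHandler extends DefaultHandler {
        private Locator locator;
        int lineSeenAtStart = -1;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startDocument() {
            // Stays at -1 if the parser never supplied a locator.
            if (locator != null) {
                lineSeenAtStart = locator.getLineNumber();
            }
        }
    }

    // Stand-in for the parser side: the locator is set before startDocument().
    static void emitDocument(ContentHandler handler) throws SAXException {
        LocatorImpl locator = new LocatorImpl();
        locator.setLineNumber(1);
        locator.setColumnNumber(1);
        handler.setDocumentLocator(locator);
        handler.startDocument();
        handler.endDocument();
    }

    // Returns the line number the handler saw at startDocument(), or -1 on failure.
    static int demo() {
        RecordingHandler handler = new RecordingHandler();
        try {
            emitDocument(handler);
        } catch (SAXException e) {
            return -1;
        }
        return handler.lineSeenAtStart;
    }
}
```

A handler written this way degrades gracefully: if the parser never calls setDocumentLocator, which is the situation the issue reports, the recorded line simply stays at -1.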
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1106: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: general Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Priority: Minor Labels: entity, geospatial Fix For: 1.5 I've been evaluating CLAVIN as a way to extract location information from unstructured text. It seems like meshing it with Tika in some way would make a lot of sense. From CLAVIN website... {quote} CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply look up location names; rather, it uses intelligent heuristics in an attempt to identify precisely which Springfield (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data. {quote} There was only one other instance of the word clavin mentioned in the ASF jira site so I thought it was definitely worth posting here. https://github.com/Berico-Technologies/CLAVIN -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1108) Represent individual slides in pptx
[ https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1108: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Represent individual slides in pptx --- Key: TIKA-1108 URL: https://issues.apache.org/jira/browse/TIKA-1108 Project: Tika Issue Type: Improvement Components: parser Reporter: Daniel Bonniot de Ruisselet Fix For: 1.5 When parsing ppt, Tika produces for each slide: <div class="slide">. However, for pptx these seem to be missing; all the text is directly under <body>.
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-776: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. ExifTool Embedder - Key: TIKA-776 URL: https://issues.apache.org/jira/browse/TIKA-776 Project: Tika Issue Type: New Feature Components: metadata Affects Versions: 1.0 Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: embed, exiftool, patch Fix For: 1.5 Attachments: tika-parsers-exiftool-embed-patch.txt This patch adds an ExifTool ExternalEmbedder which builds upon the work in issue TIKA-774 and TIKA-775. In the tika-parsers an ExiftoolExternalEmbedder is added which extends ExternalEmbedder to programmatically create an Embedder which calls the ExifTool command line to embed tika metadata into a file stream and an ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and XMP fields then parses the resulting file stream to verify the operation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1127) text/xml for tika-server
Chris A. Mattmann created TIKA-1127: --- Summary: text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-988: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. We don't extract a placeholder for a Word document embedded in an Excel document Key: TIKA-988 URL: https://issues.apache.org/jira/browse/TIKA-988 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Fix For: 1.5 Attachments: bug31373.xls In TIKA-956 we fixed the Word parser so that at the point where an embedded document appears, we output a <div class="embedded" id="_XXX"/> tag. It would be nice to do this for documents embedded in Excel too.
[jira] [Updated] (TIKA-1122) Tika fails to parse chm files
[ https://issues.apache.org/jira/browse/TIKA-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1122: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Tika fails to parse chm files - Key: TIKA-1122 URL: https://issues.apache.org/jira/browse/TIKA-1122 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Tejas Patil Priority: Minor Fix For: 1.5 (Reported by Jan Riewe on the Nutch user group, see http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html) Nutch fails to parse chm files with: ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp. Even when running tika-app standalone (i.e. not via Nutch), not a single chm file was parsed (I tried 10-15 different chm files of various sizes).
[jira] [Updated] (TIKA-985) Support for HTML5 elements
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-985: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Support for HTML5 elements -- Key: TIKA-985 URL: https://issues.apache.org/jira/browse/TIKA-985 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.2 Reporter: Markus Jelsma Fix For: 1.5 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, TIKA-985-1.3-3.patch TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, section). This prevents some custom ContentHandlers from reading expected elements and/or attributes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-891: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Fix For: 1.5 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1079: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Word document hits AIOOBE in SummaryExtractor.parseSummaries Key: TIKA-1079 URL: https://issues.apache.org/jira/browse/TIKA-1079 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.5 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc I'm not yet sure if this is a corrupted document (though MS Word opens it just fine) or a bug in POI ... but I hit this exception when running it through TikaCLI: {noformat} java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161) at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158) at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163) at org.apache.poi.hpsf.Property.init(Property.java:164) at org.apache.poi.hpsf.Section.init(Section.java:277) at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451) at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246) at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78) at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109) {noformat}
[jira] [Updated] (TIKA-1110) Incorrectly declared SUPPORTED_TYPES in ChmParser.
[ https://issues.apache.org/jira/browse/TIKA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1110: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Incorrectly declared SUPPORTED_TYPES in ChmParser. -- Key: TIKA-1110 URL: https://issues.apache.org/jira/browse/TIKA-1110 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3, 1.4 Reporter: Andrzej Bialecki Fix For: 1.5 [This link|http://www.iana.org/assignments/media-types/application/vnd.ms-htmlhelp] assigns the official mime type for these files to application/vnd.ms-htmlhelp. In the wild there are also two other types used: * application/chm * application/x-chm tika-mimetypes.xml uses the correct official mime type, but ChmParser declares that it supports only application/chm. For this reason content that uses the official mime type (e.g. coming via Detector or parsed using AutoDetectParser, or simply declared in metadata) fails to parse due to unknown mime type. The fix seems simple - ChmParser should declare also all of the above types in its SUPPORTED_TYPES. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
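The fix proposed in TIKA-1110 amounts to widening the set of declared types. The sketch below shows the idea with plain strings and JDK collections only; it is an assumption-laden illustration, since the real ChmParser declares its types through Tika's own media type class, and ChmTypesSketch and supports are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: one parser declaring the official type plus the two aliases.
public class ChmTypesSketch {
    static final Set<String> SUPPORTED_TYPES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    "application/vnd.ms-htmlhelp", // official IANA registration
                    "application/chm",             // alias seen in the wild
                    "application/x-chm")));        // alias seen in the wild

    // Returns true when the given type string is one of the declared types.
    static boolean supports(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return SUPPORTED_TYPES.contains(mimeType.trim().toLowerCase(Locale.ROOT));
    }
}
```

With only application/chm declared, a lookup for the official type would fail in the same way as the error reported in TIKA-1122.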
[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element
[ https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-995: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. XHTMLContentHandler doesn't pass attributes of body element --- Key: TIKA-995 URL: https://issues.apache.org/jira/browse/TIKA-995 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Markus Jelsma Fix For: 1.5 Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch XHTMLContentHandler.startElement() uses lazyHead() for the body element because it's defined in the AUTO Set. As a consequence, attributes of the body element are not passed to downstream content handlers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-819: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Make Option to Exclude Embedded Files' Text for Text Content Key: TIKA-819 URL: https://issues.apache.org/jira/browse/TIKA-819 Project: Tika Issue Type: New Feature Components: general Affects Versions: 1.0 Environment: Windows-7 + JDK 1.6 u26 Reporter: Albert L. Fix For: 1.5 It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package
[ https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1086: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Tika-bundle 1.3 does not import org.w3c.dom package --- Key: TIKA-1086 URL: https://issues.apache.org/jira/browse/TIKA-1086 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Gaurav Fix For: 1.5 Attachments: TIKA-1086.svn.diff The tika-bundle 1.3 version does not import the org.w3c.dom package; as a result it is not able to parse DOM-based documents such as Microsoft Word (docx) documents. This issue does not occur in version 1.2, which does import the necessary package, so parsing of these documents works fine there. Can someone please look into the issue, as Microsoft Word is a very popular document format.
[jira] [Updated] (TIKA-605) Tika GDAL parser
[ https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-605: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Tika GDAL parser Key: TIKA-605 URL: https://issues.apache.org/jira/browse/TIKA-605 Project: Tika Issue Type: New Feature Components: parser Environment: indep. of env. Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: gdal, gsoc2013, integration, mentor, tika Fix For: 1.5 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, TIKA-605.Mattmann.092511.patch.txt Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser around GDAL. See here: http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1059: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. Better Handling of InterruptedException in ExternalParser and ExternalEmbedder -- Key: TIKA-1059 URL: https://issues.apache.org/jira/browse/TIKA-1059 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Fix For: 1.5 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
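The two options named in TIKA-1059 can be combined: restore the interrupt flag and also rethrow a wrapped exception. The following is a minimal sketch using only the JDK, not the actual ExternalParser code; ParseFailedException is a hypothetical stand-in for TikaException, and Thread.sleep stands in for waiting on the external process.

```java
// Hypothetical sketch of handling InterruptedException instead of swallowing it.
public class InterruptHandlingSketch {

    // Stand-in for TikaException, so the example needs no Tika dependency.
    static class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Stand-in for waiting on an external command to finish.
    static void waitForExternalTool(long millis) throws ParseFailedException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the flag so callers further up can still see the interrupt.
            Thread.currentThread().interrupt();
            throw new ParseFailedException("Interrupted while waiting for external tool", e);
        }
    }

    // Returns true if the interrupt flag survived the call; clears it afterwards.
    static boolean demo() {
        Thread.currentThread().interrupt(); // simulate a pending interrupt
        try {
            waitForExternalTool(1000);
        } catch (ParseFailedException expected) {
            // The wrapped exception reports the failure to the caller.
        }
        return Thread.interrupted();
    }
}
```

Restoring the flag matters because catching InterruptedException clears it; code that ignores the exception therefore hides the cancellation request from every caller above it.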
[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1072: Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.5 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
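The question raised in TIKA-1072, whether to skip an embedded entry on AIOOBE as well as on the dedicated exception, can be illustrated with a toy reader. Everything here is hypothetical: NotOle10Exception stands in for Ole10NativeException, and parseEntry is a deliberately naive reader for an invented two-byte-length layout, not POI's Ole10Native format.

```java
// Hypothetical sketch: skip an embedded entry on either failure mode.
public class SkipBadEmbeddedSketch {

    // Stand-in for the checked exception the real code already catches.
    static class NotOle10Exception extends Exception {
        NotOle10Exception(String message) {
            super(message);
        }
    }

    // Reads a little-endian 16-bit offset, then the byte it points at.
    // It trusts the header, so a bogus offset walks off the end of the array.
    static int parseEntry(byte[] data) throws NotOle10Exception {
        if (data.length < 2) {
            throw new NotOle10Exception("entry too short for a header");
        }
        int declared = (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
        return data[2 + declared] & 0xFF;
    }

    // Returns true if the entry was handled, false if it was skipped.
    static boolean tryHandle(byte[] data) {
        try {
            parseEntry(data);
            return true;
        } catch (NotOle10Exception | ArrayIndexOutOfBoundsException e) {
            // Treat both as "not a usable entry" and carry on with the document.
            return false;
        }
    }
}
```

Catching the unchecked exception at the call site keeps one malformed entry from aborting the whole document, at the cost of possibly masking a genuine bug in the reader, which is the trade-off the issue asks about.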
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-774: --- Fix Version/s: (was: 1.4) 1.5 - push to 1.5, get ready for 1.4 RC #1. ExifTool Parser --- Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, newbie, patch Fix For: 1.5 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types. In the core project: An ExifTool interface is added which contains Property objects that define the metadata fields available. An additional Property constructor is added for the internalTextBag type. In the parsers project: An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those have not been changed at this time. An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields if enabled. An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files. An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Resolved] (TIKA-1127) text/xml for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved TIKA-1127. - Resolution: Fixed - patch applied in r1486665. text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch
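For readers who want to try the text/xml output once the patch is in, the following Java sketch shows one way a client might ask the /tika resource for XML. Several details are assumptions rather than facts from this thread: that the server listens on http://localhost:9998, that the document is sent with PUT, and that the output type is chosen through the Accept header. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client sketch; endpoint, method and header are assumptions.
public class TikaServerXmlSketch {

    // Builds (but does not yet send) a PUT request that asks for the given
    // media type, e.g. "text/xml", "text/html" or "text/plain".
    static HttpURLConnection prepare(String endpoint, String accept) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", accept);
        return conn;
    }

    // Uploads the document and returns the server's response body as a string.
    static String extract(String endpoint, Path document, String accept) throws IOException {
        HttpURLConnection conn = prepare(endpoint, accept);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(document, out);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call such as `TikaServerXmlSketch.extract("http://localhost:9998/tika", Path.of("sample.ppt"), "text/xml")` would then return the XML rendering, provided a server with the patch applied is running at that address; `sample.ppt` is a placeholder file name.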
[jira] [Updated] (TIKA-1086) Tika-bundle 1.3 does not import org.w3c.dom package
[ https://issues.apache.org/jira/browse/TIKA-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1086: Fix Version/s: (was: 1.2) 1.4 Tika-bundle 1.3 does not import org.w3c.dom package --- Key: TIKA-1086 URL: https://issues.apache.org/jira/browse/TIKA-1086 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Gaurav Fix For: 1.4 Attachments: TIKA-1086.svn.diff The tika-bundle 1.3 version does not import the org.w3c.dom package; as a result, it is not able to parse DOM-based documents such as Microsoft Word (docx) documents. This issue does not occur in version 1.2, which does import the necessary package, so parsing of these documents works fine there. Can someone please look into the issue, as Microsoft Word is a very popular document format.
Re: [DISCUSS] Apache Tika 1.4 RC?
+1, thanks Chris! Mike McCandless http://blog.mikemccandless.com On Mon, May 27, 2013 at 1:06 PM, Mattmann, Chris A (398J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, I have some free cycles this week -- and the energy to produce a Tika 1.4 RC. Sound good? I cleaned up JIRA and got all resolved (22) issues done and scheduled for 1.4. Did I miss anything? If I don't hear any objections expect an RC #1 for 1.4 by the end of the week. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] [Commented] (TIKA-1127) text/xml for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667934#comment-13667934 ] Ali Mosavian commented on TIKA-1127: Ghansk Chris! text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch
[jira] [Issue Comment Deleted] (TIKA-1127) text/xml for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Mosavian updated TIKA-1127: --- Comment: was deleted (was: Ghansk Chris!) text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch
[jira] [Commented] (TIKA-1127) text/xml for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667935#comment-13667935 ] Ali Mosavian commented on TIKA-1127: Thanks Chris! text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch
[jira] [Commented] (TIKA-1127) text/xml for tika-server
[ https://issues.apache.org/jira/browse/TIKA-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667965#comment-13667965 ] Chris A. Mattmann commented on TIKA-1127: - np probs, thanks to you, Ali! text/xml for tika-server Key: TIKA-1127 URL: https://issues.apache.org/jira/browse/TIKA-1127 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.4 [~amosavian] contributed this patch from Github to provide text/xml to tika-server: https://github.com/apache/tika/pull/3.patch