[jira] [Updated] (TIKA-593) Tika network server
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-593:
---
Attachment: TIKA-593.Mattmann.032612.patch.2.txt

- ok tests passing, mostly. Will finish tomorrow morning!

Tika network server
---
Key: TIKA-593
URL: https://issues.apache.org/jira/browse/TIKA-593
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Fix For: 1.2
Attachments: TIKA-593.Mattmann.032612.patch.2.txt, TIKA-593.Mattmann.032612.patch.txt, TIKA-593_pom.diff

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container. I'd like to be able to set up and run such a server like this:

    $ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar within its classpath.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-593) Tika network server
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-593:
---
Attachment: TIKA-593.Mattmann.032712.patch.2.txt

Tika network server
---
Key: TIKA-593
URL: https://issues.apache.org/jira/browse/TIKA-593
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Fix For: 1.2
Attachments: TIKA-593.Mattmann.032612.patch.2.txt, TIKA-593.Mattmann.032612.patch.txt, TIKA-593.Mattmann.032712.patch.2.txt, TIKA-593.Mattmann.032712.patch.txt, TIKA-593_pom.diff

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container. I'd like to be able to set up and run such a server like this:

    $ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar within its classpath.
[jira] [Updated] (TIKA-593) Tika network server
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-593:
---
Attachment: TIKA-593.Mattmann.032612.patch.txt

- Max, FYI, here is my current progress. I'm trying to get the unit tests rewritten but they are failing right now. Check out MetadataResource to see. The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food. I will go to the CXF lists tomorrow with my question about the failing unit tests.

Tika network server
---
Key: TIKA-593
URL: https://issues.apache.org/jira/browse/TIKA-593
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Fix For: 1.2
Attachments: TIKA-593.Mattmann.032612.patch.txt, TIKA-593_pom.diff

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container. I'd like to be able to set up and run such a server like this:

    $ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar within its classpath.
[jira] [Updated] (TIKA-874) Identify FITS (Flexible Image Transport System) files
[ https://issues.apache.org/jira/browse/TIKA-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-874:
---
Affects Version/s: (was: 1.2) (was: 1.1)
Fix Version/s: 1.2

- update fix version, no affects version since new feature.

Identify FITS (Flexible Image Transport System) files
---
Key: TIKA-874
URL: https://issues.apache.org/jira/browse/TIKA-874
Project: Tika
Issue Type: Improvement
Components: mime
Reporter: Peter May
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.2
Attachments: fits_support.patch

Tika does not have a defined signature for application/fits files. I have created a patch (based on file(1) magic) to address identification of such files, including a simple unit test. This patch only handles identification, not parsing of FITS files.
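For readers unfamiliar with signature-based identification, the idea behind the patch can be sketched as below. This is not the attached patch. It is a minimal illustration that assumes the signature is the ASCII string "SIMPLE  =" at offset 0, which is how a FITS primary header begins; the actual patch may match a longer or different pattern. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: checks a buffer for the leading FITS keyword. */
final class FitsMagic {
    // Assumption: a FITS primary header starts with the keyword SIMPLE,
    // padded to 8 characters, followed by '='.
    private static final byte[] MAGIC = "SIMPLE  =".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first bytes of {@code head} equal the assumed FITS signature. */
    static boolean looksLikeFits(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] fits = "SIMPLE  =                    T".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "Just some plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeFits(fits));
        System.out.println(looksLikeFits(text));
    }
}
```

As in the issue, this only covers identification; nothing here parses FITS headers or data.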
[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-817:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

(PPT/PPTX) Missing date/time in text content.
---
Key: TIKA-817
URL: https://issues.apache.org/jira/browse/TIKA-817
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.0
Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
Fix For: 1.2

Missing date/time text in text content for PPT and PPTX files. The date and time are missing from the text content. This occurs when one chooses the following with MS-PowerPoint 2010:
1) Insert
2) Date Time
3) Update automatically
4) save to PPT or PPTX
[jira] [Updated] (TIKA-861) Parse links in PDF
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-861:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Parse links in PDF
---
Key: TIKA-861
URL: https://issues.apache.org/jira/browse/TIKA-861
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Reporter: Sasha Goodman
Priority: Minor
Labels: links, pdfbox
Fix For: 1.2
Original Estimate: 4h
Remaining Estimate: 4h

Currently the XHTML doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't done Java for 6 years, but someone more experienced could probably do this in a few hours. The PDF2XHTML method loops through the annotations. See:
{code:java}
136: for(Object o : page.getAnnotations()) {
{code}
I found some code for dealing with links in annotations: http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
It involves checking the class.
{code:java}
if( annotation instanceof PDAnnotationLink ) {
    PDAnnotationLink link = (PDAnnotationLink)annotation;
{code}
I hope this helps someone.
[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding
[ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

TXT parser does not honour the specified encoding
---
Key: TIKA-868
URL: https://issues.apache.org/jira/browse/TIKA-868
Project: Tika
Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet
Fix For: 1.2

With input text Indanyl, the encoding is recognized as IBM500, even when UTF-8 is specified explicitly. I would argue that detection should only be used when the declared information is incorrect (saving time and avoiding wrong detection), as proposed by Ken Krugler in TIKA-539.
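The "declared encoding first" policy argued for above could look roughly like the sketch below. It is only an illustration of the idea, not Tika's code: it assumes that "incorrect" means "the bytes do not decode cleanly in the declared charset", and the detect() method is a hypothetical stand-in for a statistical detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: trust the declared charset unless the bytes contradict it. */
final class DeclaredEncodingFirst {

    /** Stand-in for a statistical detector (assumption: always answers ISO-8859-1 here). */
    static Charset detect(byte[] data) {
        return StandardCharsets.ISO_8859_1;
    }

    /** Strict decode: malformed or unmappable input is reported instead of replaced. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Uses the declared charset when it fits the data; only otherwise falls back to detection. */
    static Charset choose(byte[] data, Charset declared) {
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;
        }
        return detect(data);
    }

    public static void main(String[] args) {
        byte[] input = "Indanyl".getBytes(StandardCharsets.UTF_8);
        System.out.println(choose(input, StandardCharsets.UTF_8));
    }
}
```

With this ordering a short input such as the one in the report never reaches the detector when the caller has declared a charset that fits, which is the time saving and misdetection avoidance the reporter asks for.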
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Some parsers produce non-well-formed XHTML SAX events
---
Key: TIKA-715
URL: https://issues.apache.org/jira/browse/TIKA-715
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
Fix For: 1.2
Attachments: TIKA-715.patch

With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. I.e., each startElement(foo) is matched by the closing endElement(foo). I only did a basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents...

I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (e.g. closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix.

Failures:
{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)

testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)

testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
java.lang.AssertionError: p inside p
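The kind of check described in this issue can be sketched with a stack of open element names. The class below is not Tika's SafeContentHandler; it is a hypothetical, self-contained handler that reproduces the three failure messages seen in the traces. It assumes "p inside p" means a p opened while any ancestor p is still open, which may be stricter or looser than the committed code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a handler that asserts start/end element events are balanced. */
final class TagBalanceChecker extends DefaultHandler {
    // Names of elements that have been started but not yet ended, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: reject a p when any enclosing element is also a p.
        if ("p".equals(localName) && open.contains("p")) {
            throw new AssertionError("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new AssertionError("mismatched elements open=" + expected + " close=" + localName);
        }
    }

    public static void main(String[] args) {
        Attributes none = new AttributesImpl();
        TagBalanceChecker checker = new TagBalanceChecker();
        checker.startElement("", "table", "table", none);
        checker.startElement("", "tr", "tr", none);
        try {
            // Closing the table without closing the tr, as in the Keynote failure.
            checker.endElement("", "table", "table");
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking that each tag appears only under a valid parent, as suggested in the description, would need a table of allowed parent/child pairs on top of this stack.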
[jira] [Updated] (TIKA-816) (XLS/XLSX) Improperly formatted date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-816:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

(XLS/XLSX) Improperly formatted date/time in text content.
---
Key: TIKA-816
URL: https://issues.apache.org/jira/browse/TIKA-816
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.0
Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
Fix For: 1.2

Improperly formatted text content for XLS and XLSX files. The date and time are not formatted as date/time data but rather as floating point numbers. This occurs for cells with the content =now() or =today().
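For context on why such cells show up as floating point numbers: Excel stores a date/time as a serial number, with whole days in the integer part and the time of day in the fraction. The sketch below shows the conversion under the 1900 date system, using the common 1899-12-30 epoch convention (correct for serials from 1900-03-01 onward). It is a hypothetical helper for illustration, not what Tika or POI do internally, and it ignores workbooks that use the 1904 date system.

```java
import java.time.LocalDateTime;

/** Hypothetical sketch: convert an Excel 1900-system serial number to a date/time. */
final class ExcelSerialDate {
    // Conventional day zero; accounts for Excel treating 1900 as a leap year,
    // so results are only reliable for serials >= 61 (1900-03-01).
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    static LocalDateTime toDateTime(double serial) {
        long days = (long) Math.floor(serial);
        // The fractional part is the fraction of a 24-hour day.
        long seconds = Math.round((serial - days) * 86_400);
        return EPOCH.plusDays(days).plusSeconds(seconds);
    }

    public static void main(String[] args) {
        System.out.println(toDateTime(40909.5));
        System.out.println(toDateTime(40995.75));
    }
}
```

Under these assumptions 40909.5 corresponds to noon on 2012-01-01. A proper fix would also need the cell's number format to decide whether a numeric cell is a date at all.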
[jira] [Updated] (TIKA-605) Tika GDAL parser
[ https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-605:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Tika GDAL parser
---
Key: TIKA-605
URL: https://issues.apache.org/jira/browse/TIKA-605
Project: Tika
Issue Type: New Feature
Components: parser
Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Labels: gdal, integration, tika
Fix For: 1.2
Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, TIKA-605.Mattmann.092511.patch.txt

Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser around GDAL. See here: http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-819:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Make Option to Exclude Embedded Files' Text for Text Content
---
Key: TIKA-819
URL: https://issues.apache.org/jira/browse/TIKA-819
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 1.0
Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
Fix For: 1.2

It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX.
[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-758:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Address TODOs when we upgrade to next PDFBox release
---
Key: TIKA-758
URL: https://issues.apache.org/jira/browse/TIKA-758
Project: Tika
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 1.2

Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox.
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-776:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

ExifTool Embedder
---
Key: TIKA-776
URL: https://issues.apache.org/jira/browse/TIKA-776
Project: Tika
Issue Type: New Feature
Components: metadata
Affects Versions: 1.0
Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
Labels: embed, exiftool, patch
Fix For: 1.2
Attachments: tika-parsers-exiftool-embed-patch.txt

This patch adds an ExifTool ExternalEmbedder which builds upon the work in issues TIKA-774 and TIKA-775.

In the tika-parsers project an ExiftoolExternalEmbedder is added which extends ExternalEmbedder to programmatically create an Embedder which calls the ExifTool command line to embed tika metadata into a file stream. An ExiftoolExternalEmbedderTest unit test is also added which embeds several IPTC and XMP fields, then parses the resulting file stream to verify the operation.
[jira] [Updated] (TIKA-820) Locator is unset for HTML parser
[ https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-820:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Locator is unset for HTML parser
---
Key: TIKA-820
URL: https://issues.apache.org/jira/browse/TIKA-820
Project: Tika
Issue Type: Bug
Components: general, parser
Affects Versions: 1.0
Reporter: Daniel Bonniot de Ruisselet
Labels: patch
Fix For: 1.2
Attachments: text-locator.patch

The HtmlParser does not call setDocumentLocator(Locator locator) on the user's content handler. Patch and unit test attached.
[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
[ https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-754:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
---
Key: TIKA-754
URL: https://issues.apache.org/jira/browse/TIKA-754
Project: Tika
Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
Fix For: 1.2
Attachments: TIKA-754.poc.patch

As seen with some parsers (PDF, PPT), some text blocks still contain newline characters ('\n') in the outputted XHTML. A global fix for this could be located in XHTMLContentHandler.characters(...): by analyzing the given char array, insert a BR element whenever a '\n' char is encountered.
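The proposal above can be sketched as a small SAX handler that splits each characters() call at '\n' and emits an empty br element in place of each newline. This is not the attached proof-of-concept patch. The class name is hypothetical, only characters() is shown (a real decorator would forward every other event too), and the XHTML namespace constant is an assumption about what the downstream handler expects.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: replaces '\n' in character data with <br/> element events. */
final class LineBreakInserter extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final ContentHandler out;

    LineBreakInserter(ContentHandler out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int segmentStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                // Flush the text before the newline, if any.
                if (i > segmentStart) {
                    out.characters(ch, segmentStart, i - segmentStart);
                }
                // Emit an empty br element instead of the newline character.
                out.startElement(XHTML, "br", "br", new AttributesImpl());
                out.endElement(XHTML, "br", "br");
                segmentStart = i + 1;
            }
        }
        // Flush whatever follows the last newline.
        if (end > segmentStart) {
            out.characters(ch, segmentStart, end - segmentStart);
        }
    }
}
```

One design question the issue leaves open is whether this should apply everywhere; inside elements such as pre, newlines are significant and probably should be passed through unchanged.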
[jira] [Updated] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-775:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Embed Capabilities
---
Key: TIKA-775
URL: https://issues.apache.org/jira/browse/TIKA-775
Project: Tika
Issue Type: Improvement
Components: general, metadata
Affects Versions: 1.0
Environment: The default ExternalEmbedder requires that sed be installed.
Reporter: Ray Gauss II
Labels: embed, patch
Fix For: 1.2
Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt

This patch defines and implements the concept of embedding tika metadata into a file stream, the reverse of extraction.

In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder implementation meant to be extended or configured are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.

In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then verify the operation by parsing the resulting stream.
[jira] [Updated] (TIKA-593) Tika network server
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-593:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Tika network server
---
Key: TIKA-593
URL: https://issues.apache.org/jira/browse/TIKA-593
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Fix For: 1.2
Attachments: TIKA-593_pom.diff

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container. I'd like to be able to set up and run such a server like this:

    $ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar within its classpath.
[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects
[ https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-859:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

DublinCore Metadata Keys Should be Prefixed and Property Objects
---
Key: TIKA-859
URL: https://issues.apache.org/jira/browse/TIKA-859
Project: Tika
Issue Type: Improvement
Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
Fix For: 1.2
Attachments: dublincore-prefixed-patch.diff

To help avoid collisions of key names in the interfaces Metadata implements, and to allow for a more precise definition of DublinCore, the keys should be defined as Property objects with the object name and name attribute containing a prefix, and the existing String keys deprecated, i.e.
{code:title=DublinCore.java}
String SUBJECT = "subject";
{code}
would become:
{code:title=DublinCore.java}
@Deprecated
String SUBJECT = "subject";
Property DC_SUBJECT = Property.internalTextBag(PREFIX_DC + PREFIX_DELIMITER + "subject");
{code}
Since the use of the simpler key definition is desired eventually, at some point in the future, perhaps 2.0, these prefixed definitions could themselves be deprecated and the move made back to the simpler names.
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-774:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

ExifTool Parser
---
Key: TIKA-774
URL: https://issues.apache.org/jira/browse/TIKA-774
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Environment: Requires ExifTool be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
Labels: features, newbie, patch
Fix For: 1.2
Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt

Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.

In the core project:
- An ExifTool interface is added which contains Property objects that define the metadata fields available.
- An additional Property constructor for internalTextBag type.

In the parsers project:
- An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser but those have not been changed at this time.
- An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
- An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing tika and Drew Noakes metadata fields if enabled.
- An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files.
- An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Updated] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library
[ https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-842:
---
Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

IPTC Properties Should be Defined Completely and Independently of the Drew Library
---
Key: TIKA-842
URL: https://issues.apache.org/jira/browse/TIKA-842
Project: Tika
Issue Type: Improvement
Components: metadata
Affects Versions: 1.0
Reporter: Ray Gauss II
Fix For: 1.2
Attachments: IPTC-metadata-def-patch.diff, iptc-dublincore-aliased-patch.diff, metadata-remove-iptc-patch.diff

All of the IPTC XMP specification should be defined in tika-core and should not be reliant on the Drew Noakes library as it is incomplete in its support of the standard and the properties are not defined in proper namespaces or prefixed.
[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately
[ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-862:
---
Component/s: parser
Affects Version/s: 1.0

- classify and identify version (I think)

JPSS HDF5 files not being detected appropriately
---
Key: TIKA-862
URL: https://issues.apache.org/jira/browse/TIKA-862
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.0
Reporter: Richard Yu
Assignee: Chris A. Mattmann

As commented in TIKA-614, JPSS HDF5 files are not being properly detected by Tika. See this, from [~minfing]:
{quote}
We were trying to extract metadata from our h5 file (i.e. with JPSS extension). We ran the following command line:
{noformat}
[ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \
/usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
Content-Encoding: windows-1252
Content-Length: 22187952
Content-Type: text/plain
resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
[ryu@localhost hdf5extractor]$
{noformat}
We noticed that the content type is text/plain and there are only 4 lines of output (i.e. we expected a lot of metadata). Let me know if more information is needed.
Thanks!
Richard
{quote}
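When triaging a report like this, one quick check is whether the file carries the HDF5 format signature where a magic-based detector would look for it. The sketch below is a hypothetical diagnostic, not Tika code. It relies on the HDF5 file format rule that the 8-byte signature (0x89 'H' 'D' 'F' '\r' '\n' 0x1A '\n') sits at offset 0 or at 512, 1024, 2048 and so on. Whether that explains this particular misdetection is not established by the report.

```java
/** Hypothetical diagnostic: locate the HDF5 signature in a byte buffer. */
final class Hdf5Signature {
    private static final byte[] SIGNATURE = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    /** Returns the offset of the signature (0, 512, 1024, ...), or -1 if it is not found. */
    static long find(byte[] data) {
        // Candidate offsets are 0, then 512 doubling each step.
        for (long offset = 0; offset + SIGNATURE.length <= data.length;
                offset = (offset == 0) ? 512 : offset * 2) {
            if (matchesAt(data, (int) offset)) {
                return offset;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (data[offset + i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withUserBlock = new byte[1024];
        System.arraycopy(SIGNATURE, 0, withUserBlock, 512, SIGNATURE.length);
        System.out.println(find(withUserBlock));
    }
}
```

For a file of the size in the report (about 22 MB) one would read only the candidate offsets from disk rather than loading everything into a byte array; the array is used here purely to keep the sketch self-contained.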
[jira] [Updated] (TIKA-605) Tika GDAL parser
[ https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-605:
---
Fix Version/s: (was: 1.0) 1.1

- push out to 1.1: prep for 1.0.

Tika GDAL parser
---
Key: TIKA-605
URL: https://issues.apache.org/jira/browse/TIKA-605
Project: Tika
Issue Type: New Feature
Components: parser
Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Labels: gdal, integration, tika
Fix For: 1.1
Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, TIKA-605.Mattmann.092511.patch.txt

Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser around GDAL. See here: http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java
[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
[ https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-754:
---
Fix Version/s: 1.1 (was: 1.0)

- push out to 1.1: prep for 1.0.

Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
--
Key: TIKA-754
URL: https://issues.apache.org/jira/browse/TIKA-754
Project: Tika
Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
Fix For: 1.1
Attachments: TIKA-754.poc.patch

As seen with some parsers (PDF, PPT), some text blocks still contain newline characters ('\n') in the output XHTML.

A global fix for this could be located in XHTMLContentHandler.characters(...): while scanning the given char array, insert a BR element in place of each '\n' char encountered.
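The idea proposed in TIKA-754 can be sketched roughly as follows. This is not the attached proof-of-concept patch and not Tika's XHTMLContentHandler: the class LineBreakSketch and its methods are hypothetical, and the sketch only uses the standard org.xml.sax API from the JDK. It splits the char array at each '\n', forwards the text runs unchanged, and emits an empty br element in between.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the TIKA-754 patch.
final class LineBreakSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Forwards ch[start, start + length) to the handler, replacing each '\n'
    // with an empty <br/> element.
    static void charactersWithBreaks(ContentHandler handler, char[] ch, int start, int length)
            throws SAXException {
        int end = start + length;
        int runStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                if (i > runStart) {
                    handler.characters(ch, runStart, i - runStart);
                }
                handler.startElement(XHTML, "br", "br", new AttributesImpl());
                handler.endElement(XHTML, "br", "br");
                runStart = i + 1;
            }
        }
        if (end > runStart) {
            handler.characters(ch, runStart, end - runStart);
        }
    }

    // Demo helper: records the emitted events as a simple string.
    static String render(String text) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append("/>");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            charactersWithBreaks(recorder, text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

The sketch does not treat "\r\n" specially and does not check whether the current parent element is one where a br is appropriate; a real fix would need to decide both.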
[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)
[ https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-757:
---
Fix Version/s: 1.1 (was: 1.0)

- push out to 1.1: prep for 1.0.

Address TODOs when we upgrade to next POI release (3.8 beta 5)
--
Key: TIKA-757
URL: https://issues.apache.org/jira/browse/TIKA-757
Project: Tika
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 1.1

I'm opening a blanket issue to remind us all to address the TODOs in the sources for when we upgrade to the next POI.

I think this (a single blanket issue) is better than keeping separate issues open even though they are technically fixed?

For example, I've committed TIKA-753 (speedups for embedded office docs), yet it included some TODOs for further speedups possible once we upgrade POI. Rather than keeping TIKA-753 (and others like it) open, I think we should resolve them and let this issue cover all the TODOs.
[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-758:
---
Fix Version/s: 1.1 (was: 1.0)

- push out to 1.1: prep for 1.0.

Address TODOs when we upgrade to next PDFBox release
Key: TIKA-758
URL: https://issues.apache.org/jira/browse/TIKA-758
Project: Tika
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 1.1

Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox.
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: 1.1 (was: 1.0)

- push out to 1.1: prep for 1.0.

Some parsers produce non-well-formed XHTML SAX events
-
Key: TIKA-715
URL: https://issues.apache.org/jira/browse/TIKA-715
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
Fix For: 1.1
Attachments: TIKA-715.patch

With TIKA-683 I committed simple, commented-out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. I.e., each startElement(foo) is matched by the closing endElement(foo).

I only did a basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents...

I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts.

It could be these are relatively minor offenses (e.g. closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events.

I haven't looked into any of these... it could be they are easy to fix.

Failures:

{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)

testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)

testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
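
For readers unfamiliar with the kind of check described in TIKA-715, a stack of open element names is enough to catch both failure modes seen in the traces. The sketch below is not the SafeContentHandler code: the class TagNestingChecker is hypothetical, it works on plain element names rather than full SAX events, and its messages merely mirror the wording of the assertions quoted in the issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the code committed with TIKA-683.
final class TagNestingChecker {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        // The issue mentions one extra rule: p must never appear inside another p.
        if (name.equals("p") && open.contains("p")) {
            throw new AssertionError("p nested inside another p");
        }
        open.push(name);
    }

    void endElement(String name) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + name + " with no startElement");
        }
        String top = open.pop();
        if (!top.equals(name)) {
            throw new AssertionError("mismatched elements open=" + top + " close=" + name);
        }
    }

    void endDocument() {
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements at endDocument: " + open);
        }
    }
}
```

Feeding it startElement("table"), startElement("tr"), endElement("table") raises the same kind of "mismatched elements open=tr close=table" error as the Keynote failure listed in the issue.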
[jira] [Updated] (TIKA-565) Improved OSGi bundling
[ https://issues.apache.org/jira/browse/TIKA-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-565:
---
Fix Version/s: 1.1 (was: 1.0)

- push out to 1.1: prep for 1.0.

Improved OSGi bundling
--
Key: TIKA-565
URL: https://issues.apache.org/jira/browse/TIKA-565
Project: Tika
Issue Type: Improvement
Components: packaging
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Fix For: 1.1
Attachments: core-bundle-fix.diff

I'd like to add proper integration tests for tika-bundle and expose the Tika facade object as a service so other bundles could access it easily like this:

    @Reference
    private Tika tika;

It would also be nice to allow other OSGi bundles to expose their Parser implementations as pluggable services and have the Tika bundle automatically pick up and use them along with all the embedded parsers it contains.