[jira] [Updated] (TIKA-291) Adobe InDesign support
[ https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-291: - Labels: new-parser (was: ) Adobe InDesign support -- Key: TIKA-291 URL: https://issues.apache.org/jira/browse/TIKA-291 Project: Tika Issue Type: Improvement Components: parser Reporter: Jukka Zitting Priority: Minor Labels: new-parser Attachments: simple_test-1.indd It would be great if Tika could extract content from Adobe InDesign documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-94) Speech recognition
[ https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-94: Labels: new-parser (was: ) Speech recognition -- Key: TIKA-94 URL: https://issues.apache.org/jira/browse/TIKA-94 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Priority: Minor Labels: new-parser Like OCR for image files (TIKA-93), we could try using speech recognition to extract text content (where available) from audio (and video!) files. The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and comes with a friendly license. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-289: - Labels: new-parser (was: ) Add magic byte patterns from file(1) Key: TIKA-289 URL: https://issues.apache.org/jira/browse/TIKA-289 Project: Tika Issue Type: Improvement Components: mime Reporter: Jukka Zitting Priority: Minor Labels: new-parser Attachments: file-has-magic-tika-missing.txt, file-mimes-missing.txt As discussed in TIKA-285, the file(1) command comes with a pretty comprehensive set of magic byte patterns. It would be nice to get those patterns included also in Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-617. Resolution: Won't Fix

The underlying exception is
{code}
Caused by: java.util.zip.DataFormatException: invalid distance too far back
	at java.util.zip.Inflater.inflateBytes(Native Method)
{code}
So, I'm closing this as Won't Fix. If anyone objects, please reopen.

Series of exceptions from PDFBox Key: TIKA-617 URL: https://issues.apache.org/jira/browse/TIKA-617 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Erik Hetzner

Hi, I am getting the following exception from PDFBox. Thank you! (If I should file these upstream at PDFBox first, please let me know.)

{noformat}
$ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
ERROR - Stop reading corrupt stream
INFO - unsupported/disabled operation: f24.481
INFO - unsupported/disabled operation: ree)n.
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: i-
INFO - unsupported/disabled operation: R4%
INFO - unsupported/disabled operation: )
INFO - unsupported/disabled operation: Re.8
INFO - unsupported/disabled operation: e.
INFO - unsupported/disabled operation: FE)-
WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
INFO - unsupported/disabled operation: R3%
INFO - unsupported/disabled operation: T
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at
[jira] [Updated] (TIKA-627) Support X12 files
[ https://issues.apache.org/jira/browse/TIKA-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-627: - Labels: new-parser (was: ) Support X12 files - Key: TIKA-627 URL: https://issues.apache.org/jira/browse/TIKA-627 Project: Tika Issue Type: New Feature Components: mime, parser Reporter: Jukka Zitting Priority: Minor Labels: new-parser X12 [1] is a standardized data interchange format. It would be nice if Tika could understand such files. [1] http://en.wikipedia.org/wiki/ASC_X12 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-669) Backup plan for parsing
[ https://issues.apache.org/jira/browse/TIKA-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-669. Resolution: Duplicate Backup plan for parsing --- Key: TIKA-669 URL: https://issues.apache.org/jira/browse/TIKA-669 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Currently once a document type has been detected we direct the document to the one parser that best matches the detected type. In practice there are cases where that parser finds that it in fact cannot parse this document, for example when something that looked like XML turns out to have syntax errors. For such cases it would be nice if the CompositeParser could then retry parsing the document with a more generic backup parser, like the plain text parser for malformed XML. Implementing this would require some level of buffering and redirection of both parser input and output. Input buffering is easy, but for output buffering we'd probably need to implement new ContentHandler and Metadata layers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
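The buffer-and-retry flow described in that report can be sketched in plain Java. This is a minimal, hypothetical sketch (the names StreamParser and parseWithFallback are made up here, not Tika API), and it covers only the easy half, input buffering; redirecting parser output would still need the new ContentHandler and Metadata layers the report mentions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class FallbackParsing {

    /** A parser takes a stream and returns extracted text, or throws on failure. */
    interface StreamParser {
        String parse(InputStream in) throws Exception;
    }

    /**
     * Buffer the input once, try the primary parser, and if it fails
     * replay the same bytes through the generic fallback parser.
     */
    static String parseWithFallback(InputStream in, StreamParser primary,
                                    StreamParser fallback) throws Exception {
        byte[] buffered = in.readAllBytes(); // input buffering: the easy half
        try {
            return primary.parse(new ByteArrayInputStream(buffered));
        } catch (Exception primaryFailure) {
            // Retry from the start of the buffered input with the backup parser.
            return fallback.parse(new ByteArrayInputStream(buffered));
        }
    }

    public static void main(String[] args) throws Exception {
        // Toy parsers: a "strict XML" parser that always fails,
        // and a plain-text parser used as the backup.
        StreamParser strictXml = s -> {
            throw new IOException("XML syntax error");
        };
        StreamParser plainText = s -> new String(s.readAllBytes(), StandardCharsets.UTF_8);

        String text = parseWithFallback(
                new ByteArrayInputStream("not really xml".getBytes(StandardCharsets.UTF_8)),
                strictXml, plainText);
        System.out.println(text); // prints: not really xml
    }
}
```

The key design point from the report survives even in this toy version: the input must be fully buffered (or at least resettable) before the first attempt, because the failed primary parser may have consumed an arbitrary prefix of the stream.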
[jira] [Resolved] (TIKA-663) JSP files data extraction failed
[ https://issues.apache.org/jira/browse/TIKA-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-663. -- Resolution: Fixed JSP files data extraction failed Key: TIKA-663 URL: https://issues.apache.org/jira/browse/TIKA-663 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Environment: Windows, Java 6 Reporter: samraj Attachments: File_1.jsp, File_2.jsp, File_3.jsp We have worked with Tika extraction. In 0.8, JSP file contents were extracted well, but in 0.9 the same files are not extracted well. Please give the solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-651) Unescaped attribute value generated
[ https://issues.apache.org/jira/browse/TIKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342499#comment-14342499 ] Tyler Palsulich commented on TIKA-651: -- Is there any update on the XML processing libraries that we use? Do we still want to change up our dependencies? If not, I'll close this issue this week. Unescaped attribute value generated --- Key: TIKA-651 URL: https://issues.apache.org/jira/browse/TIKA-651 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Reporter: Raimund Merkert Assignee: Jukka Zitting Attachments: XHTMLSerializer.java I've converted a Word document that contains hyperlinks with a complex query component. The & character is not escaped, and Mozilla complains about that when I write out the XHTML via a content handler that I wrote. It's not clear to me whether or not my ContentHandler should assume attributes are properly escaped or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-291) Adobe InDesign support
[ https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-291. Resolution: Duplicate Adobe InDesign support -- Key: TIKA-291 URL: https://issues.apache.org/jira/browse/TIKA-291 Project: Tika Issue Type: Improvement Components: parser Reporter: Jukka Zitting Priority: Minor Labels: new-parser Attachments: simple_test-1.indd It would be great if Tika could extract content from Adobe InDesign documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Curating Issues
You da man

Sent from my iPhone

On Mar 1, 2015, at 2:36 PM, Tyler Palsulich tpalsul...@gmail.com wrote:

Alright. I'm up to TIKA-694 and still goin'. :) I've started labeling some issues as new-parser and newbie. I think these should be helpful for organization. Please let me know if there is another label we've already been using for those. I put new-parser on any request to support a new filetype, even if it doesn't require a full-on Parser (e.g. just magic). newbie should be used for new contributors. I'll take no offense if someone reopens/closes anything after I've touched it.

Tyler

On Sat, Feb 28, 2015 at 11:59 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hey Tyler, if you want to take a whack, here are some criteria I tend to use:

1. Bug report from 1+ years old.
- Close it - either not reproducible, fixed in a later version and not come back to, or not as bad of a bug anymore since it's not a blocker.
2. Feature request from 1+ years old that no one has acted upon.
- Good candidate for closing - if it was important, someone would have acted upon it.
3. Issue from 1+ years old with lots of discussion on it.
- Poke the issue - see if a consensus can be reached; if not, move forward and close.
4. Issue that is your own that you aren't interested in anymore and that is 1+ years old.
- Close it - you didn't work on it then, may not get back to it, and no one else has.
5. Issue that is 2+ years old.
- Close, regardless, unless it has a patch.
6. Issue that is 1+ years old, with patch, uncommitted.
- Try to apply the patch, or minimal effort to bring it current with trunk, and apply - if too much work, ask for help; if 1+ weeks pass and no one replies, close it and move forward.

There are more, but that's a start. I'll check out this article - thanks for sending it.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Tyler Palsulich tpalsul...@gmail.com
Reply-To: dev@tika.apache.org
Date: Saturday, February 28, 2015 at 8:53 PM
To: dev@tika.apache.org
Subject: Curating Issues

Hi Folks,

I just read an article [0] about managing a large project's issues list. Tika currently has 331 open issues. Do we know if all of these have been triaged? At what point do we want to label an issue as stale and close it off? What is our preferred split between when to make an issue and when to send a message to the mailing list?

Have a good weekend,
Tyler

[0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1
[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342522#comment-14342522 ] Tyler Palsulich commented on TIKA-715: -- This seems like it's worth looking into. It would be awesome if someone could generate a list of Parsers which generate invalid XHTML and need attention.

Some parsers produce non-well-formed XHTML SAX events Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Michael McCandless Fix For: 1.8 Attachments: TIKA-715.patch

With TIKA-683 I committed simple, commented-out code to SafeContentHandler to verify that the SAX events produced by the parser have valid (matched) tags, i.e. each startElement(foo) is matched by the closing endElement(foo). I only did a basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (e.g. closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures:

{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR!
java.lang.AssertionError: end tag=body with no startElement
	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
	at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)

testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
	at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
	at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
	at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
	at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
	at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
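The matched-tag verification described in that report can be sketched with a plain SAX handler using only the JDK. This is an illustrative, hypothetical class (MatchedTagHandler), not the actual SafeContentHandler code; it raises the same kinds of errors quoted in the failures above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Fails fast when SAX start/end events are not properly nested. */
public class MatchedTagHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        open.push(name); // remember which element we must close next
    }

    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + name + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(name)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + name);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element: " + open.peek());
        }
    }

    public static void main(String[] args) {
        MatchedTagHandler handler = new MatchedTagHandler();
        try {
            handler.startElement("", "p", "p", null);
            handler.endElement("", "p", "p");       // balanced: fine
            handler.endElement("", "body", "body"); // never opened
        } catch (SAXException e) {
            System.out.println(e.getMessage()); // prints: end tag=body with no startElement
        }
    }
}
```

Wrapping a parser's output handler in a decorator like this (Tika's ContentHandlerDecorator is the natural place) is one way to generate the list of offending Parsers the comment asks for.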
[jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342524#comment-14342524 ] Ken Krugler commented on TIKA-539: -- Hi Tyler - I see you closed this as fixed, but I don't remember the change that resolved it...do you have details?

Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Fix For: 1.8 Attachments: TIKA-539.patch, TIKA-539_2.patch

If the encoding in the meta tag is wrong, this encoding is detected, even if the right encoding was set in the metadata before (which can be from the HTTP response header). Test code to reproduce:

static String content = "<html><head>\n"
        + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
        + "</head><body>Über den Wolken\n</body></html>";

/**
 * @param args
 * @throws IOException
 * @throws TikaException
 * @throws SAXException
 */
public static void main(String[] args) throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(1);
    parser.parse(in, h, metadata, new ParseContext());
    System.out.print(h.toString());
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-727) Improve the outputted XHTML by HSLFExtractor
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342542#comment-14342542 ] Tyler Palsulich commented on TIKA-727: -- [~gagravarr], if you applied the above patch, is this issue good to close as Fixed? Improve the outputted XHTML by HSLFExtractor --- Key: TIKA-727 URL: https://issues.apache.org/jira/browse/TIKA-727 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.10 Reporter: Pablo Queixalos Priority: Minor Attachments: HSLFExtractor.java, HSLFExtractor.patch The XHTML output of the HSLFExtractor parser is not pure XHTML; it only inserts the full text into a P[aragraph] tag (including non-HTML carriage returns). This behavior comes from the poor capabilities that the POI PowerPointExtractor offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-740) SAX parser used for HTML
[ https://issues.apache.org/jira/browse/TIKA-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-740. Resolution: Won't Fix

SAX parser used for HTML Key: TIKA-740 URL: https://issues.apache.org/jira/browse/TIKA-740 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Erik Hetzner Attachments: a221657.html

{noformat}
egh@gales[510] 1 :~/d/software/tika-trunk $ java -jar tika-app/target/tika-app-1.0-SNAPSHOT.jar -v http://www.almasry-alyoum.com/article2.aspx?ArticleID=221657 > /dev/null
Exception in thread "main" org.apache.tika.exception.TikaException: XML parse error
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:367)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
Caused by: org.xml.sax.SAXParseException: The element type "td" must be terminated by the matching end-tag "</td>".
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1749)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:65)
	... 6 more
{noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342589#comment-14342589 ] Tyler Palsulich commented on TIKA-758: -- [~talli...@apache.org], now that we're at PDFBox 1.8.8, can we remove the workaround? I removed it locally and all tests pass. Address TODOs when we upgrade to next PDFBox release Key: TIKA-758 URL: https://issues.apache.org/jira/browse/TIKA-758 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Attachments: TIKA-758.Palsulich.061714.patch Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342616#comment-14342616 ] Tyler Palsulich commented on TIKA-819: -- Is there still interest in this cursory option? It shouldn't be difficult to add, if so! Make Option to Exclude Embedded Files' Text for Text Content Key: TIKA-819 URL: https://issues.apache.org/jira/browse/TIKA-819 Project: Tika Issue Type: New Feature Components: general Affects Versions: 1.0 Environment: Windows-7 + JDK 1.6 u26 Reporter: Albert L. Fix For: 1.8 It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-807) PHP version of Tika
[ https://issues.apache.org/jira/browse/TIKA-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved TIKA-807. Resolution: Fixed I think this is old enough to close, especially with an actively developed downstream library. PHP version of Tika --- Key: TIKA-807 URL: https://issues.apache.org/jira/browse/TIKA-807 Project: Tika Issue Type: New Feature Components: packaging Reporter: Ingo Renner Priority: Minor Labels: PHP Inspired by TIKA-773, the outcome of this issue should be a PHP library/wrapper to easily work with Tika in PHP applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-821) Support detecting old Microsoft Works Word Processor formats
[ https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-821: - Labels: new-parser (was: ) Support detecting old Microsoft Works Word Processor formats Key: TIKA-821 URL: https://issues.apache.org/jira/browse/TIKA-821 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.1 Reporter: Antoni Mylka Assignee: Antoni Mylka Labels: new-parser An issue similar to TIKA-812. This time it's about old Works Word Processor formats. They use an OLE2 structure, but the top-level entry is called MatOST, and they are not supported by the OfficeParser. I would like to:
# Add a magic to tika-mimetypes.xml to mark the file as ms-works if MatOST is found. (After TIKA-806 we officially like those.)
# Add an 'if' to POIFSContainerDetector to look for MatOST.
I'm not creating a separate media type for this (like I did in TIKA-812) because no parser supports it anyway. In TIKA-812 it was necessary, because ExcelParser can't work with all vnd.ms-works files but can work with 7.0 spreadsheets. In this case there is no gain in a separate mime type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
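For reference, the first step above might look roughly like the sketch below in tika-mimetypes.xml. This is an untested assumption, not a patch: the outer OLE2 header magic mirrors existing entries in that file, but the nested MatOST match (the unicodeLE type and the 0:8192 offset range) is only a guess at how to locate a UTF-16 directory-entry name, and as point 2 notes, OLE2 subtype detection normally belongs in POIFSContainerDetector.

```xml
<!-- Hypothetical sketch only; offsets and match type are unverified. -->
<mime-type type="application/vnd.ms-works">
  <magic priority="50">
    <!-- Standard OLE2 compound-document header... -->
    <match value="0xd0cf11e0a1b11ae1" type="string" offset="0">
      <!-- ...containing a top-level directory entry named MatOST,
           stored as UTF-16LE somewhere in the early sectors. -->
      <match value="MatOST" type="unicodeLE" offset="0:8192"/>
    </match>
  </magic>
</mime-type>
```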
[jira] [Resolved] (TIKA-676) Boilerpipe fails
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-676. -- Resolution: Fixed No exception is thrown with the file with Tika 1.8-SNAPSHOT. So, closing this as fixed. Open a new issue for upgrading the dependency if relevant. Boilerpipe fails Key: TIKA-676 URL: https://issues.apache.org/jira/browse/TIKA-676 Project: Tika Issue Type: Bug Components: parser Reporter: Gabriele Kahlout Priority: Minor This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].

{code}
$ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
	at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
	at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
	at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-676) Boilerpipe fails
[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342513#comment-14342513 ] Tyler Palsulich commented on TIKA-676: -- There is a [fork|http://search.maven.org/#artifactdetails%7Ccom.robbypond%7Cboilerpipe%7C1.2.3%7Cjar] of Boilerpipe available on Maven Central. Should we switch to that? I'd prefer to stay with the main project. But, it doesn't appear available in Central. Boilerpipe fails Key: TIKA-676 URL: https://issues.apache.org/jira/browse/TIKA-676 Project: Tika Issue Type: Bug Components: parser Reporter: Gabriele Kahlout Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Curating Issues
Alright. I'm up to TIKA-694 and still goin'. :) I've started labeling some issues as new-parser and newbie. I think these should be helpful for organization. Please let me know if there is another label we've already been using for those. I put new-parser on any request to support a new filetype, even if it doesn't require a full-on Parser (e.g. just magic). newbie should be used for new contributors. I'll take no offense if someone reopens/closes anything after I've touched it. Tyler On Sat, Feb 28, 2015 at 11:59 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hey Tyler, if you want to take a whack, here are some criteria I tend to use:
1. Bug report 1+ years old. - Close it - either not reproducible, fixed in a later version and not come back to, or not as bad of a bug anymore since it’s not a blocker.
2. Feature request 1+ years old that no one has acted upon. - Good candidate for closing - if it was important, someone would have acted upon it.
3. Issue 1+ years old with lots of discussion on it - Poke the issue - see if a consensus can be reached; if not, move forward and close.
4. Issue that is your own that you aren’t interested in anymore and is 1+ years old - Close it: you didn’t work on it then, may not get back to it, and no one else has.
5. Issue that is 2+ years old - Close, regardless, unless it has a patch.
6. Issue that is 1+ years old, with patch, uncommitted - Try to apply the patch or minimal effort to bring it current with trunk and apply - if too much work, ask for help - if 1+ weeks and no one replies, close it and move forward.
There are more but that’s a start. I’ll check out this article, thanks for sending it. Cheers, Chris ++ Chris Mattmann, Ph.D. 
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Tyler Palsulich tpalsul...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Saturday, February 28, 2015 at 8:53 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Curating Issues Hi Folks, I just read an article [0] about managing a large project's issues list. Tika currently has 331 open issues. Do we know if all of these have been triaged? At what point do we want to label an issue as stale and close it off? What is our preferred split between when to make an issue and when to send a message to the mailing list? Have a good weekend, Tyler [0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1
[jira] [Closed] (TIKA-694) On extraction, get properties AND / OR content extraction
[ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-694. Resolution: Won't Fix On extraction, get properties AND / OR content extraction - Key: TIKA-694 URL: https://issues.apache.org/jira/browse/TIKA-694 Project: Tika Issue Type: Wish Components: parser Affects Versions: 1.0 Environment: All OS Reporter: Etienne Jouvin Priority: Minor Attachments: Tika-1.0.zip I use TIKA to extract properties, and only properties, from Office files. The parser goes through the document content, which is not necessary and slows down the process. It would be nice to have the choice to extract only properties. What I did was the following: I extended AutoDetectParser to override the parse method. Then, in the ParseContext instance, I put a boolean flag set to true to say "only extract the properties". For example, for Office files, I extended the OfficeParser class. During the parse method, I check the flag, and if it equals true, I skip all content extraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
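The flag-in-context pattern Etienne describes can be sketched without the Tika classes. This is a minimal illustrative version: the ParseContext is modeled as a plain map, and the class and key names (`PropertiesOnlySketch`, `ONLY_PROPERTIES`) are invented for this sketch, not part of any Tika API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the flag-in-context pattern from TIKA-694.
// "ONLY_PROPERTIES" is a hypothetical key; Tika's real ParseContext
// stores typed objects rather than map entries like this.
public class PropertiesOnlySketch {
    static final String ONLY_PROPERTIES = "only-properties";

    // Stand-in for a parser that checks the flag before doing the
    // expensive content pass.
    static String parse(byte[] doc, Map<String, Object> context) {
        StringBuilder out = new StringBuilder("title=example"); // metadata pass
        boolean skipContent = Boolean.TRUE.equals(context.get(ONLY_PROPERTIES));
        if (!skipContent) {
            out.append("; content=").append(doc.length).append(" bytes parsed");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> ctx = new HashMap<>();
        ctx.put(ONLY_PROPERTIES, Boolean.TRUE);
        System.out.println(parse(new byte[1024], ctx));             // metadata only
        System.out.println(parse(new byte[1024], new HashMap<>())); // full parse
    }
}
```

The point of the design is that the flag rides along in the context object already threaded through every parse call, so no parser signatures change.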
[jira] [Commented] (TIKA-354) ProfilingHandler should take a length-limiting parameter
[ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342529#comment-14342529 ] Ken Krugler commented on TIKA-354: -- Better speed is still important: the 2x improvement from TIKA-1549 is good, but it means language detection still takes 45% of the web crawl time, versus 90% before. However, the right way to do this (with a new detector library) is to sample internally until the target confidence is reached, rather than having the caller decide how much text to analyze. So net-net, yes, I think this can be closed. ProfilingHandler should take a length-limiting parameter Key: TIKA-354 URL: https://issues.apache.org/jira/browse/TIKA-354 Project: Tika Issue Type: Improvement Components: languageidentifier Affects Versions: 0.5 Reporter: Vivek Magotra Assignee: Ken Krugler Attachments: TIKA-354-2.patch, TIKA-354.patch ProfilingHandler currently parses the entire document (thereby analyzing n-grams for the entire doc). ProfilingHandler should take a length-limiting parameter that allows a user to specify the amount of data that should get analyzed. In fact, by default that limit should be set to something like 8K. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-369) Improve accuracy of language detection
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342531#comment-14342531 ] Ken Krugler commented on TIKA-369: -- Hi Tyler - detection speed is an issue, but Tika also had accuracy problems. In Mike McCandless's tests, Tika was both 10x slower than language-detection and had about a 3.5x higher error rate IIRC (2.8% error rate vs. 0.8%). I think this issue should be left open, as it has interesting details on possible replacements for the current code that I don't think we want to lose. Improve accuracy of language detection -- Key: TIKA-369 URL: https://issues.apache.org/jira/browse/TIKA-369 Project: Tika Issue Type: Improvement Components: languageidentifier Affects Versions: 0.6 Reporter: Ken Krugler Assignee: Ken Krugler Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, textcat.pdf Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues: 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed. 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text. 3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
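The LLR statistic Dunning's paper proposes (item 1 above) can be shown in a few lines. This is a self-contained sketch of the G-squared log-likelihood ratio for a 2x2 contingency table, e.g. one 3-gram's count inside vs. outside a language profile; the class and method names are mine, not Tika's, and this is not the code the issue proposes verbatim.

```java
// Hedged sketch of Dunning's log-likelihood ratio (G^2) for a 2x2
// contingency table, the statistic proposed as a replacement for
// Pearson's chi-square on sparse n-gram counts.
public class Llr {
    // x * ln(x), with the conventional limit 0 * ln(0) = 0.
    static double xlogx(double x) { return x == 0 ? 0 : x * Math.log(x); }

    // k11/k12/k21/k22 are the four cell counts, e.g.
    // (ngram in doc, ngram elsewhere) x (in profile, not in profile).
    static double llr2x2(long k11, long k12, long k21, long k22) {
        double n = k11 + k12 + k21 + k22;
        double rowEntropy = xlogx(k11 + k12) + xlogx(k21 + k22);
        double colEntropy = xlogx(k11 + k21) + xlogx(k12 + k22);
        double matEntropy = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22);
        return 2.0 * (matEntropy - rowEntropy - colEntropy + xlogx(n));
    }

    public static void main(String[] args) {
        System.out.println(llr2x2(1, 1, 1, 1));   // ~0: counts look independent
        System.out.println(llr2x2(10, 0, 0, 10)); // large: strong association
    }
}
```

Unlike a fixed chi-square threshold, G^2 stays well-behaved for the small counts produced by short runs of text, which is exactly the failure mode items 1 and 2 describe.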
[jira] [Commented] (TIKA-756) XMP output from Tika CLI
[ https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342590#comment-14342590 ] Tyler Palsulich commented on TIKA-756: -- The only blocker on this is tika-xmp having a dependency on tika-parsers, right? XMP output from Tika CLI Key: TIKA-756 URL: https://issues.apache.org/jira/browse/TIKA-756 Project: Tika Issue Type: New Feature Components: cli, metadata Reporter: Jukka Zitting Assignee: Jörg Ehrlich Labels: metadata, xmp Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch It would be great if the Tika CLI could output metadata also in the XMP format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-788) DWG parser infinite loop on possibly corrupt file
[ https://issues.apache.org/jira/browse/TIKA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342609#comment-14342609 ] Tyler Palsulich commented on TIKA-788: -- [~seegler], it looks like your stack trace is related to parsing an mp3 file. Does anyone have a dwg file that triggers this error? Ideally, they would also have the set of Metadata values extracted by AutoCAD. DWG parser infinite loop on possibly corrupt file - Key: TIKA-788 URL: https://issues.apache.org/jira/browse/TIKA-788 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Stas Shaposhnikov When parsing some dwg items, it is possible that the parser may cause itself to go into an infinite loop. Attached is the file causing the problem. Here is a possible patch that will at least proceed until an error is thrown.
{noformat}
=== modified file 'tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java'
--- tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java 2011-11-24 11:30:33 +
+++ tika-parsers/src/main/java/org/apache/tika/parser/dwg/DWGParser.java 2011-11-25 05:27:41 +
@@ -274,8 +274,10 @@
         return false;
     }
     while (toSkip > 0) {
-        byte[] skip = new byte[Math.min((int) toSkip, 0x4000)];
-        IOUtils.readFully(stream, skip);
+        byte[] skip = new byte[(int) Math.min(toSkip, 0x4000)];
+        if (IOUtils.readFully(stream, skip) == -1) {
+            return false; //invalid skip
+        }
         toSkip -= skip.length;
     }
     return true;
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
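The intent of the patch above can be shown as a standalone loop. This sketch substitutes plain InputStream.read for Tika's IOUtils.readFully, and the class and method names (`SkipSketch`, `skipFully`) are illustrative, not from the DWGParser source:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Standalone sketch of the patched skip loop: read in bounded chunks
// and bail out when the stream ends early, instead of spinning forever
// on a truncated or corrupt file.
public class SkipSketch {
    static boolean skipFully(InputStream stream, long toSkip) throws IOException {
        byte[] buf = new byte[0x4000];
        while (toSkip > 0) {
            int want = (int) Math.min(toSkip, buf.length);
            int read = stream.read(buf, 0, want);
            if (read == -1) {
                return false; // premature EOF: the declared skip length was invalid
            }
            toSkip -= read;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(skipFully(new ByteArrayInputStream(new byte[100]), 100)); // true
        System.out.println(skipFully(new ByteArrayInputStream(new byte[10]), 100));  // false
    }
}
```

The key design point is that a corrupt header can declare a skip length far past the end of the stream; checking the read return value turns an infinite loop into a clean failure.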
[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342496#comment-14342496 ] Tyler Palsulich commented on TIKA-634: -- [~gagravarr], is this issue still relevant? Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-682: - Labels: new-parser (was: ) Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-682: - Affects Version/s: (was: 0.9) 1.8 Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-682: - Component/s: (was: metadata) parser Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-712) Master slide text isn't extracted
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342518#comment-14342518 ] Tyler Palsulich commented on TIKA-712: -- Is there any update on this? Otherwise, I'll close it as Won't Fix later this week. Master slide text isn't extracted - Key: TIKA-712 URL: https://issues.apache.org/jira/browse/TIKA-712 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx It looks like we are not getting text from the master slide for PPT and PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-713) Tika can not parse all of the persian pdf files
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-713. -- Resolution: Fixed Tika can not parse all of the persian pdf files --- Key: TIKA-713 URL: https://issues.apache.org/jira/browse/TIKA-713 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Reporter: Ahmad Ajiloo Attachments: Complex.pdf, Simple2.pdf, Simple3.pdf, ebrat.pdf Hello I used Tika (of course in Nutch) to parse some persian pdf files. some of the files clearly transformed to a plain text. but about some of them, output was corrupted. I used ICU4J v4 library and the text changed to right-to-left mode. but the mentioned problem didn't resolve. insofar as Tika can not understand any charachter of input persian pdf file! {quote} I copy this text from my pdf file via Document Viewer in Linux: this is a clearly persian text ! -- هر روز پس از نماز صبح، سوره مباركه الرحمن را تا فباي آلاء ربكما تكذبان بخواند. ) اين يعني 21 آيه اول سوره ، كه در قرآن رسم الخط عثمانطه تقريبا يك نصف صفحه است. ( همچنين در روايات از حضرت رسول )ص( و ائمه اطهار )ع( آمده كه چند چيز براي قوت حافظه مفيد است: 1- مسواك كردن 2- روزه گرفتن 3- قرائت قرآن؛ مخصوصا آيه الكرسي 4- خوردن عسل 5- خوردن عدس 6- خوردن گوشت نزديک گردن -- Tike returns this output ! -- 92 @A 8 * B C9D !D ) (?) =/ () ,8 ; 8 # + 9!: L #)4 M() * 0 * -3IA J - 2 (+ G H -1 (+ J 5#+C 0T J (+ O - 6R . (+ O - 5 PH. (+ O -4 -- {quote} thanks a lot -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342530#comment-14342530 ] Tyler Palsulich commented on TIKA-539: -- Hi [~kkrugler]. I didn't have a specific fix in mind when I closed it. But, I saw the two related issues have been resolved and no recent commentary. Apologies if the closure was premature. Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Fix For: 1.8 Attachments: TIKA-539.patch, TIKA-539_2.patch If the encoding in the meta tag is wrong, that encoding is detected, even if the right encoding was set in the metadata before (which can come from the HTTP response header). Test code to reproduce:
{code}
static String content = "<html><head>\n"
    + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
    + "</head><body>Über den Wolken\n</body></html>";

/**
 * @param args
 * @throws IOException
 * @throws TikaException
 * @throws SAXException
 */
public static void main(String[] args) throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(1);
    parser.parse(in, h, metadata, new ParseContext());
    System.out.print(h.toString());
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-768) Parser for EDF files
[ https://issues.apache.org/jira/browse/TIKA-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-768: - Labels: edf new-parser (was: edf) Parser for EDF files Key: TIKA-768 URL: https://issues.apache.org/jira/browse/TIKA-768 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Priority: Minor Labels: edf, new-parser In my spare time I'm occasionally working on biological signal processing, and now I have a case where being able to extract normalized metadata from EDF files (European Data Format, http://www.edfplus.info/) would be useful. Thus it would be nice to add a simple Tika parser that understands this format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-766) Trim down the NetCDF dependency
[ https://issues.apache.org/jira/browse/TIKA-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342602#comment-14342602 ] Tyler Palsulich commented on TIKA-766: -- Do we need to look into this more? Now with GRIB support and the new ucar dependencies, we are using more of the functionality. But, does anyone know if there are still licensing issues? The size of tika-app is getting unwieldy, so issues like this are worth investigating. Trim down the NetCDF dependency --- Key: TIKA-766 URL: https://issues.apache.org/jira/browse/TIKA-766 Project: Tika Issue Type: Improvement Components: packaging, parser Reporter: Jukka Zitting Priority: Minor As noted in TIKA-763, the NetCDF dependency contains a few LGPL classes that we should get rid of, ideally without the workaround added for TIKA-763. Additionally, with 4.2MB the NetCDF jar is pretty large and includes lots of stuff that isn't really related to parsing NetCDF and HDF files. It would be nice if the NetCDF project could produce a separately packaged read-only parser library that only contains the stuff needed by Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-770) New ODF metadata keys
[ https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342603#comment-14342603 ] Tyler Palsulich commented on TIKA-770: -- [~gagravarr], 3 years later, is it time? New ODF metadata keys - Key: TIKA-770 URL: https://issues.apache.org/jira/browse/TIKA-770 Project: Tika Issue Type: Improvement Components: metadata, parser Reporter: Jukka Zitting Priority: Minor Labels: odf Followup from TIKA-764. {quote} The 2nd step is to add a few extra common keys for the stats that ODF has that aren't covered, then remove the non standard keys {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-821) Support detecting old Microsoft Works Word Processor formats
[ https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-821. -- Resolution: Fixed Marking fixed based on the committed comment. Support detecting old Microsoft Works Word Processor formats Key: TIKA-821 URL: https://issues.apache.org/jira/browse/TIKA-821 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.1 Reporter: Antoni Mylka Assignee: Antoni Mylka Labels: new-parser An issue similar to TIKA-812. This time it's about old Works Word Processor formats. They use an OLE2 structure, but the top-level entry is called MatOST, and they are not supported by the OfficeParser. I would like to: # Add a magic to tika-mimetypes.xml to mark the file as ms-works if MatOST is found. (After TIKA-806 we officially like those). # Add an 'if' to POIFSContainerDetector to look for MatOST. I'm not creating a separate media type for this (like I did in TIKA-812) because no parser supports it anyway. In TIKA-812 it was necessary, because ExcelParser can't work with all vnd.ms-works files but can work with 7.0 spreadsheets. In this case there is no gain in a separate mime type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-630) Dealing with PDF documents from scanning programs
[ https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-630. -- Resolution: Fixed Dealing with PDF documents from scanning programs - Key: TIKA-630 URL: https://issues.apache.org/jira/browse/TIKA-630 Project: Tika Issue Type: Improvement Components: general Affects Versions: 0.10 Reporter: Joseph Vychtrle Priority: Minor Labels: ocr, pdf Hey, sorry I didn't post this to the mailing list, I kinda didn't get the confirmation. The issue is that people often don't even realize there is a difference between PDF documents (exported from OpenOffice/MS Office) and PDFs produced by scanner software. If Tika processes such a document, it detects the PDF content type, but there are only images in there. I don't know how to deal with that. There should be a function that decides on the type of PDF document, so that I can take a scanner-produced PDF and run it through some OCR software. If there is a way to do that, could anybody please explain how? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
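A minimal sketch of the kind of decision function the reporter asks for, assuming the text layer has already been extracted by a PDF parser. The class name and the 10-characters-per-page threshold are arbitrary illustrative choices, not anything Tika ships:

```java
// Hedged heuristic for TIKA-630: after text extraction, decide whether a
// PDF is likely a scanned (image-only) document that should be routed to
// OCR. Threshold and names are illustrative assumptions.
public class ScannedPdfHeuristic {
    static boolean looksScanned(String extractedText, int pageCount) {
        // Count non-whitespace characters; a genuine text PDF yields far
        // more than a scan, whose text layer is empty or near-empty.
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return pageCount > 0 && visible < 10L * pageCount;
    }

    public static void main(String[] args) {
        System.out.println(looksScanned("", 5));                                      // true
        System.out.println(looksScanned("A normal page of text.".repeat(20), 1));     // false
    }
}
```

Any real implementation would also want to check whether the pages actually contain images, since an empty text layer alone can't distinguish a scan from a blank document.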
[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342507#comment-14342507 ] Tyler Palsulich commented on TIKA-675: -- Is this still worth implementing? [~gagravarr], if you decide on metadata keys, I can take a crack at implementing this. But, not sure it'd be quick. PackageExtractor should track names of recursively nested resources --- Key: TIKA-675 URL: https://issues.apache.org/jira/browse/TIKA-675 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.10 Reporter: Andrzej Bialecki When parsing archive formats the hierarchy of names is not tracked, only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. In a way similar to the VFS model it would be nice to build pseudo-urls for nested resources. In case of Tika API that uses streams this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or otherwise track the parent-child relationship - e.g. some applications need this information to indicate what composite documents to delete from the index after a container archive has been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
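The VFS-style pseudo-URL naming proposed above can be illustrated with plain string handling: prepend one scheme per container layer and join the nested entry names with "!/". The class and method names here are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the nested-resource naming from TIKA-675.
public class NestedNameSketch {
    static String pseudoUrl(List<String> schemes, List<String> entries) {
        StringBuilder sb = new StringBuilder();
        for (String s : schemes) {
            sb.append(s).append(':'); // one scheme per container layer
        }
        sb.append("stream://").append(String.join("!/", entries));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pseudoUrl(
                Arrays.asList("tar", "gz"),
                Arrays.asList("example.tar.gz", "example.tar", "example.html")));
        // tar:gz:stream://example.tar.gz!/example.tar!/example.html
    }
}
```

Because each layer's name is preserved in order, an application can walk the "!/" segments to find every container ancestor of an indexed document — the parent-child tracking the issue asks for.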
[jira] [Commented] (TIKA-369) Improve accuracy of language detection
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342533#comment-14342533 ] Tyler Palsulich commented on TIKA-369: -- Thanks, Ken! In that case, I definitely agree. Improve accuracy of language detection -- Key: TIKA-369 URL: https://issues.apache.org/jira/browse/TIKA-369 Project: Tika Issue Type: Improvement Components: languageidentifier Affects Versions: 0.6 Reporter: Ken Krugler Assignee: Ken Krugler Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, textcat.pdf Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues: 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed. 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text. 3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605 ] Tyler Palsulich commented on TIKA-774: -- Do we still want to integrate this? Is this a semi-duplicate of TIKA-762? I agree that we should create another conflicting Parser for image types. ExifTool Parser --- Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, newbie, patch Fix For: 1.8 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types. In the core project: An ExifTool interface is added which contains Property objects that define the metadata fields available, plus an additional Property constructor for the internalTextBag type. In the parsers project: An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those have not been changed at this time. An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields if enabled. An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files. An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-807) PHP version of Tika
[ https://issues.apache.org/jira/browse/TIKA-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342610#comment-14342610 ] Tyler Palsulich commented on TIKA-807: -- [Here|https://github.com/pierroweb/PhpTikaWrapper] is one project which aims to do this. Should we leave this open, in case we want to integrate something within the Tika project? PHP version of Tika --- Key: TIKA-807 URL: https://issues.apache.org/jira/browse/TIKA-807 Project: Tika Issue Type: New Feature Components: packaging Reporter: Ingo Renner Priority: Minor Labels: PHP Inspired by TIKA-773 the outcome of this issue should be a PHP library/wrapper to easily work with Tika in PHP applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-648) Parsing HTML anchors with embedded div faulty
[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-648. Resolution: Won't Fix Parsing HTML anchors with embedded div faulty - Key: TIKA-648 URL: https://issues.apache.org/jira/browse/TIKA-648 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Reporter: Markus Jelsma Using Nutch with Tika 0.9, I cannot extract both outlinks from a given page [1]. This is because Tika doesn't return the document with the anchor text embedded, and Nutch skips empty anchors when collecting outlinks. The raw HTML is: <a href="#"><div>bla 1</div></a> <a href="#">bla 2</a> But the parsed HTML with tika-app-1.0-SNAPSHOT.jar -h test.html is: <a shape="rect" href="#"/>bla 1 <a shape="rect" href="#">bla 2</a> [1]: http://people.apache.org/~markus/test.html Also described on the Tika user list: http://search.lucidimagination.com/search/document/e74d7e72fd61543a/parsing_html_anchors_with_embedded_div_faulty -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs
[ https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342521#comment-14342521 ] Tyler Palsulich commented on TIKA-591: -- I bring up tika-batch (from [~talli...@apache.org]) because it's meant to provide a way to reliably run Tika on a large collection of documents -- killing the processing when Tika seems to be hanging indefinitely. But, I'm not sure if it's in an entirely different JVM, or just a different thread -- or if that even matters with regard to this issue. Separate launcher process for forking JVMs - Key: TIKA-591 URL: https://issues.apache.org/jira/browse/TIKA-591 Project: Tika Issue Type: Improvement Components: parser Reporter: Jukka Zitting Assignee: Jukka Zitting Priority: Minor As a followup to TIKA-416, it would be good to implement at least optional support for a separate launcher process for the ForkParser feature. The need for such an extra process came up in JCR-2864 where a reference to http://developers.sun.com/solaris/articles/subprocess/subprocess.html was made. To summarize, the problem is that the ProcessBuilder.start() call can result in a temporary duplication of the memory space of the parent JVM. Even with copy-on-write semantics this can be a fairly expensive operation and prone to out-of-memory issues especially in large-scale deployments where the parent JVM already uses the majority of the available RAM on a computer. A similar problem is also being discussed at HADOOP-5059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-715: - Labels: newbie (was: ) Some parsers produce non-well-formed XHTML SAX events - Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Michael McCandless Labels: newbie Fix For: 1.8 Attachments: TIKA-715.patch With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. Ie, each startElement(foo) is matched by the closing endElement(foo). I only did basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (eg closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures: {noformat} testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR! 
java.lang.AssertionError: end tag=body with no startElement at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158) testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR! java.lang.AssertionError: mismatched elements open=tr close=table at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287) at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136) at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) 
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) at javax.xml.parsers.SAXParser.parse(SAXParser.java:395) at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190) at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49) testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR! java.lang.AssertionError: p inside p at
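The verification that produces the assertion messages in the traces above is essentially a stack check. Below is a minimal self-contained sketch of that idea (not the actual SafeContentHandler code): each start event is pushed, each end event must match the top of the stack, and the one nesting rule tested is that p never appears inside p.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of the tag-balance check described in this issue (not
 * the actual SafeContentHandler implementation): startElement pushes on
 * a stack, endElement must match the element on top.
 */
public class TagBalanceChecker {
    private final Deque<String> open = new ArrayDeque<>();

    /** Returns null if the start event is acceptable, else an error. */
    public String startElement(String name) {
        if ("p".equals(name) && open.contains("p")) {
            return "p inside p";   // the one parent-nesting rule tested
        }
        open.push(name);
        return null;
    }

    /** Returns null if the end event matches, else an error message. */
    public String endElement(String name) {
        if (open.isEmpty()) {
            return "end tag=" + name + " with no startElement";
        }
        String top = open.pop();
        if (!top.equals(name)) {
            return "mismatched elements open=" + top + " close=" + name;
        }
        return null;
    }
}
```

Feeding this checker the event stream of the Keynote failure (open table, open tr, close table) reproduces the "mismatched elements open=tr close=table" message.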
[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements
[ https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342528#comment-14342528 ] Tyler Palsulich commented on TIKA-465: -- [~kkrugler], I commented in case someone else had more context. So, if you're happy to close, I am too. LanguageIdentifier API enhancements --- Key: TIKA-465 URL: https://issues.apache.org/jira/browse/TIKA-465 Project: Tika Issue Type: Improvement Components: languageidentifier Reporter: Chris A. Mattmann Assignee: Ken Krugler Priority: Minor As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set of improvements for the LanguageIdentifier that we should consider in Tika: {quote} More informations can be found on the following thread on Nutch-Dev mailing list: http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html Summary: 1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code. 2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity(). {quote} I just wanted to capture the issue here in Tika, since I'm about to close it out in Nutch since LanguageIdentification is something that can happen in Tika-ville... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-774: - Labels: features new-parser newbie patch (was: features newbie patch,) ExifTool Parser --- Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, new-parser, newbie, patch Fix For: 1.8 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types. In the core project: An ExifTool interface is added which contains Property objects that define the metadata fields available. An additional Property constructor for internalTextBag type. In the parsers project: An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser but those have not been changed at this time. An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields if enabled. An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files. An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342605#comment-14342605 ] Tyler Palsulich edited comment on TIKA-774 at 3/2/15 1:45 AM: -- Do we still want to integrate this? Is this a semi-duplicate of TIKA-762? I agree that we should create another conflicting Parser for image types. Our decision on this is related to TIKA-776. was (Author: tpalsulich): Do we still want to integrate this? Is this a semi-duplicate of TIKA-762? I agree that we should create another conflicting Parser for image types. ExifTool Parser --- Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, new-parser, newbie, patch Fix For: 1.8 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types. In the core project: An ExifTool interface is added which contains Property objects that define the metadata fields available. An additional Property constructor for internalTextBag type. In the parsers project: An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser but those have not been changed at this time. An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields if enabled. 
An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files. An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-289: Attachment: file-mimes-missing.txt Attached is the list of mime types extracted from the file(1) magic directory (found from grepping for {{!:mime}}) which aren't found in the Tika Mimetypes file Some of these will be aliases for ones we already have, some will be aliases where we also need to bring over magic, and some will be new ones This list doesn't include any for which we have a mime type, but don't currently have any magic Add magic byte patterns from file(1) Key: TIKA-289 URL: https://issues.apache.org/jira/browse/TIKA-289 Project: Tika Issue Type: Improvement Components: mime Reporter: Jukka Zitting Priority: Minor Attachments: file-mimes-missing.txt As discussed in TIKA-285, the file(1) command comes with a pretty comprehensive set of magic byte patterns. It would be nice to get those patterns included also in Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-289: Attachment: file-has-magic-tika-missing.txt {{file-has-magic-tika-missing.txt}} is the list of mime types where file(1) has magic but Tika does not, where both know about the same mime type. Note that there may be some false positives on this list, eg where Tika has the magic on a parent type Add magic byte patterns from file(1) Key: TIKA-289 URL: https://issues.apache.org/jira/browse/TIKA-289 Project: Tika Issue Type: Improvement Components: mime Reporter: Jukka Zitting Priority: Minor Attachments: file-has-magic-tika-missing.txt, file-mimes-missing.txt As discussed in TIKA-285, the file(1) command comes with a pretty comprehensive set of magic byte patterns. It would be nice to get those patterns included also in Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
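The magic byte patterns being imported here boil down to "expect these bytes at this offset". A toy illustration of that matching (Tika's real detector is driven by its mimetypes definition file; the two prefixes below are well-known magic numbers used only as examples):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Toy illustration of magic-byte detection: a pattern is a byte
 * sequence expected at a fixed offset. Tika's real detector reads its
 * patterns from tika-mimetypes.xml; this hand-coded pair is a sketch.
 */
public class MagicSketch {

    /** True if data contains pattern starting at offset. */
    static boolean matches(byte[] data, int offset, byte[] pattern) {
        if (offset + pattern.length > data.length) {
            return false;
        }
        return Arrays.equals(
                Arrays.copyOfRange(data, offset, offset + pattern.length),
                pattern);
    }

    /** Two well-known magic numbers, for illustration only. */
    static String detect(byte[] data) {
        if (matches(data, 0, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (matches(data, 0, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return "application/octet-stream";
    }
}
```

The false positives Nick mentions arise naturally in this model: a child type (say, a ZIP-based format) can be "covered" by magic registered on its parent type, so the child appears to lack magic even though detection still works.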
[jira] [Commented] (TIKA-591) Separate launcer process for forking JVMs
[ https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342232#comment-14342232 ] Luis Filipe Nassif commented on TIKA-591: - I think this is very important. We are having problems on Linux that I think are related to this while running the TesseractOCRParser. Sometimes the trace is similar to those posted in HADOOP-5059, sometimes it is outside of TesseractOCRParser, but I think it is related to a memory corruption caused by an early fork/exec. Reducing the max heap of the JVM helps a bit, but does not solve the issue. I don't know the tika-batch code, is it possible to use CompositeParser directly with tika-batch? Separate launcer process for forking JVMs - Key: TIKA-591 URL: https://issues.apache.org/jira/browse/TIKA-591 Project: Tika Issue Type: Improvement Components: parser Reporter: Jukka Zitting Assignee: Jukka Zitting Priority: Minor As a followup to TIKA-416, it would be good to implement at least optional support for a separate launcher process for the ForkParser feature. The need for such an extra process came up in JCR-2864 where a reference to http://developers.sun.com/solaris/articles/subprocess/subprocess.html was made. To summarize, the problem is that the ProcessBuilder.start() call can result in a temporary duplication of the memory space of the parent JVM. Even with copy-on-write semantics this can be a fairly expensive operation and prone to out-of-memory issues especially in large-scale deployments where the parent JVM already uses the majority of the available RAM on a computer. A similar problem is also being discussed at HADOOP-5059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: FeedBack required for Geographic Parser
On Sat, 28 Feb 2015, Gautham Shankar wrote: My progress has been updated on the below link. https://wiki.apache.org/tika/TikaGeographicInformationParser I would like you guys to comment on the Key Names that I have come up with for customized metadata, this could certainly be shortened. Ideally, we try not to invent our own metadata keys, but instead re-use definitions/standards from elsewhere. We also try to map format-specific keys onto general ones, to keep things consistent between different file types. From a quick glance, it looks like a few of the metadata entries you have are ones which could be mapped onto an existing key, and a few could be mapped onto new metadata properties from external standards. It might also be worth looking at some of the other scientific formats, and see if any commonality can be found with those / they can be changed to be common. Where there's a concept that's the same, the different formats should try to use the same metadata key. (As an example, as a user, you don't need to know if a file format uses Created On, Created At, First Created At, Created, or anything like that, you just fetch dc:created and it's the same thing across all formats, and you can go look up the Dublin Core specification if you want to check what it means semantically) Nick
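Nick's dc:created example amounts to a normalization table. A minimal sketch of the idea, assuming a hand-built map (the left-hand keys are illustrative format-specific spellings, not keys from any particular parser):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the key-mapping advice above: format-specific creation-date
 * keys (examples only) are all normalized to the shared Dublin Core
 * property, so consumers only ever look up "dc:created".
 */
public class KeyMappingSketch {
    private static final Map<String, String> TO_SHARED = new HashMap<>();
    static {
        // Left-hand keys are hypothetical format-specific spellings.
        TO_SHARED.put("Created On", "dc:created");
        TO_SHARED.put("Created At", "dc:created");
        TO_SHARED.put("First Created At", "dc:created");
        TO_SHARED.put("Created", "dc:created");
    }

    /** Map a raw key to its shared name, or keep it unchanged. */
    static String normalize(String rawKey) {
        return TO_SHARED.getOrDefault(rawKey, rawKey);
    }
}
```

In Tika itself this normalization is usually expressed through Property definitions rather than a string map, but the effect for the consumer is the same: one key per concept across all formats.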
[jira] [Updated] (TIKA-443) Geographic Information Parser
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-443: - Labels: new-parser (was: ) Geographic Information Parser - Key: TIKA-443 URL: https://issues.apache.org/jira/browse/TIKA-443 Project: Tika Issue Type: New Feature Components: parser Reporter: Arturo Beltran Assignee: Chris A. Mattmann Labels: new-parser Attachments: getFDOMetadata.xml I'm working in the automatic description of geospatial resources, and I think that might be interesting to incorporate new parser/s to Tika in order to manage and describe some geo-formats. These geo-formats include files, services and databases. If anyone is interested in this issue or want to collaborate do not hesitate to contact me. Any help is welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342526#comment-14342526 ] Tyler Palsulich commented on TIKA-715: -- List of parser tests that fail after applying the patch: {code} AutoDetectParserTest.testKeynote:164-assertAutoDetect:148-assertAutoDetect:132-assertAutoDetect:99 null AutoDetectParserTest.testPages:169-assertAutoDetect:148-assertAutoDetect:132-assertAutoDetect:99 mismatched elements open=div close=body AutoDetectParserTest.testZipBombPrevention:271 mismatched elements open=p close=div iBooksParserTest.testiBooksParser:40 mismatched elements open=title close=head IWorkParserTest.testKeynoteBulletPoints:115 null IWorkParserTest.testKeynoteMasterSlideTable:140 mismatched elements open=tr close=table IWorkParserTest.testKeynoteTables:127 null IWorkParserTest.testKeynoteTextBoxes:103 null IWorkParserTest.testPagesLayoutMode:204 mismatched elements open=div close=body IWorkParserTest.testParseKeynote:57 null IWorkParserTest.testParsePages:154 mismatched elements open=div close=body IWorkParserTest.testParsePagesHeadersAlphaLower:406 mismatched elements open=p close=div IWorkParserTest.testParsePagesHeadersAlphaUpper:385 mismatched elements open=p close=div IWorkParserTest.testParsePagesHeadersFootersFootnotes:316 mismatched elements open=p close=div IWorkParserTest.testParsePagesHeadersFootersRomanLower:364 mismatched elements open=p close=div IWorkParserTest.testParsePagesHeadersFootersRomanUpper:343 mismatched elements open=p close=div RFC822ParserTest.testEncryptedZipAttachment:277 null RFC822ParserTest.testMultipart:93 null RFC822ParserTest.testNormalZipAttachment:332 null RFC822ParserTest.testUnusualFromAddress:197 null MboxParserTest.testComplex:150 null ExcelParserTest.testExcel95:380 end tag=body with no startElement WordParserTest.testControlCharacter:383-TikaTest.getXML:114-TikaTest.getXML:123 mismatched elements open=a close=b 
OOXMLParserTest.testTextInsideTextBox:971-TikaTest.getXML:114-TikaTest.getXML:123 null ODFParserTest.testFromFile:342 null ODFParserTest.testOO3:58 null ODFParserTest.testOO3Metadata:218 null {code} Some parsers produce non-well-formed XHTML SAX events - Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Michael McCandless Fix For: 1.8 Attachments: TIKA-715.patch With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. Ie, each startElement(foo) is matched by the closing endElement(foo). I only did basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (eg closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures: {noformat} testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR! 
java.lang.AssertionError: end tag=body with no startElement at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158) testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR! java.lang.AssertionError: mismatched elements open=tr close=table at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252) at
[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements
[ https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342525#comment-14342525 ] Ken Krugler commented on TIKA-465: -- I'm actually working on a new language detector, so I think this can be closed. LanguageIdentifier API enhancements --- Key: TIKA-465 URL: https://issues.apache.org/jira/browse/TIKA-465 Project: Tika Issue Type: Improvement Components: languageidentifier Reporter: Chris A. Mattmann Assignee: Ken Krugler Priority: Minor As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set of improvements for the LanguageIdentifier that we should consider in Tika: {quote} More informations can be found on the following thread on Nutch-Dev mailing list: http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html Summary: 1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code. 2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity(). {quote} I just wanted to capture the issue here in Tika, since I'm about to close it out in Nutch since LanguageIdentification is something that can happen in Tika-ville... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-465) LanguageIdentifier API enhancements
[ https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler closed TIKA-465. Resolution: Won't Fix The change to the API to return more information about the detected languages is still interesting, but I think it makes more sense to look at using a different detector (e.g. language-detector/detection) versus improving the internal version that was ported from Nutch back in the day. LanguageIdentifier API enhancements --- Key: TIKA-465 URL: https://issues.apache.org/jira/browse/TIKA-465 Project: Tika Issue Type: Improvement Components: languageidentifier Reporter: Chris A. Mattmann Assignee: Ken Krugler Priority: Minor As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set of improvements for the LanguageIdentifier that we should consider in Tika: {quote} More informations can be found on the following thread on Nutch-Dev mailing list: http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html Summary: 1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code. 2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity(). {quote} I just wanted to capture the issue here in Tika, since I'm about to close it out in Nutch since LanguageIdentification is something that can happen in Tika-ville... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
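Jerome's first proposed enhancement (return an ordered array of language-code/score pairs instead of a single String) could be shaped as below. This is a hypothetical sketch of the API change, not Tika's actual language-identification code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the API change proposed in this issue: instead of a single
 * language code, return all candidates ordered best-first by score.
 * Types and names are hypothetical.
 */
public class LanguageResultSketch {

    public static final class LanguageScore {
        public final String language;
        public final double score;
        public LanguageScore(String language, double score) {
            this.language = language;
            this.score = score;
        }
    }

    /** Turn raw per-language scores into a best-first list. */
    static List<LanguageScore> rank(Map<String, Double> rawScores) {
        List<LanguageScore> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : rawScores.entrySet()) {
            result.add(new LanguageScore(e.getKey(), e.getValue()));
        }
        result.sort(Comparator.comparingDouble(
                (LanguageScore s) -> s.score).reversed());
        return result;
    }
}
```

A caller that only wants the old behavior just reads the first element, so the richer return type subsumes the single-String API.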
[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler updated TIKA-539: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Improvement Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Priority: Minor Fix For: 1.8 Attachments: TIKA-539.patch, TIKA-539_2.patch if the encoding in the meta tag is wrong, this encoding is detected, even if there is the right encoding set in metadata before (which can be from http response header). test code to reproduce:

static String content = "<html><head>\n"
    + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
    + "</head><body>Über den Wolken\n</body></html>";

/**
 * @param args
 * @throws IOException
 * @throws TikaException
 * @throws SAXException
 */
public static void main(String[] args) throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(-1);
    parser.parse(in, h, metadata, new ParseContext());
    System.out.print(h.toString());
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs
[ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342532#comment-14342532 ] Tyler Palsulich commented on TIKA-723: -- The default of behavior of Tika still prints out a letter or two per p tag. But, did we decide that this isn't a problem, since users can turn on sort by position? Rotated text isn't extracted correctly from PDFs Key: TIKA-723 URL: https://issues.apache.org/jira/browse/TIKA-723 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: rotated.pdf I have an example PDF with 90 degree rotation; Tika produces the characters one line at a time. Ie, the doc has Some rotated text, here! but Tika produces this: {noformat} bodydiv class=pagepSo m e r o t a t e d t e x t , h e r e !/p {noformat} I'm able to copy/paste the text out correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reopened TIKA-539: -- Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Fix For: 1.8 Attachments: TIKA-539.patch, TIKA-539_2.patch if the encoding in the meta tag is wrong, this encoding is detected, even if there is the right encoding set in metadata before (which can be from http response header). test code to reproduce:

static String content = "<html><head>\n"
    + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
    + "</head><body>Über den Wolken\n</body></html>";

/**
 * @param args
 * @throws IOException
 * @throws TikaException
 * @throws SAXException
 */
public static void main(String[] args) throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
    InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler h = new BodyContentHandler(-1);
    parser.parse(in, h, metadata, new ParseContext());
    System.out.print(h.toString());
    System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342547#comment-14342547 ] Ray Gauss II commented on TIKA-634: --- Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg]. There we recently had to patch {{ExternalParser}} for some stream parsing concurrency problems which should be raised in a separate issue here shortly. Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
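The metadata-extraction improvement tracked here is about matching patterns against an external tool's output. A self-contained sketch of that step, assuming "Key : Value" output lines (the sample lines in the test are invented; this is not Tika's actual ExternalParser code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of external-parser metadata extraction (not Tika's actual
 * ExternalParser): after running the tool, match each "Key : Value"
 * output line and collect the pairs into a metadata map.
 */
public class OutputPatternSketch {
    private static final Pattern LINE =
            Pattern.compile("^\\s*([^:]+?)\\s*:\\s*(.+)$");

    /** Parse tool output into key/value metadata pairs. */
    static Map<String, String> extract(String toolOutput) {
        Map<String, String> metadata = new HashMap<>();
        for (String line : toolOutput.split("\n")) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                metadata.put(m.group(1), m.group(2));
            }
        }
        return metadata;
    }
}
```

In a real integration each extracted key would then be mapped onto a Tika Property rather than kept as the tool's raw spelling.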
[jira] [Closed] (TIKA-765) add icu dependency
[ https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-765. Resolution: Won't Fix Closing as Won't Fix since the Persian character issues seem to be solved. add icu dependency -- Key: TIKA-765 URL: https://issues.apache.org/jira/browse/TIKA-765 Project: Tika Issue Type: Improvement Components: general Affects Versions: 0.10 Reporter: Robert Muir Spinoff of TIKA-713. In PDFBox, reflection is used to detect if ICU is available in the classpath: if it is, then it can use ICU BiDi support to properly extract right-to-left text. otherwise, the text is returned backwards. This is because the JDK does not provide the functionality needed to do this inverse BiDI reordering / arabic-unshaping. it would be nice to properly depend on this, so that these languages work out of box... we do this in Apache Solr's tika integration (contrib/extraction) for example. Unlike the charset detection code from ICU that tika includes, including BiDi support would be trickier, because it uses datafiles built from unicode (These change over time and would be a hassle to maintain). Additionally as a note: Tika has some forked charset code from ICU... long term it would be great to get those changes into ICU as well. Finally as an optimization its possible to reduce the icu4j jar size if needed with http://apps.icu-project.org/datacustom/, but maybe as a start we could just depend upon the 'whole' icu? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-852) Quicktime / MP4 Metadata Parser
[ https://issues.apache.org/jira/browse/TIKA-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-852: - Labels: new-parser (was: ) Quicktime / MP4 Metadata Parser --- Key: TIKA-852 URL: https://issues.apache.org/jira/browse/TIKA-852 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.0 Reporter: Nick Burch Assignee: Nick Burch Labels: new-parser Attachments: TIKA-852.patch From the investigations done for TIKA-851, it looks like a parser for the Quicktime format, and MP4 (which is an extension to it) shouldn't be too hard to do. This should be able to output some of the media metadata, such duration, dimensions, and MP4 audio tags Information resources on the format are linked from TIKA-851 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-879: - Labels: new-parser (was: ) Detection problem: message/rfc822 file is detected as text/plain. - Key: TIKA-879 URL: https://issues.apache.org/jira/browse/TIKA-879 Project: Tika Issue Type: Bug Components: metadata, mime Affects Versions: 1.0, 1.1, 1.2 Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6 Reporter: Konstantin Gribov Labels: new-parser Attachments: TIKA-879-thunderbird.eml When using {{DefaultDetector}}, the mime type for {{.eml}} files is different (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342670#comment-14342670 ] Tyler Palsulich commented on TIKA-879: -- [~lfcnassif], that seems like a reasonable solution. [~gagravarr], any objections to widening the range of the offset for magic detection? Detection problem: message/rfc822 file is detected as text/plain. - Key: TIKA-879 URL: https://issues.apache.org/jira/browse/TIKA-879 Project: Tika Issue Type: Bug Components: metadata, mime Affects Versions: 1.0, 1.1, 1.2 Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6 Reporter: Konstantin Gribov Labels: new-parser Attachments: TIKA-879-thunderbird.eml When using {{DefaultDetector}}, the mime type for {{.eml}} files is different (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in metadata or some {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
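The "widening the offset" idea above can be illustrated with a minimal sketch (not Tika's actual detector): instead of requiring a mail header token at byte 0, scan a small window at the start of the stream for well-known RFC 822 header names. The token list and the 64-byte window are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Illustrative magic-based detection of message/rfc822 that tolerates the
// header token appearing at a small offset, rather than only at byte 0.
public class Rfc822Magic {
    private static final String[] TOKENS = {
        "From:", "Received:", "Return-Path:", "Message-ID:", "Date:"
    };

    // Returns "message/rfc822" if a known mail header token occurs within
    // the first `window` bytes of the stream, otherwise "text/plain".
    public static String detect(byte[] prefix, int window) {
        String head = new String(prefix, 0, Math.min(prefix.length, window),
                StandardCharsets.ISO_8859_1);
        for (String token : TOKENS) {
            if (head.contains(token)) {
                return "message/rfc822";
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        byte[] eml = "Return-Path: <a@b.example>\nDate: Mon\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(eml, 64));
        System.out.println(detect("just some plain text"
                .getBytes(StandardCharsets.ISO_8859_1), 64));
    }
}
```

A wider window increases recall but also the risk of false positives on plain text that merely mentions "Date:", which is presumably why the offset range is a tuning decision.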
[jira] [Commented] (TIKA-893) Tika-server bundle includes wrong META-INF/services/org.apache.tika.parser.Parser, doesn't work
[ https://issues.apache.org/jira/browse/TIKA-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342705#comment-14342705 ] Tyler Palsulich commented on TIKA-893: -- Is this still an issue? From what I understand, all service files are parsed so that all services are loaded? Tika-server bundle includes wrong META-INF/services/org.apache.tika.parser.Parser, doesn't work --- Key: TIKA-893 URL: https://issues.apache.org/jira/browse/TIKA-893 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.1, 1.2 Environment: Apache Maven 2.2.1 (rdebian-6) Java version: 1.6.0_26 Java home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre Default locale: en_GB, platform encoding: UTF-8 OS name: linux version: 3.0.0-17-generic-pae arch: i386 Family: unix Reporter: Chris Wilson Labels: maven, patch Both vorbis-java-tika-0.1.jar and tika-parsers-1.1-SNAPSHOT.jar include different copies of META-INF/services/org.apache.tika.parser.Parser, which the auto-detecting parser needs to configure itself. AFAIK, only one of these can be included in a standalone OSGi JAR, as they both have the same filename. On my system at least, the vorbis one gets included in the JAR, and not the tika-parsers one. This means that the Tika server is capable of auto-detecting Vorbis files, but not Microsoft Office files, which is completely broken from my POV. Unless the (undocumented) Bnd contains some way to merge these files, I suggest either: * excluding the one from vorbis-java-tika (easy but removes Vorbis auto-detection); * bundling vorbis-java-tika as an embedded JAR instead of inlined (might work?); * including a manually combined copy of both manifests in tika-server/src/main/resources (ugly, requires maintenance); * finding or writing a maven plugin to merge these files (outside my maven-fu). 
My simple workaround, which probably removes Vorbis support completely, is this patch: {code:xml|title=tika-server/pom.xml.patch}
@@ -163,7 +168,7 @@
 <instructions>
 <Export-Package>org.apache.tika.*</Export-Package>
 <Embed-Dependency>
- !jersey-server;scope=compile;inline=META-INF/services/**|au/**|javax/**|org/**|com/**|Resources/**|font_metrics.properties|repackage/**|schema*/**,
+ !jersey-server;artifactId=!vorbis-java-tika;scope=compile;inline=META-INF/services/**|au/**|javax/**|org/**|com/**|Resources/**|font_metrics.properties|repackage/**|schema*/**,
 jersey-server;scope=compile;inline=com/**|META-INF/services/com.sun*|META-INF/services/javax.ws.rs.ext.RuntimeDelegate
 </Embed-Dependency>
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
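The "merge these files" option suggested in the issue is straightforward in principle: a META-INF/services provider-configuration file is just a newline-separated list of fully qualified class names, with `#` comments and blank lines ignored, so two copies from different JARs can be combined by taking the union of entries. A minimal sketch of such a merge step (this helper is hypothetical, not part of Tika or Bnd):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Merge the contents of several META-INF/services files (one provider class
// name per line, '#' starts a comment) into a single de-duplicated list.
public class ServiceFileMerger {
    public static List<String> merge(List<String> fileContents) {
        LinkedHashSet<String> entries = new LinkedHashSet<>();
        for (String content : fileContents) {
            for (String line : content.split("\n")) {
                // Strip comments and surrounding whitespace, skip blank lines.
                int hash = line.indexOf('#');
                if (hash >= 0) line = line.substring(0, hash);
                line = line.trim();
                if (!line.isEmpty()) entries.add(line);
            }
        }
        return new ArrayList<>(entries);
    }

    public static void main(String[] args) {
        List<String> merged = merge(java.util.Arrays.asList(
                "org.gagravarr.tika.VorbisParser\n",
                "# core parsers\norg.apache.tika.parser.microsoft.OfficeParser\n"));
        System.out.println(merged);
    }
}
```

Insertion order is preserved here, which matters if parser priority is order-sensitive; that is an assumption about the consuming loader, not a guarantee.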
[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-891: - Labels: newbie (was: ) Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-903) NPE thrown with password protected Pages file
[ https://issues.apache.org/jira/browse/TIKA-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-903. Resolution: Fixed No exception is thrown with Tika 1.8-SNAPSHOT. So, closing as fixed. NPE thrown with password protected Pages file - Key: TIKA-903 URL: https://issues.apache.org/jira/browse/TIKA-903 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Environment: Windows 7 Reporter: Gabriel Valencia Labels: iWork, nullpointerexception Attachments: testPagesVariousPwdProtected.pages When trying to view a password-protected Pages file in Tika GUI, you get an NPE: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.iwork.IWorkPackageParser@30583058 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279) at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94) at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77) at javax.swing.TransferHandler.importData(TransferHandler.java:756) at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1479) at java.awt.dnd.DropTarget.drop(DropTarget.java:445) at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1204) at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:531) at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:844) at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:768) at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:42) at 
java.awt.Component.dispatchEventImpl(Component.java:4498) at java.awt.Container.dispatchEventImpl(Container.java:2110) at java.awt.Component.dispatchEvent(Component.java:4471) at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4588) at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4323) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4174) at java.awt.Container.dispatchEventImpl(Container.java:2096) at java.awt.Window.dispatchEventImpl(Window.java:2490) at java.awt.Component.dispatchEvent(Component.java:4471) at java.awt.EventQueue.dispatchEvent(EventQueue.java:610) at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:280) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:195) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:185) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:180) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:172) at java.awt.EventDispatchThread.run(EventDispatchThread.java:133) Caused by: java.lang.NullPointerException at org.apache.tika.parser.iwork.IWorkPackageParser$IWORKDocumentType.detectType(IWorkPackageParser.java:125) at org.apache.tika.parser.iwork.IWorkPackageParser$IWORKDocumentType.access$000(IWorkPackageParser.java:71) at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:166) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 30 more I tried viewing the contents in 7-zip, but it tells me it can't understand the compression format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
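The NPE above originates in {{IWorkPackageParser$IWORKDocumentType.detectType}}: a password-protected Pages bundle does not expose the expected index entry, so type detection dereferences a null. A null-safe sketch of the idea (entry names and MIME strings here are hypothetical, not Tika's actual code):

```java
// Illustrative null-safe iWork type detection: return a fallback media type
// instead of throwing when no recognizable index entry is found (e.g. in a
// password-protected bundle).
public class IworkTypeDetector {
    // Returns a media type for a recognizable zip entry name, else null.
    public static String detectType(String entryName) {
        if (entryName == null) return null;
        if (entryName.endsWith("index.apxl")) return "application/vnd.apple.keynote";
        if (entryName.endsWith("index.xml")) return "application/vnd.apple.pages";
        return null;
    }

    // Scan all entries; fall back to a generic type rather than NPE.
    public static String detectOrUnknown(Iterable<String> entryNames) {
        for (String name : entryNames) {
            String type = detectType(name);
            if (type != null) return type; // first recognizable entry wins
        }
        return "application/octet-stream"; // nothing recognized: safe fallback
    }

    public static void main(String[] args) {
        System.out.println(detectOrUnknown(
                java.util.Arrays.asList("buildVersionHistory.plist")));
        System.out.println(detectOrUnknown(
                java.util.Arrays.asList("thumbs/1.jpg", "index.apxl")));
    }
}
```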
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342718#comment-14342718 ] Chris A. Mattmann commented on TIKA-891: I think it would be nice to convert the other PUT ones to POST where it makes sense. Do you have a list in mind? Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-836) parsing really slow on some documents
[ https://issues.apache.org/jira/browse/TIKA-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-836. Resolution: Cannot Reproduce We can't reproduce this without the problem files. If you still have them, please upload them and reopen! parsing really slow on some documents - Key: TIKA-836 URL: https://issues.apache.org/jira/browse/TIKA-836 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.0 Environment: CentOS 4.x/5.x/6.x Reporter: Rob Tulloh We are seeing that tika sometimes takes a very long time to parse some content (likely PDF). For example, with the following EML file that contains 4 documents (2 PDF, 1 MS Excel, 1 text): {noformat} fgrep --binary-file=text Content-Type: XXX.eml Content-Type: multipart/mixed; Content-Type: multipart/alternative; Content-Type: text/plain; Content-Type: text/html; Content-Type: application/octet-stream; Content-Type: application/octet-stream; Content-Type: application/vnd.ms-excel; du -sh XXX.eml 6.0M XXX.eml {noformat} Note that it takes tika nearly 30 minutes to process this content even though the source is only 6M in size: {noformat} time java -Xmx2G -jar ../../tika-app-1.0.jar -m XXX.eml meta.out WARN - Did not found XRef object at specified startxref position 230521 WARN - Did not found XRef object at specified startxref position 3742379 real 29m16.913s user 18m17.050s sys 0m19.465s {noformat} Is there any way to configure tika (in particular via solr) to process files more quickly? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-862) JPSS HDF5 files not being detected appropriately
[ https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-862. -- Resolution: Fixed Marking as fixed. The output from the above file is {code}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Mission_Name" content="NPP"/>
<meta name="Content-Length" content="20888"/>
<meta name="Distributor" content="noaa"/>
<meta name="N_HDF_Creation_Date" content="2022"/>
<meta name="N_HDF_Creation_Time" content="203300.301515Z"/>
<meta name="N_Collection_Short_Name" content="SPACECRAFT-DIARY-RDR"/>
<meta name="Instrument_Short_Name" content="SPACECRAFT"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.hdf.HDFParser"/>
<meta name="Platform_Short_Name" content="NPP"/>
<meta name="N_Dataset_Source" content="noaa"/>
<meta name="N_Dataset_Type_Tag" content="RDR"/>
<meta name="N_Processing_Domain" content="ops"/>
<meta name="Content-Type" content="application/x-hdf"/>
<meta name="resourceName" content="test.h5"/>
<title/>
</head>
<body/></html>
{code} JPSS HDF5 files not being detected appropriately Key: TIKA-862 URL: https://issues.apache.org/jira/browse/TIKA-862 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Richard Yu Assignee: Chris A. Mattmann Attachments: ASF.LICENSE.NOT.GRANTED--RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5, ASF.LICENSE.NOT.GRANTED--RNSCA-ROLPS_npp_d20120202_t1841338_e1842112_b01382_c20120202203730692328_noaa_ops.h5, RNSCA_npp_d2021_t1935200_e1935400_b00346_c2022203300301515_noaa_ops.h5 As commented in TIKA-614, JPSS HDF 5 files are not being properly detected by Tika. See this: from [~minfing]: {quote} We were trying to extract metadata from our h5 file (i.e. with JPSS extension). 
We ran the following command line: {noformat} [ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \ /usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5 Content-Encoding: windows-1252 Content-Length: 22187952 Content-Type: text/plain resourceName: SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5 [ryu@localhost hdf5extractor]$ {noformat} We noticed that the content type is text/plain and there are only 4 lines of output (i.e. we expected a lot of metadata). Let me know if more information is needed. Thanks! Richard {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
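Detection falling back to text/plain here means the HDF5 magic was never matched. The HDF5 superblock signature is the 8 bytes 0x89 "HDF" 0x0D 0x0A 0x1A 0x0A, and when the file carries a user block the signature instead sits at offset 512, 1024, 2048, and so on (doubling). A minimal sketch of signature-based detection, not Tika's actual detector:

```java
// Illustrative magic-byte detection for HDF5 (application/x-hdf): look for
// the superblock signature at offset 0 and at the doubling user-block
// offsets 512, 1024, 2048, ...
public class Hdf5Magic {
    private static final byte[] SIG = {
        (byte) 0x89, 'H', 'D', 'F', 0x0D, 0x0A, 0x1A, 0x0A
    };

    public static boolean isHdf5(byte[] data) {
        for (long off = 0; off + SIG.length <= data.length;
                off = (off == 0) ? 512 : off * 2) {
            if (matchesAt(data, (int) off)) return true;
        }
        return false;
    }

    private static boolean matchesAt(byte[] data, int off) {
        for (int i = 0; i < SIG.length; i++) {
            if (data[off + i] != SIG[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[600];
        System.arraycopy(SIG, 0, buf, 0, SIG.length);
        System.out.println(isHdf5(buf));
    }
}
```

A detector that only checks offset 0 would miss files with a user block, which may be part of why some JPSS products slipped through.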
[jira] [Updated] (TIKA-849) Identify and parse the Apple iBooks format
[ https://issues.apache.org/jira/browse/TIKA-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-849: - Labels: new-parser (was: ) Identify and parse the Apple iBooks format -- Key: TIKA-849 URL: https://issues.apache.org/jira/browse/TIKA-849 Project: Tika Issue Type: New Feature Components: mime, parser Affects Versions: 1.1 Reporter: Andrew Jackson Labels: new-parser Attachments: ibooks-support.patch With the release of iBooks Author 1.0, Apple have created a new eBook format very similar to ePub. Tika could be extended to identify and parse this new format, re-using the existing ePub code wherever possible. I have created an initial patch, which I will attach to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-858: - Labels: new-parser (was: ) Tika add parsing support for ANPA-1312 news wire feeds -- Key: TIKA-858 URL: https://issues.apache.org/jira/browse/TIKA-858 Project: Tika Issue Type: New Feature Components: mime, parser Affects Versions: 0.10 Reporter: Craig Stires Labels: new-parser Attachments: 7901V5.pdf, IptcAnpaParser.java, org.apache.tika.parser.Parser_ANPA.patch, tika-mimetypes_ANPA.patch This submission adds support for ANPA-1312 news wire feeds. Those feeds are the formats used by AP, AFP, NYT, Reuters in their daily news wire broadcasts. This was a pretty significant development effort, so am happy to share back as a thank you to the TIKA community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-858) Tika add parsing support for ANPA-1312 news wire feeds
[ https://issues.apache.org/jira/browse/TIKA-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342664#comment-14342664 ] Tyler Palsulich commented on TIKA-858: -- Does anyone have an ANPA file we can use to test? Tika add parsing support for ANPA-1312 news wire feeds -- Key: TIKA-858 URL: https://issues.apache.org/jira/browse/TIKA-858 Project: Tika Issue Type: New Feature Components: mime, parser Affects Versions: 0.10 Reporter: Craig Stires Labels: new-parser Attachments: 7901V5.pdf, IptcAnpaParser.java, org.apache.tika.parser.Parser_ANPA.patch, tika-mimetypes_ANPA.patch This submission adds support for ANPA-1312 news wire feeds. Those feeds are the formats used by AP, AFP, NYT, Reuters in their daily news wire broadcasts. This was a pretty significant development effort, so am happy to share back as a thank you to the TIKA community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-880) while integrating microsoft parser it is giving error
[ https://issues.apache.org/jira/browse/TIKA-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342672#comment-14342672 ] Tyler Palsulich commented on TIKA-880: -- Hi [~som.mukhopadhyay]. Thank you for raising this issue. Apologies that it hasn't gotten any attention. How exactly were you integrating Parsers? Were you ever able to resolve this? I'm going to close this as Can't Reproduce later this week if not. while integrating microsoft parser it is giving error - Key: TIKA-880 URL: https://issues.apache.org/jira/browse/TIKA-880 Project: Tika Issue Type: Wish Components: parser Affects Versions: 1.0 Environment: Android Reporter: Somenath Mukhopadhyay Labels: newbie Original Estimate: 12h Remaining Estimate: 12h I don't know if I should raise this problem as an issue in the Jira, but I have reached a roadblock. I am using Apache Tika for my Android development. I was successful in integrating most of the parsers. However, when I am trying to integrate Microsoft and Microsoft.ooxml, it is giving the most dreaded Conversion to Dalvik format failed with error 1. If someone can help me out in resolving this issue, that will be really fantastic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-887) Tika fails to parse some MP3 tags correctly and produces null characters in value
[ https://issues.apache.org/jira/browse/TIKA-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-887. -- Resolution: Fixed No objection and the linked file seemed to have valid metadata. So I'm marking this as fixed. Tika fails to parse some MP3 tags correctly and produces null characters in value - Key: TIKA-887 URL: https://issues.apache.org/jira/browse/TIKA-887 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0, 1.1 Reporter: Jens Hübel Priority: Minor I have a problem when extracting the comment tag from an MP3 file. It contains an invalid prefix, then a '\0' character, and then the real value of the tag. This happens with files downloaded from www.jamendo.com, for example this one: http://storage.newjamendo.com/download/track/450545/mp32/Swansong.mp3 It may be that the tags are not created properly on this site, but at least tools like mp3tag display them correctly. The extracted value looks like this: eng http://www.jamendo.com Attribution-Noncommercial-Share Alike 3.0 At position 3 there is a null character. The tag value should start with http... 
Here is the byte sequence at the beginning of this file: 49 44 33 04 00 00 00 01 18 32 54 49 54 32 00 00 00 09 00 00 03 53 77 61 6E 73 6F 6E 67 54 50 45 31 00 00 00 0E 00 00 03 4A 6F 73 68 20 57 6F 6F 64 77 61 72 64 54 41 4C 42 00 00 00 0C 00 00 03 42 72 65 61 64 63 72 75 6D 62 73 54 44 52 4C 00 00 00 05 00 00 03 32 30 30 39 43 4F 4D 4D 00 00 00 22 00 00 03 65 6E 67 49 44 33 20 76 31 20 43 6F 6D 6D 65 6E 74 00 41 74 74 72 69 62 75 74 69 6F 6E 20 33 2E 30 54 43 4F 4E 00 00 00 06 00 00 03 28 32 35 35 29 54 50 55 42 00 00 00 08 00 00 03 4A 61 6D 65 6E 64 6F 43 4F 4D 4D 00 00 00 2C 00 00 03 65 6E 67 00 68 74 74 70 3A 2F 2F 77 77 77 2E 6A 61 6D 65 6E 64 6F 2E 63 6F 6D 20 41 74 74 72 69 62 75 74 69 6F 6E 20 33 2E 30 20 54 43 4F 50 00 00 01 1F 00 00 03 32 30 30 39 2D 31 30 2D 32 31 54 31 31 3A 31 31 3A 32 30 2B 30 31 3A 30 30 20 4A 6F 73 68 20 57 6F 6F 64 77 61 72 64 2E 20 4C 69 63 65 6E 73 65 64 20 74 6F 20 74 68 ID3..2TIT2...SwansongTPE1...Josh WoodwardTALB...BreadcrumbsTDRL...2009COMM..engID3 v1 Comment.Attribution 3.0TCON...(255)TPUB...JamendoCOMM...,...eng.http://www.jamendo.com Attribution 3.0 TCOP...2009-10-21T11:11:20+01:00 Josh Woodward. Licensed to th -- This message was sent by Atlassian JIRA (v6.3.4#6332)
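The byte dump explains the symptom: an ID3v2 COMM frame body is laid out as an encoding byte, a 3-byte language code, a null-terminated description, and then the comment text, so the extracted value must be taken from after the first null terminator rather than from the whole body. A simplified parsing sketch (it assumes a single-byte encoding such as ISO-8859-1 or UTF-8; UTF-16 encodings use a two-byte terminator, and this is not Tika's actual implementation):

```java
import java.nio.charset.StandardCharsets;

// Split an ID3v2 COMM frame body into (language, description, text):
// <encoding byte><3-byte language><description>\0<comment text>
public class CommFrame {
    public static String[] parse(byte[] body) {
        String language = new String(body, 1, 3, StandardCharsets.ISO_8859_1);
        // Find the null terminator that ends the description field.
        int nul = 4;
        while (nul < body.length && body[nul] != 0) nul++;
        String description = new String(body, 4, nul - 4, StandardCharsets.UTF_8);
        String text = (nul + 1 <= body.length)
                ? new String(body, nul + 1, body.length - nul - 1, StandardCharsets.UTF_8)
                : "";
        return new String[] { language, description, text };
    }

    public static void main(String[] args) {
        // First COMM frame from the hex dump above:
        // 03 "eng" "ID3 v1 Comment" 00 "Attribution 3.0"
        byte[] body = ("\u0003" + "engID3 v1 Comment" + "\0" + "Attribution 3.0")
                .getBytes(StandardCharsets.ISO_8859_1);
        String[] parts = parse(body);
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```

Treating the whole body after the language code as the value is exactly what produces the "prefix, then '\0', then real value" artifact reported above.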
[jira] [Closed] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-888. Resolution: Fixed Tika is now using Java 1.6 (talking about 1.7) and there were some Java 1.5 compatibility updates making tests pass. Marking as Fixed. NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5 -- Key: TIKA-888 URL: https://issues.apache.org/jira/browse/TIKA-888 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Uwe Schindler Assignee: Chris A. Mattmann Lucene/Solr developers ran this tool before releasing Lucene/Solr 3.6 (Solr 3.6 is still required to run on Java 1.5, see SOLR-3295): http://code.google.com/p/versioncheck/ {noformat} Major.Minor Version : 50.0 JAVA compatibility : Java 1.6 platform: 45.3-50.0 Number of classes : 60 Classes are: c:\Work\lucene-solr\.\solr\contrib\extraction\lib\netcdf-4.2-min.jar [:] ucar/unidata/geoloc/Bearing.class ... {noformat} TIKA should use a 1.5 version of this class and especially do some Java 5 tests before releasing (as it's build dependencies says, it's minimum Java5). I tried to compile and run TIKA tests with Java 1.5 - crash (Invalid class file format). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342703#comment-14342703 ] Tyler Palsulich commented on TIKA-891: -- I made a couple changes related to this for TIKA-1547 (use POST for forms). Should we still convert the other PUT resources to POST? Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Curating Issues
On Sun, 1 Mar 2015, Tyler Palsulich wrote: I've started labeling some issues as new-parser and newbie. I think these should be helpful for organization. Please let me know if there is another label we've already been using for those. I put new-parser on any requests to support a new filetype, even if it doesn't require a full on Parser (e.g. just magic). I don't know if anyone has the time to mentor, but there's just about still time to get something into GSoC for 2015. If we do have someone who could mentor a student in the summer, then it could be worth tagging any summer sized issues with gsoc2015. http://community.apache.org/gsoc.html has some more info for anyone new to gsoc Nick
[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342709#comment-14342709 ] Tyler Palsulich commented on TIKA-894: -- [~lewismc], if you have the time, this would be great to have. Add webapp mode for Tika Server, simplifies deployment -- Key: TIKA-894 URL: https://issues.apache.org/jira/browse/TIKA-894 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.1, 1.2 Reporter: Chris Wilson Labels: maven, newbie, patch Fix For: 1.8 Attachments: tika-server-webapp.patch For use in production services, Tika Server should really be deployed as a WAR file, under a reliable servlet container that knows how to run as a system service, for example Tomcat or JBoss. This is especially important on Windows, where I wasted an entire day trying to make TikaServerCli run as some kind of a service. Maven makes building a webapp pretty trivial. With the attached patch applied, mvn war:war should work. It seems to run fine in Tomcat, which makes Windows deployment much simpler. Just install Tomcat and drop the WAR file into tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-897) UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM
[ https://issues.apache.org/jira/browse/TIKA-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-897. Resolution: Fixed Closing as fixed per Nick's comment above. We can open a new issue if someone wants UTF-32 XML detection support. UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM Key: TIKA-897 URL: https://issues.apache.org/jira/browse/TIKA-897 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.1 Reporter: Wade Taylor Priority: Minor Detection of XML fails when encoded as UTF-8. The UTF-8 BOM: 0xEF,0xBB,0xBF causes the XML detector to fail when trying to match {{<?xml}} at the beginning of the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
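The fix amounts to skipping a UTF-8 BOM, if present, before matching the XML magic. A minimal sketch of that logic (illustrative only, not Tika's detector):

```java
import java.nio.charset.StandardCharsets;

// Skip a UTF-8 BOM (EF BB BF) before matching the "<?xml" magic bytes.
public class XmlMagic {
    public static boolean looksLikeXml(byte[] data) {
        int off = 0;
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            off = 3; // BOM present: start matching after it
        }
        byte[] magic = "<?xml".getBytes(StandardCharsets.US_ASCII);
        if (data.length < off + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[off + i] != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXml(
                "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

UTF-16 and UTF-32 documents would need the same treatment for their BOMs (FE FF, FF FE, etc.) plus a decode step, which is the follow-up work the closing comment alludes to.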
[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case
[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-911: - Affects Version/s: (was: 1.1) 1.8 Converted PDF document contains question marks in place of spaces and inconsistent case --- Key: TIKA-911 URL: https://issues.apache.org/jira/browse/TIKA-911 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Reporter: Matt Sheppard Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity Brochure.pdf.html The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using {code} $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf {code} Produces substantially worse output than xpdf's pdftotext program. Specifically, we see... Some 'spaces' replaced with question marks {noformat} ... <body><div class="page"><p/> <p>How can I help? When you're overseas: • ?wherever?possible,?don't?visit?crops?—?contact?with? </p> <p>growing?crops?greatly?increases?the?risk?of?contaminating? footwear?or?clothing;? ... {noformat} and some odd case conversions {noformat} <p>stem rust in wheat. (soURce: BRAd collIs)</p> <p/> </div> {noformat} (The original document seems to contain SOURCE: BRAD COLLIS all in upper case.) To compare that with pdftotext {code} $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf {code} This does not output the question marks, and produces Source: BRAD COLLIS at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters which are not desirable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case
[ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342716#comment-14342716 ] Tyler Palsulich commented on TIKA-911: -- Still seeing this issue (question marks instead of spaces) on a Mac with Tika 1.8-SNAPSHOT. {{mvn -version}}: {code} Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T16:58:10-04:00) Maven home: /usr/local/Cellar/maven/3.2.3/libexec Java version: 1.7.0_71, vendor: Oracle Corporation Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre Default locale: en_US, platform encoding: UTF-8 OS name: mac os x, version: 10.10.2, arch: x86_64, family: mac {code} Converted PDF document contains question marks in place of spaces and inconsistent case --- Key: TIKA-911 URL: https://issues.apache.org/jira/browse/TIKA-911 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Reporter: Matt Sheppard Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity Brochure.pdf.html The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using {code} $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf {code} Produces substantially worse output than xpdf's pdftotext program. Specifically, we see... Some 'spaces' replaced with question marks {noformat} ... <body><div class="page"><p/> <p>How can I help? When you're overseas: • ?wherever?possible,?don't?visit?crops?—?contact?with? </p> <p>growing?crops?greatly?increases?the?risk?of?contaminating? footwear?or?clothing;? ... {noformat} and some odd case conversions {noformat} <p>stem rust in wheat. (soURce: BRAd collIs)</p> <p/> </div> {noformat} (The original document seems to contain SOURCE: BRAD COLLIS all in upper case.) 
To compare that with pdftotext {code} $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf {code} This does not output the question marks, and produces Source: BRAD COLLIS at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters which are not desirable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-885) Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader
[ https://issues.apache.org/jira/browse/TIKA-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342675#comment-14342675 ] Tyler Palsulich commented on TIKA-885: -- [~lfcnassif], is this issue superseded by TIKA-1007? Or, should we keep this open? Possible ConcurrentModificationException while accessing Metadata produced by ParsingReader --- Key: TIKA-885 URL: https://issues.apache.org/jira/browse/TIKA-885 Project: Tika Issue Type: Improvement Components: metadata, parser Affects Versions: 1.0 Environment: jre 1.6_25 x64 and Windows7 Enterprise x64 Reporter: Luis Filipe Nassif Priority: Minor Labels: patch Oracle PipedReader and PipedWriter classes have a bug that does not allow them to execute concurrently, because they notify each other only when the pipe is full or empty, and not after a char is read or written to the pipe. So I modified ParsingReader to use modified versions of PipedReader and PipedWriter, similar to the gnu versions of them, that work concurrently. However, sometimes and with certain files, I am getting the following error: java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(Unknown Source) at java.util.HashMap$KeyIterator.next(Unknown Source) at java.util.AbstractCollection.toArray(Unknown Source) at org.apache.tika.metadata.Metadata.names(Metadata.java:146) It is because the ParsingReader.ParsingTask thread is writing metadata while it is being read by the ParsingReader thread, with files containing metadata beyond their initial bytes. It will not occur with the current implementation, because the java PipedReader and PipedWriter block each other, which is a performance bug that affects ParsingReader, but they could be fixed in a future java release. I think it would be a defensive approach to make access to the private Metadata.metadata Map synchronized, which could avoid a possible future problem using ParsingReader. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
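The defensive change proposed above can be sketched with a simplified stand-in for org.apache.tika.metadata.Metadata: guard the backing HashMap with synchronization so a reader thread calling names() cannot hit a ConcurrentModificationException while a parser thread is still adding entries. This is an illustration of the approach, not Tika's actual class.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified metadata store with synchronized access, so concurrent
// writer (parser) and reader threads never iterate the map mid-mutation.
public class SafeMetadata {
    private final Map<String, String> metadata = new HashMap<>();

    public synchronized void set(String name, String value) {
        metadata.put(name, value);
    }

    public synchronized String get(String name) {
        return metadata.get(name);
    }

    // Returns a snapshot array, so callers iterate without holding the lock.
    public synchronized String[] names() {
        return metadata.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        SafeMetadata md = new SafeMetadata();
        md.set("Content-Type", "application/pdf");
        md.set("Author", "Example");
        System.out.println(md.names().length);
    }
}
```

Returning a snapshot from names() is the key design choice: the caller's iteration is decoupled from later writes, which is exactly the race seen in the stack trace above (AbstractCollection.toArray iterating while the map changes).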
[jira] [Closed] (TIKA-899) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension
[ https://issues.apache.org/jira/browse/TIKA-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-899. Resolution: Duplicate [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension - Key: TIKA-899 URL: https://issues.apache.org/jira/browse/TIKA-899 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Claudiu Muresan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-898) [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension
[ https://issues.apache.org/jira/browse/TIKA-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-898. Resolution: Cannot Reproduce There are a few ways to configure available Parsers. You can use the new blacklist feature, configuration in TIKA-1509, or pull out the underlying dependencies. Closing this as Cannot Reproduce, since I'm not sure what the exact issue is. [Jackrabbit 2.4 - Tika Parser 1.0] How to configure AutoDetectParser for not detecting content when using files without extension - Key: TIKA-898 URL: https://issues.apache.org/jira/browse/TIKA-898 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.0 Reporter: Claudiu Muresan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342722#comment-14342722 ] Tyler Palsulich commented on TIKA-891: -- There are only 3 -- getText, getXML, getHTML. So, easy list. Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342725#comment-14342725 ] Chris A. Mattmann commented on TIKA-891: ahh if it's get anything I would recommend making them GET methods with @GET Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
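Max's {code}@POST{code} suggestion is a one-annotation change per method in spirit, but JAX-RS allows exactly one HTTP-method designator per Java method, so in practice supporting both verbs means a thin delegating overload. A hedged sketch of what that could look like (class and method names are illustrative, not the actual tika-server source):

```java
import java.io.InputStream;
import javax.ws.rs.POST;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

// Illustrative only: one designator per method, so both verbs are
// exposed via a pair of methods sharing the same extraction logic.
@Path("/tika")
public class TikaResourceSketch {

    @PUT
    @Produces("text/plain")
    public String getText(InputStream body) {
        return extract(body);
    }

    @POST
    @Produces("text/plain")
    public String getTextPost(InputStream body) {
        return getText(body); // same behavior, friendlier verb for clients
    }

    private String extract(InputStream body) {
        // ...real code would run the stream through Tika here...
        return "";
    }
}
```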
Re: Curating Issues
Good idea, Nick. The vision parser I threw up I labeled with gsoc2015 - if there are any takers, please send them my way! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Nick Burch apa...@gagravarr.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Sunday, March 1, 2015 at 8:14 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Curating Issues On Sun, 1 Mar 2015, Tyler Palsulich wrote: I've started labeling some issues as new-parser and newbie. I think these should be helpful for organization. Please let me know if there is another label we've already been using for those. I put new-parser on any requests to support a new filetype, even if it doesn't require a full on Parser (e.g. just magic). I don't know if anyone has the time to mentor, but there's just about still time to get something into GSoC for 2015. If we do have someone who could mentor a student in the summer, then it could be worth tagging any summer sized issues with gsoc2015. http://community.apache.org/gsoc.html has some more info for anyone new to gsoc Nick
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342738#comment-14342738 ] Nick Burch commented on TIKA-879: - It might be good to try the widened versions with Tika Batch, to see if they cause any noticeable slowdown or false positives across a wide range of files? I still think this isn't a file format that can be fully reliably detected with mime magic alone; ideally we need a dedicated detector for it, as mentioned above, to fully solve this and related detection cases (e.g. multipart/signed). Detection problem: message/rfc822 file is detected as text/plain. - Key: TIKA-879 URL: https://issues.apache.org/jira/browse/TIKA-879 Project: Tika Issue Type: Bug Components: metadata, mime Affects Versions: 1.0, 1.1, 1.2 Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6 Reporter: Konstantin Gribov Labels: new-parser Attachments: TIKA-879-thunderbird.eml When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
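Nick's point -- that mime magic alone can't reliably separate message/rfc822 from plain text -- is easy to see with a toy prefix-based detector (purely illustrative; this is not Tika's detection code, and the prefix list is an assumption):

```java
import java.util.List;

// Toy sketch of magic-style detection for RFC 822 messages: look for a
// handful of header prefixes at the start of the stream.
class Rfc822MagicSketch {
    private static final List<String> MAGIC = List.of(
            "From ", "Return-Path:", "Received:", "Message-ID:", "Date:");

    static boolean looksLikeEmail(String head) {
        for (String m : MAGIC) {
            if (head.startsWith(m)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A real message is caught...
        System.out.println(looksLikeEmail("Received: from mx.example.org by ..."));
        // ...but so is a plain-text note that merely begins with a
        // header-looking word, which is why widening the patterns risks
        // false positives and a dedicated detector is the better fix.
        System.out.println(looksLikeEmail("Date: let's meet on Friday"));
    }
}
```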
[jira] [Updated] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-634: Labels: new-parser (was: ) Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor Labels: new-parser As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342743#comment-14342743 ] Nick Burch commented on TIKA-634: - We still seem to lack proper unit tests for {{ExternalParser}} in the Tika Core module, so I think it needs to stay open until some are added, and until Ray is happy it's all working fine for ffmpeg as well! Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor Labels: new-parser As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342749#comment-14342749 ] Nick Burch commented on TIKA-675: - I think this is already handled by the RecursiveParserWrapper, via the EMBEDDED_RESOURCE_PATH metadata key? PackageExtractor should track names of recursively nested resources --- Key: TIKA-675 URL: https://issues.apache.org/jira/browse/TIKA-675 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.10 Reporter: Andrzej Bialecki When parsing archive formats the hierarchy of names is not tracked, only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. In a way similar to the VFS model it would be nice to build pseudo-urls for nested resources. In case of Tika API that uses streams this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or otherwise track the parent-child relationship - e.g. some applications need this information to indicate what composite documents to delete from the index after a container archive has been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
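The parent/child tracking Andrzej asks for boils down to accumulating container names into a single path as parsing recurses, in the spirit of the EMBEDDED_RESOURCE_PATH metadata key Nick mentions. The separator and builder below are illustrative assumptions, not Tika's actual implementation:

```java
// Illustrative sketch: join nested resource names into one
// "/outer/inner/leaf" style pseudo-path, so an application can tell
// which container a deeply embedded document came from.
class EmbeddedPathSketch {
    static String embeddedPath(String... components) {
        StringBuilder sb = new StringBuilder();
        for (String c : components) {
            sb.append('/').append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. an HTML file inside a tar inside a gzip'd tar
        System.out.println(
                embeddedPath("example.tar.gz", "example.tar", "example.html"));
    }
}
```

An indexer can then delete every document whose path starts with the container's path when the archive itself is removed.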
[jira] [Commented] (TIKA-712) Master slide text isn't extracted
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342750#comment-14342750 ] Nick Burch commented on TIKA-712: - I think it might already be as fixed as it can be? It isn't perfect, as POI's HSLF can't detect the boilerplate text there yet, but otherwise I think it's pretty much there. [~mikemccand] can hopefully confirm? Master slide text isn't extracted - Key: TIKA-712 URL: https://issues.apache.org/jira/browse/TIKA-712 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Attachments: TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx It looks like we are not getting text from the master slide for PPT and PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-727) Improve the outputed XHTML by HSLFExtractor
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-727. - Resolution: Fixed I believe this has been fixed for some time, so I'm closing it. If you still have this problem, please re-open the bug and attach a small test file which shows the problem! Improve the outputed XHTML by HSLFExtractor --- Key: TIKA-727 URL: https://issues.apache.org/jira/browse/TIKA-727 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.10 Reporter: Pablo Queixalos Priority: Minor Attachments: HSLFExtractor.java, HSLFExtractor.patch The XHTML output of HSLFExtractor parser is not pure XHTML, it only inserts the full text into a P[aragraph] tag (including non-html carriage returns). This behavior comes from the poor capabilities that the POI PowerPointExtractor offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-770) New ODF metadata keys
[ https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342764#comment-14342764 ] Nick Burch commented on TIKA-770: - I think this probably wants to be a Tika 2.0 fix. We have some other metadata keys in there which have also been deprecated, so it's probably best to remove them all in one go in Tika 2.0, rather than remove a few now, and the rest later, to avoid confusion New ODF metadata keys - Key: TIKA-770 URL: https://issues.apache.org/jira/browse/TIKA-770 Project: Tika Issue Type: Improvement Components: metadata, parser Reporter: Jukka Zitting Priority: Minor Labels: odf Followup from TIKA-764. {quote} The 2nd step is to add a few extra common keys for the stats that ODF has that aren't covered, then remove the non standard keys {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1531) Upgrade to POI 3.12-beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342768#comment-14342768 ] Nick Burch commented on TIKA-1531: -- Apache POI 3.12 beta 1 was released over the weekend, in case anyone wants to tackle this! Upgrade to POI 3.12-beta1 when available Key: TIKA-1531 URL: https://issues.apache.org/jira/browse/TIKA-1531 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Opening this issue to track integration items with POI 3.12-beta1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Curating Issues
I'll keep GSOC in mind. We should also start labeling issues with 2.0. Tyler On Mar 1, 2015 11:39 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Good idea, Nick. The vision parser I threw up I labeled with gsoc2015 - if there are any takers, please send them my way! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Nick Burch apa...@gagravarr.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Sunday, March 1, 2015 at 8:14 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: Curating Issues On Sun, 1 Mar 2015, Tyler Palsulich wrote: I've started labeling some issues as new-parser and newbie. I think these should be helpful for organization. Please let me know if there is another label we've already been using for those. I put new-parser on any requests to support a new filetype, even if it doesn't require a full on Parser (e.g. just magic). I don't know if anyone has the time to mentor, but there's just about still time to get something into GSoC for 2015. If we do have someone who could mentor a student in the summer, then it could be worth tagging any summer sized issues with gsoc2015. http://community.apache.org/gsoc.html has some more info for anyone new to gsoc Nick
Re: Curating Issues
On Mon, 2 Mar 2015, Tyler Palsulich wrote: I'll keep GSOC in mind. We should also start labeling issues with 2.0. I think we only have a few issues for that currently, mostly around metadata keys, but it may grow! As a reminder for everyone, don't forget we've got a wiki page at https://wiki.apache.org/tika/Tika2_0RoadMap to track the things we need a major version number bump to change/break Nick
[jira] [Updated] (TIKA-912) Response charset encoding not declared, and depends on host OS (Windows/Linux)
[ https://issues.apache.org/jira/browse/TIKA-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-912: - Attachment: TIKA-912.palsulich.patch Attached an updated patch which adds charset info to each {{@Produces}} annotation. If no one objects, I'll commit it this week. Response charset encoding not declared, and depends on host OS (Windows/Linux) -- Key: TIKA-912 URL: https://issues.apache.org/jira/browse/TIKA-912 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.1 Environment: java version 1.6.0_26 Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode) java version 1.6.0_31 Java(TM) SE Runtime Environment (build 1.6.0_31-b05) Java HotSpot(TM) Client VM (build 20.6-b01, mixed mode, sharing) Reporter: Chris Wilson Labels: newbie, patch Attachments: TIKA-912.palsulich.patch, TikaResource-utf8-response.patch, TikaResource.java.patch When the response to the /tika servlet contains non-ASCII characters, Tika doesn't tell us what encoding it's using, and the encoding differs depending on which OS the server is running on. 
This is a server running on Tomcat on Linux: {code} chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T documents/fixtures/smartquote-bullet.docx http://localhost:8080/tika/tika | hexdump -C 48 54 54 50 2f 31 2e 31 20 31 30 30 20 43 6f 6e |HTTP/1.1 100 Con| 0010 74 69 6e 75 65 0d 0a 0d 0a 48 54 54 50 2f 31 2e |tinueHTTP/1.| 0020 31 20 32 30 30 20 4f 4b 0d 0a 53 65 72 76 65 72 |1 200 OK..Server| 0030 3a 20 41 70 61 63 68 65 2d 43 6f 79 6f 74 65 2f |: Apache-Coyote/| 0040 31 2e 31 0d 0a 43 6f 6e 74 65 6e 74 2d 54 79 70 |1.1..Content-Typ| 0050 65 3a 20 74 65 78 74 2f 70 6c 61 69 6e 0d 0a 54 |e: text/plain..T| 0060 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e 67 |ransfer-Encoding| 0070 3a 20 63 68 75 6e 6b 65 64 0d 0a 44 61 74 65 3a |: chunked..Date:| 0080 20 46 72 69 2c 20 30 34 20 4d 61 79 20 32 30 31 | Fri, 04 May 201| 0090 32 20 31 39 3a 34 30 3a 35 34 20 47 4d 54 0d 0a |2 19:40:54 GMT..| 00a0 0d 0a e2 80 99 0a e2 80 a2 09 0a |...| 00ab {code} And this is a server running on Tomcat on Windows: {code} chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T documents/fixtures/smartquote-bullet.docx http://localhost:9080/tika/tika | hexdump -C 48 54 54 50 2f 31 2e 31 20 31 30 30 20 43 6f 6e |HTTP/1.1 100 Con| 0010 74 69 6e 75 65 0d 0a 0d 0a 48 54 54 50 2f 31 2e |tinueHTTP/1.| 0020 31 20 32 30 30 20 4f 4b 0d 0a 53 65 72 76 65 72 |1 200 OK..Server| 0030 3a 20 41 70 61 63 68 65 2d 43 6f 79 6f 74 65 2f |: Apache-Coyote/| 0040 31 2e 31 0d 0a 43 6f 6e 74 65 6e 74 2d 54 79 70 |1.1..Content-Typ| 0050 65 3a 20 74 65 78 74 2f 70 6c 61 69 6e 0d 0a 54 |e: text/plain..T| 0060 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e 67 |ransfer-Encoding| 0070 3a 20 63 68 75 6e 6b 65 64 0d 0a 44 61 74 65 3a |: chunked..Date:| 0080 20 46 72 69 2c 20 30 34 20 4d 61 79 20 32 30 31 | Fri, 04 May 201| 0090 32 20 31 39 3a 33 39 3a 35 32 20 47 4d 54 0d 0a |2 19:39:52 GMT..| 00a0 0d 0a 92 0a 95 09 0a |...| 00a7 {code} As you can see, the data (last few bytes) is encoded differently. The Linux server encodes it as UTF-8, while Windows is using something strange, probably Windows-1252, where 0x92 is a curly quote and 0x95 is a bullet point. A client can't know what encoding the server used, because the Content-Type is just text/plain with no encoding. Ideally I would like it to use UTF-8 always, so that the client doesn't have to do extra work to decode it. The attached patch does that, and declares it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
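The two hexdumps differ exactly where the non-ASCII characters sit: UTF-8 encodes the right single quote (U+2019) as e2 80 99 and the bullet (U+2022) as e2 80 a2, while windows-1252 packs them into the single bytes 92 and 95. A small stdlib check of that claim (illustrative, not part of the attached patch):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Encode the two characters from the bug report under both charsets and
// print the resulting bytes in hexdump-style notation.
class CharsetDemo {
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smartQuote = "\u2019"; // right single quotation mark
        String bullet = "\u2022";     // bullet
        Charset cp1252 = Charset.forName("windows-1252");

        // Matches the Linux hexdump above: e2 80 99 / e2 80 a2
        System.out.println(hex(smartQuote, StandardCharsets.UTF_8));
        System.out.println(hex(bullet, StandardCharsets.UTF_8));
        // Matches the Windows hexdump above: 92 / 95
        System.out.println(hex(smartQuote, cp1252));
        System.out.println(hex(bullet, cp1252));
    }
}
```

Declaring {{charset=UTF-8}} in each {{@Produces}} media type, as the patch does, removes the ambiguity for clients.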
[jira] [Closed] (TIKA-613) PDF parser is changing letters positions
[ https://issues.apache.org/jira/browse/TIKA-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-613. Resolution: Fixed Significant PDF updates within Tika and PDFBox since this issue. Can reopen if it's still a problem. PDF parser is changing letters positions Key: TIKA-613 URL: https://issues.apache.org/jira/browse/TIKA-613 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Environment: running Tika inside VB.NET 2010 with IKVM Reporter: Alex Labels: parser, pdf The PDF parser is changing the position of some letters and adding spaces inside the text. For example: Parsed text O fluox de caixa e os ganhos econmô icos referentes à estocagem dos RSD no aterro sanitário Original O fluxo de caixa e os ganhos econômicos referentes à estocagem dos RSD no aterro sanitário I've parsed the same text with iTextSharp and the result was OK. The original pdf file is here: http://www.teses.usp.br/teses/disponiveis/8/8135/tde-04072008-113118/publico/DISSERTACAO_JOSE_EDUARDO_ABBAS.pdf [UPDATE] It looks like the changed-positions problem is solved in the new version (1.0), but there are still some extra spaces in the text: Parsed Os processos econômicos e polít icos causadores da atual forma de geração dos RSD Original Os processos econômicos e políticos causadores da atual forma de geração dos RSD -- This message was sent by Atlassian JIRA (v6.3.4#6332)