[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-817:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

(PPT/PPTX) Missing date/time in text content.
    Key: TIKA-817
    URL: https://issues.apache.org/jira/browse/TIKA-817
    Project: Tika
    Issue Type: Bug
    Components: general
    Affects Versions: 1.0
    Environment: Win7-64 + java version 1.6.0_26
    Reporter: Albert L.
    Fix For: 1.2

Missing date/time text in text content for PPT and PPTX files. The date and time are missing from the text content. This occurs when one chooses the following with MS-PowerPoint 2010:
1) Insert
2) Date Time
3) Update automatically
4) save to PPT or PPTX

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-861) Parse links in PDF
[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-861:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Parse links in PDF
    Key: TIKA-861
    URL: https://issues.apache.org/jira/browse/TIKA-861
    Project: Tika
    Issue Type: New Feature
    Components: parser
    Affects Versions: 1.0
    Reporter: Sasha Goodman
    Priority: Minor
    Labels: links, pdfbox
    Fix For: 1.2
    Original Estimate: 4h
    Remaining Estimate: 4h

Currently the XHTML doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't done Java for 6 years, but someone more experienced could probably do this in a few hours.

The PDF2XHTML class loops through the page annotations. See:
{code:java}
for (Object o : page.getAnnotations()) {   // PDF2XHTML, line 136
{code}
I found some code for dealing with links in annotations: http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking the class.
{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
{code}
I hope this helps someone.
[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding
[ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-868:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

TXT parser does not honour the specified encoding
    Key: TIKA-868
    URL: https://issues.apache.org/jira/browse/TIKA-868
    Project: Tika
    Issue Type: Bug
    Reporter: Daniel Bonniot de Ruisselet
    Fix For: 1.2

With the input text "Indanyl", the encoding is recognized as IBM500, even when UTF-8 is specified explicitly. I would argue that detection should only be used when the declared information is incorrect (saving time and avoiding wrong detection), as proposed by Ken Krugler in TIKA-539.
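The policy argued for in TIKA-868 (trust the declared encoding, and fall back to detection only when the declaration is demonstrably wrong) could be sketched roughly as follows. This is not Tika's implementation: the class name `DeclaredEncodingSketch`, the method `choose`, and the idea of passing the detector's guess in as a `fallback` argument are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Tika.
class DeclaredEncodingSketch {

    /**
     * Returns the declared charset when the bytes decode cleanly with it;
     * otherwise returns the fallback (for example, a detector's guess).
     */
    static Charset choose(byte[] data, Charset declared, Charset fallback) {
        if (declared == null) {
            return fallback;
        }
        try {
            // REPORT makes the decoder throw instead of silently substituting characters.
            declared.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return declared;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] text = "Indanyl".getBytes(StandardCharsets.UTF_8);
        System.out.println(choose(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

A clean decode does not prove the declaration is right (many byte sequences are valid in several encodings), so this only covers the "declared information is incorrect" case in the sense of "cannot possibly be decoded".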
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Some parsers produce non-well-formed XHTML SAX events
    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.2
    Attachments: TIKA-715.patch

With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. Ie, each startElement(foo) is matched by the closing endElement(foo). I only did basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents...

I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (eg closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix.

Failures:
{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)

testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)

testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
java.lang.AssertionError: p inside p
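The matched-tag check that TIKA-715 describes can be illustrated with a small stack-based SAX handler. This is a hedged sketch, not the SafeContentHandler code from the patch: the class names `NestingCheckSketch` and `NestingChecker` and the `check` helper are invented here, the error messages merely imitate those in the failure log, and treating "p inside p" as "p anywhere below an open p" is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the code committed with TIKA-683.
class NestingCheckSketch {

    /** Tracks open elements on a stack and fails on unmatched or misnested tags. */
    static class NestingChecker extends DefaultHandler {
        private final Deque<String> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Assumption: a p may not appear anywhere beneath another open p.
            if ("p".equals(qName) && open.contains("p")) {
                throw new AssertionError("p inside p");
            }
            open.push(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (open.isEmpty()) {
                throw new AssertionError("end tag=" + qName + " with no startElement");
            }
            String expected = open.pop();
            if (!expected.equals(qName)) {
                throw new AssertionError("mismatched elements open=" + expected + " close=" + qName);
            }
        }
    }

    /** Feeds events such as "+table" (start) and "-table" (end); returns "ok" or the failure message. */
    static String check(String... events) {
        NestingChecker checker = new NestingChecker();
        try {
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, null);
                } else {
                    checker.endElement("", name, name);
                }
            }
            return "ok";
        } catch (AssertionError e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("+table", "+tr", "-table"));
    }
}
```

Closing a table without closing its tr, the "relatively minor offense" mentioned in the issue, is exactly the case that the pop-and-compare step in `endElement` catches.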
[jira] [Updated] (TIKA-816) (XLS/XLSX) Improperly formatted date/time in text content.
[ https://issues.apache.org/jira/browse/TIKA-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-816:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

(XLS/XLSX) Improperly formatted date/time in text content.
    Key: TIKA-816
    URL: https://issues.apache.org/jira/browse/TIKA-816
    Project: Tika
    Issue Type: Bug
    Components: general
    Affects Versions: 1.0
    Environment: Win7-64 + java version 1.6.0_26
    Reporter: Albert L.
    Fix For: 1.2

Improperly formatted text content for XLS and XLSX files. The date and time are not formatted as date/time data but rather as floating point numbers. This occurs for cells whose content is =now() or =today().
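For context on the floating point numbers reported in TIKA-816: Excel stores a date/time as a serial number, whole days plus a fractional day, so an unformatted =now() surfaces as a value such as 40909.5. The sketch below shows the conversion only. It assumes the workbook uses the default 1900 date system and a serial of 61 or higher, it uses java.time (Java 8+, newer than the Java 1.6 in the report), and `ExcelSerialDateSketch` and `toDateTime` are hypothetical names rather than anything in Tika or POI.

```java
import java.time.LocalDateTime;

// Hypothetical sketch of the serial-number arithmetic only.
class ExcelSerialDateSketch {

    // Effective day zero of the 1900 date system for serials >= 61.
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    /** Converts an Excel serial value (days, with the time as a day fraction) to a date/time. */
    static LocalDateTime toDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long seconds = Math.round((serial - wholeDays) * 86_400);
        return EPOCH.plusDays(wholeDays).plusSeconds(seconds);
    }

    public static void main(String[] args) {
        System.out.println(toDateTime(40909.5)); // prints 2012-01-01T12:00
    }
}
```

A real fix would also need to know whether a cell is date-formatted at all, since the stored number alone does not say so.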
[jira] [Updated] (TIKA-605) Tika GDAL parser
[ https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-605:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Tika GDAL parser
    Key: TIKA-605
    URL: https://issues.apache.org/jira/browse/TIKA-605
    Project: Tika
    Issue Type: New Feature
    Components: parser
    Environment: indep. of env.
    Reporter: Chris A. Mattmann
    Assignee: Chris A. Mattmann
    Labels: gdal, integration, tika
    Fix For: 1.2
    Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, TIKA-605.Mattmann.092511.patch.txt

Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser around GDAL. See here: http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-819:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Make Option to Exclude Embedded Files' Text for Text Content
    Key: TIKA-819
    URL: https://issues.apache.org/jira/browse/TIKA-819
    Project: Tika
    Issue Type: New Feature
    Components: general
    Affects Versions: 1.0
    Environment: Windows-7 + JDK 1.6 u26
    Reporter: Albert L.
    Fix For: 1.2

It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX.
[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-758:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Address TODOs when we upgrade to next PDFBox release
    Key: TIKA-758
    URL: https://issues.apache.org/jira/browse/TIKA-758
    Project: Tika
    Issue Type: Improvement
    Reporter: Michael McCandless
    Fix For: 1.2

Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox.
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-776:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

ExifTool Embedder
    Key: TIKA-776
    URL: https://issues.apache.org/jira/browse/TIKA-776
    Project: Tika
    Issue Type: New Feature
    Components: metadata
    Affects Versions: 1.0
    Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/)
    Reporter: Ray Gauss II
    Labels: embed, exiftool, patch
    Fix For: 1.2
    Attachments: tika-parsers-exiftool-embed-patch.txt

This patch adds an ExifTool ExternalEmbedder which builds upon the work in issues TIKA-774 and TIKA-775.

In tika-parsers, an ExiftoolExternalEmbedder is added which extends ExternalEmbedder to programmatically create an Embedder that calls the ExifTool command line to embed Tika metadata into a file stream. An ExiftoolExternalEmbedderTest unit test is also added, which embeds several IPTC and XMP fields and then parses the resulting file stream to verify the operation.
[jira] [Updated] (TIKA-820) Locator is unset for HTML parser
[ https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-820:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Locator is unset for HTML parser
    Key: TIKA-820
    URL: https://issues.apache.org/jira/browse/TIKA-820
    Project: Tika
    Issue Type: Bug
    Components: general, parser
    Affects Versions: 1.0
    Reporter: Daniel Bonniot de Ruisselet
    Labels: patch
    Fix For: 1.2
    Attachments: text-locator.patch

The HtmlParser does not call setDocumentLocator(Locator locator) on the user's content handler. Patch and unit test attached.
[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
[ https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-754:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
    Key: TIKA-754
    URL: https://issues.apache.org/jira/browse/TIKA-754
    Project: Tika
    Issue Type: Improvement
    Affects Versions: 0.10, 1.0
    Reporter: Pablo Queixalos
    Priority: Minor
    Fix For: 1.2
    Attachments: TIKA-754.poc.patch

As seen with some parsers (PDF, PPT), some text blocks still contain raw line breaks ('\n') in the XHTML output. A global fix for this could be located in XHTMLContentHandler.characters(...): analyze the given char array and, whenever a '\n' char is encountered, insert a BR element instead.
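The approach proposed in TIKA-754 (scan the char array passed to characters(...) and emit a BR element for each '\n') might look roughly like the sketch below. It is not the attached proof-of-concept patch: `LineBreakSketch`, `BrInsertingHandler`, `Recorder` and `render` are hypothetical names, and the handler is written as a standalone decorator over plain SAX rather than as a change to XHTMLContentHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the TIKA-754 patch.
class LineBreakSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Forwards character data to a delegate, replacing each '\n' with an empty br element. */
    static class BrInsertingHandler extends DefaultHandler {
        private final ContentHandler delegate;

        BrInsertingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int segmentStart = start;
            int end = start + length;
            for (int i = start; i < end; i++) {
                if (ch[i] == '\n') {
                    if (i > segmentStart) {
                        // Flush the text that precedes the line break.
                        delegate.characters(ch, segmentStart, i - segmentStart);
                    }
                    delegate.startElement(XHTML, "br", "br", new AttributesImpl());
                    delegate.endElement(XHTML, "br", "br");
                    segmentStart = i + 1;
                }
            }
            if (end > segmentStart) {
                delegate.characters(ch, segmentStart, end - segmentStart);
            }
        }
    }

    /** Demo-only delegate that renders the events it receives as text. */
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append("/>");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static String render(String text) {
        Recorder recorder = new Recorder();
        char[] chars = text.toCharArray();
        try {
            new BrInsertingHandler(recorder).characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("line one\nline two")); // prints line one<br/>line two
    }
}
```

One thing a global fix would have to decide, which this sketch ignores, is whether newlines inside elements such as pre should be left alone.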
[jira] [Updated] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-775:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Embed Capabilities
    Key: TIKA-775
    URL: https://issues.apache.org/jira/browse/TIKA-775
    Project: Tika
    Issue Type: Improvement
    Components: general, metadata
    Affects Versions: 1.0
    Environment: The default ExternalEmbedder requires that sed be installed.
    Reporter: Ray Gauss II
    Labels: embed, patch
    Fix For: 1.2
    Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt

This patch defines and implements the concept of embedding Tika metadata into a file stream, the reverse of extraction.

In the tika-core project, an interface defining an Embedder and a generic sed ExternalEmbedder implementation meant to be extended or configured are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.

In the tika-parsers project, an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION and then verifies the operation by parsing the resulting stream.
[jira] [Updated] (TIKA-593) Tika network server
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-593:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Tika network server
    Key: TIKA-593
    URL: https://issues.apache.org/jira/browse/TIKA-593
    Project: Tika
    Issue Type: New Feature
    Components: general
    Affects Versions: 0.10
    Reporter: Jukka Zitting
    Assignee: Chris A. Mattmann
    Fix For: 1.2
    Attachments: TIKA-593_pom.diff

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container.

I'd like to be able to set up and run such a server like this:

    $ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar on its classpath.
[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects
[ https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-859:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

DublinCore Metadata Keys Should be Prefixed and Property Objects
    Key: TIKA-859
    URL: https://issues.apache.org/jira/browse/TIKA-859
    Project: Tika
    Issue Type: Improvement
    Components: metadata
    Affects Versions: 1.1
    Reporter: Ray Gauss II
    Fix For: 1.2
    Attachments: dublincore-prefixed-patch.diff

To help avoid collisions between key names in the interfaces that Metadata implements, and to allow a more precise definition of DublinCore, the keys should be defined as Property objects whose object name and name attribute contain a prefix, and the existing String keys should be deprecated, i.e.
{code:title=DublinCore.java}
String SUBJECT = "subject";
{code}
would become:
{code:title=DublinCore.java}
@Deprecated
String SUBJECT = "subject";

Property DC_SUBJECT = Property.internalTextBag(PREFIX_DC + PREFIX_DELIMITER + "subject");
{code}
Since the use of the simpler key definition is desired eventually, at some point in the future, perhaps 2.0, these prefixed definitions could themselves be deprecated and the move made back to the simpler names.
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-774:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

ExifTool Parser
    Key: TIKA-774
    URL: https://issues.apache.org/jira/browse/TIKA-774
    Project: Tika
    Issue Type: New Feature
    Components: parser
    Affects Versions: 1.0
    Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/)
    Reporter: Ray Gauss II
    Labels: features, newbie, patch
    Fix For: 1.2
    Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt

Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.

In the core project:
- An ExifTool interface is added which contains Property objects that define the metadata fields available.
- An additional Property constructor for the internalTextBag type is added.

In the parsers project:
- An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those have not been changed at this time.
- An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
- An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields if enabled.
- An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files.
- An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Updated] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library
[ https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-842:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

IPTC Properties Should be Defined Completely and Independently of the Drew Library
    Key: TIKA-842
    URL: https://issues.apache.org/jira/browse/TIKA-842
    Project: Tika
    Issue Type: Improvement
    Components: metadata
    Affects Versions: 1.0
    Reporter: Ray Gauss II
    Fix For: 1.2
    Attachments: IPTC-metadata-def-patch.diff, iptc-dublincore-aliased-patch.diff, metadata-remove-iptc-patch.diff

All of the IPTC XMP specification should be defined in tika-core and should not be reliant on the Drew Noakes library, as it is incomplete in its support of the standard and the properties are not defined in proper namespaces or prefixed.
buildbot failure in ASF Buildbot on tika-trunk
The Buildbot has detected a new failure on builder tika-trunk while building ASF Buildbot.
Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/751

Buildbot URL: http://ci.apache.org/
Buildslave for this Build: isis_ubuntu
Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1297992
Blamelist: mattmann

BUILD FAILED: failed svn

sincerely,
-The Buildbot
[jira] [Created] (TIKA-869) IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
    Key: TIKA-869
    URL: https://issues.apache.org/jira/browse/TIKA-869
    Project: Tika
    Issue Type: Bug
    Reporter: Ken Krugler
    Assignee: Ken Krugler

Currently IdentityHtmlMapper.mapSafeElement(String name) just returns name as-is. This makes the XHTMLContentHandler think that it hasn't received a body tag, since it assumes input is lower-cased. So you get output that looks like:

    <body><BODY/></body></html>

The solution is a trivial change to lower-case the incoming name, the same as what the mapSafeAttribute() method is already doing.
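The "trivial change" described for TIKA-869 amounts to normalising the element name before returning it. The snippet below is an illustrative sketch, not the contents of TIKA-869.patch: `LowerCaseMapperSketch` is a hypothetical stand-in for the mapper class, and the use of a fixed `Locale.ENGLISH` is a choice made here rather than something the issue specifies.

```java
import java.util.Locale;

// Hypothetical stand-in for the mapper, for illustration only.
class LowerCaseMapperSketch {

    /** Returns the incoming element name lower-cased, independent of the default locale. */
    static String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("BODY")); // prints body
    }
}
```

Passing an explicit locale matters because the no-argument toLowerCase() uses the JVM default: under a Turkish locale, "TITLE" lower-cases to "tıtle" with a dotless i, which would no longer match the expected element name.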
[jira] [Updated] (TIKA-869) IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
[ https://issues.apache.org/jira/browse/TIKA-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-869:
    Attachment: TIKA-869.patch
[jira] [Created] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call
Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call
    Key: TIKA-870
    URL: https://issues.apache.org/jira/browse/TIKA-870
    Project: Tika
    Issue Type: Improvement
    Reporter: Shay Banon

It would be great to be able to call parseToString with an additional maxStringLength parameter, instead of having to set it on the Tika instance. This allows setting it per parse call. Sample code:
{code}
public String parseToString(InputStream stream, Metadata metadata, int maxStringLength)
        throws IOException, TikaException {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxStringLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(stream, new BodyContentHandler(handler), metadata, context);
    } catch (SAXException e) {
        if (!handler.isWriteLimitReached(e)) {
            // This should never happen with BodyContentHandler...
            throw new TikaException("Unexpected SAX processing failure", e);
        }
    } finally {
        stream.close();
    }
    return handler.toString();
}
{code}
[jira] [Assigned] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call
[ https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-870:
    Assignee: Michael McCandless
[jira] [Commented] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call
[ https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224643#comment-13224643 ]

Michael McCandless commented on TIKA-870:

I think this makes sense.
[jira] [Updated] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call
[ https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-870:
    Attachment: TIKA-870.patch

Patch, with the sample code plus a test case.

The test case failed at first! Ie, the returned string was over the specified limit... I dug and discovered WriteOutContentHandler wasn't overriding/counting ignorableWhitespace, so I added that override and now the test passes.

I think it's ready...
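The bug found while testing TIKA-870 (a write limit that counts characters() but not ignorableWhitespace()) can be illustrated with a minimal limited handler that routes both callbacks through one counting method. This is a sketch under stated assumptions, not Tika's WriteOutContentHandler: `WriteLimitSketch`, `LimitedTextHandler`, `LimitReachedException` and `feed` are names invented for the example.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's WriteOutContentHandler.
class WriteLimitSketch {

    /** Signals that the configured limit was hit; a stand-in for Tika's own marker exception. */
    static class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("write limit reached");
        }
    }

    /** Collects text up to a limit, counting ignorable whitespace as well as characters. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            write(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            // Without this override, whitespace would bypass the limit entirely.
            write(ch, start, length);
        }

        private void write(char[] ch, int start, int length) throws SAXException {
            int remaining = limit - out.length();
            if (length > remaining) {
                out.append(ch, start, remaining);
                throw new LimitReachedException();
            }
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Sends text, then whitespace, to a handler with the given limit and returns what was kept. */
    static String feed(int limit, String text, String whitespace) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            handler.characters(text.toCharArray(), 0, text.length());
            handler.ignorableWhitespace(whitespace.toCharArray(), 0, whitespace.length());
        } catch (LimitReachedException e) {
            // Expected once the limit is hit; the truncated text is still returned.
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(feed(5, "abc", "    ").length()); // prints 5
    }
}
```

A test like the one mentioned in the comment, asserting that the returned string never exceeds the requested length, is what exposes the missing override.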
[VOTE] Apache Tika 1.1 release rc #1
Hi Folks,

A candidate for the Tika 1.1 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.1/rc1/

The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.1/

The SHA1 checksum of the archive is d3185bb22fa3c7318488838989aff0cc9ee025df.

Please vote on releasing this package as Apache Tika 1.1. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.1
[ ] -1 Do not release this package because...

Thanks!

Cheers,
Chris

P.S. Here's my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
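For anyone checking the archive before voting, the SHA1 comparison can be done with the JDK alone. This is a rough sketch, not part of the release tooling: `Sha1Check` and `sha1Hex` are made-up names, and the caller supplies the stream for whichever file was downloaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for comparing a download against a published SHA1 checksum. */
class Sha1Check {
    /** Reads the stream to the end and returns its SHA-1 digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Opening the downloaded archive with `java.nio.file.Files.newInputStream` and comparing the returned string (case-insensitively) with the checksum quoted in the vote email is enough to confirm the download matches.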
[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects
[ https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-859:

Attachment: dublincore-prefixed-and-updated-references-parsers-patch
            dublincore-prefixed-and-updated-references-core-patch

Patches for core and parsers which deprecate the existing DublinCore String metadata names and add prefixed metadata Property objects, as the last patch here did, but which also update all references to the now deprecated metadata names to their Property counterparts and add a few convenience methods in Metadata for working with Property objects as keys.

DublinCore Metadata Keys Should be Prefixed and Property Objects

Key: TIKA-859
URL: https://issues.apache.org/jira/browse/TIKA-859
Project: Tika
Issue Type: Improvement
Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
Fix For: 1.2
Attachments: dublincore-prefixed-and-updated-references-core-patch, dublincore-prefixed-and-updated-references-parsers-patch

To help avoid collisions of key names in the interfaces Metadata implements, and to allow for a more precise definition of DublinCore, the keys should be defined as Property objects, with the object name and name attribute containing a prefix, and the existing String keys deprecated, i.e.

{code:title=DublinCore.java}
String SUBJECT = "subject";
{code}

would become:

{code:title=DublinCore.java}
@Deprecated
String SUBJECT = "subject";

Property DC_SUBJECT = Property.internalTextBag(PREFIX_DC + PREFIX_DELIMITER + "subject");
{code}

Since the use of the simpler key definition is desired eventually, at some point in the future, perhaps 2.0, these prefixed definitions could themselves be deprecated and the move made back to the simpler names.
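The collision the issue describes can be sketched with plain strings. This is an illustration only: it does not use Tika's Property type, `PrefixedKeySketch` is a made-up class, and the `dc` and `:` values are assumptions, since the issue names the constants PREFIX_DC and PREFIX_DELIMITER without showing their values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of prefixed versus unprefixed metadata keys. */
class PrefixedKeySketch {
    // Assumed values; the issue names these constants but does not show them.
    static final String PREFIX_DC = "dc";
    static final String PREFIX_DELIMITER = ":";

    // Unprefixed key, as in the existing definition that would be deprecated.
    static final String SUBJECT = "subject";
    // Prefixed key, built the same way as DC_SUBJECT in the issue.
    static final String DC_SUBJECT = PREFIX_DC + PREFIX_DELIMITER + "subject";

    /** Stores a subject from another schema and a Dublin Core subject side by side. */
    static Map<String, String> example() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(SUBJECT, "value from some other schema");
        metadata.put(DC_SUBJECT, "value from Dublin Core");
        return metadata;
    }
}
```

With the prefix in place, both entries survive in the same map; with two unprefixed `subject` keys the second put would overwrite the first.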
[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects
[ https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-859:

Attachment: (was: dublincore-prefixed-patch.diff)

DublinCore Metadata Keys Should be Prefixed and Property Objects

Key: TIKA-859
URL: https://issues.apache.org/jira/browse/TIKA-859
Project: Tika
Issue Type: Improvement
Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
Fix For: 1.2
Attachments: dublincore-prefixed-and-updated-references-core-patch, dublincore-prefixed-and-updated-references-parsers-patch

To help avoid collisions of key names in the interfaces Metadata implements, and to allow for a more precise definition of DublinCore, the keys should be defined as Property objects, with the object name and name attribute containing a prefix, and the existing String keys deprecated, i.e.

{code:title=DublinCore.java}
String SUBJECT = "subject";
{code}

would become:

{code:title=DublinCore.java}
@Deprecated
String SUBJECT = "subject";

Property DC_SUBJECT = Property.internalTextBag(PREFIX_DC + PREFIX_DELIMITER + "subject");
{code}

Since the use of the simpler key definition is desired eventually, at some point in the future, perhaps 2.0, these prefixed definitions could themselves be deprecated and the move made back to the simpler names.
Re: [VOTE] Apache Tika 1.1 release rc #1
Hi guys,

Congrats on the v1.1 rc1. It compiles fine for me (OSX Lion 10.7.3 + OSX Snow Leopard 10.6.8). All tests passed.

+1

Regards,
Zabrane

On Mar 7, 2012, at 10:35 PM, Mattmann, Chris A (388J) wrote:

> Hi Folks,
>
> A candidate for the Tika 1.1 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.1/rc1/
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.1/
>
> The SHA1 checksum of the archive is d3185bb22fa3c7318488838989aff0cc9ee025df.
>
> Please vote on releasing this package as Apache Tika 1.1. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.1
> [ ] -1 Do not release this package because...
>
> Thanks!
>
> Cheers,
> Chris
>
> P.S. Here's my +1.
Re: [VOTE] Apache Tika 1.1 release rc #1
Hi Chris,

On Mar 7, 2012, at 1:35pm, Mattmann, Chris A (388J) wrote:

> Hi Folks,
>
> A candidate for the Tika 1.1 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.1/rc1/

I'm curious why you've got just the tika-app-1.1.jar (plus release sources), and not any of the other artifacts? I was hoping to grab the jars, do a manual mvn install onto my Mac, and then try them out with some web crawling code. I can of course build from source, but it seems like that adds another potential delta between the artifacts that get released and what I'm testing.

Thanks,

-- Ken

> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.1/
>
> The SHA1 checksum of the archive is d3185bb22fa3c7318488838989aff0cc9ee025df.
>
> Please vote on releasing this package as Apache Tika 1.1. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.1
> [ ] -1 Do not release this package because...
>
> Thanks!
>
> Cheers,
> Chris
>
> P.S. Here's my +1.

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions training
Hadoop, Cascading, Mahout Solr
Re: [VOTE] Apache Tika 1.1 release rc #1
Hey Ken,

Sorry about that! Forgot to include the link to the staged Maven2 repo, here:

https://repository.apache.org/content/repositories/orgapachetika-066/

There ya go.

Cheers,
Chris

On Mar 7, 2012, at 4:36 PM, Ken Krugler wrote:

> Hi Chris,
>
> On Mar 7, 2012, at 1:35pm, Mattmann, Chris A (388J) wrote:
>
>> Hi Folks,
>>
>> A candidate for the Tika 1.1 release is available at:
>>
>> http://people.apache.org/~mattmann/apache-tika-1.1/rc1/
>
> I'm curious why you've got just the tika-app-1.1.jar (plus release sources), and not any of the other artifacts? I was hoping to grab the jars, do a manual mvn install onto my Mac, and then try them out with some web crawling code. I can of course build from source, but it seems like that adds another potential delta between the artifacts that get released and what I'm testing.
>
> Thanks,
>
> -- Ken
>
>> The release candidate is a zip archive of the sources in:
>>
>> http://svn.apache.org/repos/asf/tika/tags/1.1/
>>
>> The SHA1 checksum of the archive is d3185bb22fa3c7318488838989aff0cc9ee025df.
>>
>> Please vote on releasing this package as Apache Tika 1.1. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.1
>> [ ] -1 Do not release this package because...
>>
>> Thanks!
>>
>> Cheers,
>> Chris
>>
>> P.S. Here's my +1.