date:20120307

[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-817:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 (PPT/PPTX) Missing date/time in text content.
 -

 Key: TIKA-817
 URL: https://issues.apache.org/jira/browse/TIKA-817
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.2


 Missing date/time text in text content for PPT and PPTX files.
 The date and time are missing from the text content.  This occurs when one 
 chooses the following with MS-PowerPoint 2010:
 1) Insert
 2) Date & Time
 3) Update automatically
 4) save to PPT or PPTX

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-861) Parse links in PDF

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-861:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Parse links in PDF
 --

 Key: TIKA-861
 URL: https://issues.apache.org/jira/browse/TIKA-861
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
Reporter: Sasha Goodman
Priority: Minor
  Labels: links, pdfbox
 Fix For: 1.2

   Original Estimate: 4h
  Remaining Estimate: 4h

 Currently the XHTML doesn't contain links, although PDFBox parses them. I'm 
 new to Tika and haven't done java for 6 years, but someone more experienced 
 could probably do this in a few hours. 
 The PDF2XHTML method loops through the annotations. 
 See: 
 {code:java}
 136: for(Object o : page.getAnnotations()) {
 {code}
  I found some code for dealing with links in annotations:
 http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
 It involves checking the class. 
 {code:java}
 if (annotation instanceof PDAnnotationLink) {
     PDAnnotationLink link = (PDAnnotationLink) annotation;
     // ... handle the link here
 }
 {code}
 I hope this helps someone.


[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-868:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 TXT parser does not honour the specified encoding
 -

 Key: TIKA-868
 URL: https://issues.apache.org/jira/browse/TIKA-868
 Project: Tika
  Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.2


 With the input text "Indanyl", the encoding is detected as IBM500, even when 
 UTF-8 is specified explicitly.
 I would argue that detection should only be used when the declared 
 information is incorrect (saving time and avoiding wrong detection), as 
 proposed by Ken Krugler in TIKA-539.
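
 The precedence argued for here can be sketched as follows: trust an explicitly declared charset when it names a supported encoding, and fall back to the detector's answer only otherwise. This is a minimal illustration and not Tika's TXT parser code; EncodingChoice, chooseCharset and the detected argument (standing in for whatever a detector returned) are hypothetical names, and treating "unsupported or malformed name" as the only kind of incorrect declaration is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingChoice {
    // Hypothetical helper: prefer the declared encoding, use the detected one only as a fallback.
    static Charset chooseCharset(String declared, Charset detected) {
        if (declared != null) {
            try {
                return Charset.forName(declared.trim());
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: fall through to detection.
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        Charset detected = StandardCharsets.ISO_8859_1; // pretend detector result
        System.out.println(chooseCharset("UTF-8", detected));
        System.out.println(chooseCharset("no-such-charset", detected));
        System.out.println(chooseCharset(null, detected));
    }
}
```

 With this ordering the detector never overrides a usable declaration, which also avoids the cost of running detection at all in the common case.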


[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.2

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  I.e., each startElement(foo) is
 matched by the closing endElement(foo).
 I only did a basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (e.g. closing a <table>
 w/o closing the <tr>) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
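
 The kind of check described here can be sketched with a stack of open element names. This is an illustration only and not the code committed to SafeContentHandler; the class name TagNestingChecker and its methods are hypothetical, and the error messages simply mirror the wording of the assertion failures reported in this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TagNestingChecker {
    // Names of the elements that have been started but not yet ended, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        // A p element may not appear anywhere inside another p element.
        if (name.equals("p") && open.contains("p")) {
            throw new AssertionError("p inside p");
        }
        open.push(name);
    }

    void endElement(String name) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + name + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(name)) {
            throw new AssertionError("mismatched elements open=" + expected + " close=" + name);
        }
    }

    // True when every started element has been closed.
    boolean isBalanced() {
        return open.isEmpty();
    }
}
```

 Checking that elements only appear in valid parents, as suggested above, would need a table of allowed parent/child pairs on top of this stack.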
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec   ERROR!
 java.lang.AssertionError: p inside p

[jira] [Updated] (TIKA-816) (XLS/XLSX) Improperly formatted date/time in text content.

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-816:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 (XLS/XLSX) Improperly formatted date/time in text content.
 --

 Key: TIKA-816
 URL: https://issues.apache.org/jira/browse/TIKA-816
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.2


 Improperly formatted text content for XLS and XLSX files.
 The date and time are not formatted as date/time data but rather floating 
 point numbers.  This occurs for cells with the content as =now() or 
 =today().
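
 For context, the floating-point numbers are how spreadsheets store dates: a day count (with the time of day as the fractional part) from an epoch. A minimal conversion sketch follows, assuming the workbook uses the 1900 date system and the value falls on or after 1 March 1900; ExcelSerialDate and toDateTime are hypothetical names, not Tika or POI API, and java.time needs Java 8 or later.

```java
import java.time.LocalDateTime;

final class ExcelSerialDate {
    // Day zero that makes 1900-system serials line up for dates from 1 March 1900 onward.
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    // Converts a serial such as 40975.5 to a date-time, rounding to the nearest second.
    static LocalDateTime toDateTime(double serial) {
        long seconds = Math.round(serial * 86_400d);
        return EPOCH.plusSeconds(seconds);
    }

    public static void main(String[] args) {
        // 40975 is 7 March 2012; the .5 adds twelve hours.
        System.out.println(toDateTime(40975.5));
    }
}
```

 A parser would still have to consult the cell's number format to know that a given numeric cell is meant as a date at all.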


[jira] [Updated] (TIKA-605) Tika GDAL parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Tika GDAL parser
 

 Key: TIKA-605
 URL: https://issues.apache.org/jira/browse/TIKA-605
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gdal, integration, tika
 Fix For: 1.2

 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
 TIKA-605.Mattmann.092511.patch.txt


 Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
 around GDAL. See here: 
 http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java


[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.2


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.


[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-758:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.2


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.


[jira] [Updated] (TIKA-776) ExifTool Embedder

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.2

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
 issues TIKA-774 and TIKA-775.
 In tika-parsers, an ExiftoolExternalEmbedder is added. It extends 
 ExternalEmbedder to programmatically create an Embedder which calls the 
 ExifTool command line to embed Tika metadata into a file stream. An 
 ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
 and XMP fields, then parses the resulting file stream to verify the operation.


[jira] [Updated] (TIKA-820) Locator is unset for HTML parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-820:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Locator is unset for HTML parser
 

 Key: TIKA-820
 URL: https://issues.apache.org/jira/browse/TIKA-820
 Project: Tika
  Issue Type: Bug
  Components: general, parser
Affects Versions: 1.0
Reporter: Daniel Bonniot de Ruisselet
  Labels: patch
 Fix For: 1.2

 Attachments: text-locator.patch


 The HtmlParser does not call setDocumentLocator(Locator locator) on the 
 user's content handler.
 Patch and unit test attached.
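
 For reference, this is what a content handler relies on when a parser honours the Locator contract. The sketch below uses only the JDK's own SAX parser (not Tika's HtmlParser) to show the callback being delivered; LocatorDemo and RecordingHandler are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

final class LocatorDemo {
    static final class RecordingHandler extends DefaultHandler {
        Locator locator;            // stays null if the parser never supplies one
        int firstElementLine = -1;  // line reported when the first element starts

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (firstElementLine < 0 && locator != null) {
                firstElementLine = locator.getLineNumber();
            }
        }
    }

    // Parses the given XML with the JDK SAX parser and returns the handler for inspection.
    static RecordingHandler parse(String xml) {
        RecordingHandler handler = new RecordingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        RecordingHandler h = parse("<html>\n<body/>\n</html>");
        System.out.println("locator set: " + (h.locator != null));
        System.out.println("first element reported at line " + h.firstElementLine);
    }
}
```

 A handler written like this silently loses position information when the parser skips setDocumentLocator, which is the situation this issue reports for HtmlParser.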


[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Automatic line break insertion (BR element) instead of '\n' in 
 XHTMLContentHandler
 --

 Key: TIKA-754
 URL: https://issues.apache.org/jira/browse/TIKA-754
 Project: Tika
  Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
 Fix For: 1.2

 Attachments: TIKA-754.poc.patch


 As seen with some parsers (PDF, PPT), some text blocks still contain newline 
 characters ('\n') in the output XHTML. 
 A global fix for this could be located in XHTMLContentHandler.characters(...): 
 scan the given char array and, whenever a '\n' char is encountered, emit a 
 BR element instead.
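
 The proposal can be sketched as a SAX filter that splits each characters(...) call at '\n' and emits an empty br element in place of the newline. This is an illustration built on the JDK's XMLFilterImpl, not the attached patch and not XHTMLContentHandler itself; LineBreakFilter and its render helper are hypothetical names.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class LineBreakFilter extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    LineBreakFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int segmentStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                // Flush the text before the newline, then emit <br/> instead of the '\n'.
                if (i > segmentStart) {
                    super.characters(ch, segmentStart, i - segmentStart);
                }
                super.startElement(XHTML, "br", "br", new AttributesImpl());
                super.endElement(XHTML, "br", "br");
                segmentStart = i + 1;
            }
        }
        if (end > segmentStart) {
            super.characters(ch, segmentStart, end - segmentStart);
        }
    }

    // Demo helper (hypothetical): renders the events produced for the text as a string.
    static String render(String text) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler sink = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append('<').append(qName).append("/>");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            char[] chars = text.toCharArray();
            new LineBreakFilter(sink).characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

 One design question the sketch leaves open is whether newlines inside elements such as pre should be exempt from the rewrite.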


[jira] [Updated] (TIKA-775) Embed Capabilities

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-775:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Embed Capabilities
 --

 Key: TIKA-775
 URL: https://issues.apache.org/jira/browse/TIKA-775
 Project: Tika
  Issue Type: Improvement
  Components: general, metadata
Affects Versions: 1.0
 Environment: The default ExternalEmbedder requires that sed be 
 installed.
Reporter: Ray Gauss II
  Labels: embed, patch
 Fix For: 1.2

 Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt


 This patch defines and implements the concept of embedding tika metadata into 
 a file stream, the reverse of extraction.
 In the tika-core project an interface defining an Embedder and a generic sed 
 ExternalEmbedder implementation meant to be extended or configured are added. 
  These classes are essentially a reverse flow of the existing Parser and 
 ExternalParser classes.
 In the tika-parsers project an ExternalEmbedderTest unit test is added which 
 uses the default ExternalEmbedder (calls sed) to embed a value placed in 
 Metadata.DESCRIPTION then verify the operation by parsing the resulting 
 stream.


[jira] [Updated] (TIKA-593) Tika network server

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-593:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Tika network server
 ---

 Key: TIKA-593
 URL: https://issues.apache.org/jira/browse/TIKA-593
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
 Fix For: 1.2

 Attachments: TIKA-593_pom.diff


 It would be cool to be able to run Tika as a network service that accepts a 
 binary document as input and produces the extracted content (as XHTML, text, 
 or just metadata) as output. A bit like TIKA-169, but without the dependency 
 on a servlet container.
 I'd like to be able to set up and run such a server like this:
 $ java -jar tika-app.jar --port 1234
 We should also add a NetworkParser class that acts as a local client for such 
 a service. This way a lightweight client could use the full set of Tika 
 parsing functionality even with just the tika-core jar within its classpath.


[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-859:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 DublinCore Metadata Keys Should be Prefixed and Property Objects
 

 Key: TIKA-859
 URL: https://issues.apache.org/jira/browse/TIKA-859
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
 Fix For: 1.2

 Attachments: dublincore-prefixed-patch.diff


 To help avoid collisions between key names in the interfaces that Metadata 
 implements, and to allow a more precise definition of DublinCore, the keys 
 should be defined as Property objects whose object name and name attribute 
 contain a prefix, with the existing String keys deprecated. For example:
 {code:title=DublinCore.java}
 String SUBJECT = "subject";
 {code}
 would become:
 {code:title=DublinCore.java}
 @Deprecated
 String SUBJECT = "subject";
 Property DC_SUBJECT = Property.internalTextBag(PREFIX_DC + PREFIX_DELIMITER + "subject");
 {code}
 Since the use of the simpler key definition is desired eventually, at some 
 point in the future, perhaps 2.0, these prefixed definitions could themselves 
 be deprecated and the move made back to the simpler names.


[jira] [Updated] (TIKA-774) ExifTool Parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-774:
---

Fix Version/s: (was: 1.1)
1.2

- push out to 1.2

ExifTool Parser
---

Key: TIKA-774
URL: https://issues.apache.org/jira/browse/TIKA-774
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Environment: Requires ExifTool be installed
(http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
Labels: features, newbie, patch,
Fix For: 1.2

Attachments: testJPEG_IPTC_EXT.jpg,
tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt

Adds an external parser that calls ExifTool to extract extended metadata
fields from images and other content types.
In the core project:
- An ExifTool interface is added which contains Property objects that define
  the metadata fields available.
- An additional Property constructor for internalTextBag type.
In the parsers project:
- An ExiftoolMetadataExtractor is added which does the work of calling ExifTool
  on the command line and mapping the response to tika metadata fields. This
  extractor could be called instead of or in addition to the existing
  ImageMetadataExtractor and JempboxExtractor under TiffParser and/or
  JpegParser but those have not been changed at this time.
- An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
- An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool
  metadata fields to existing tika and Drew Noakes metadata fields if enabled.
- An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag
  implementations in XML files.
- An ExifToolParserTest is added which tests several expected XMP and IPTC
  metadata values in testJPEG_IPTC_EXT.jpg.


[jira] [Updated] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-842:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 IPTC Properties Should be Defined Completely and Independently of the Drew 
 Library
 --

 Key: TIKA-842
 URL: https://issues.apache.org/jira/browse/TIKA-842
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.0
Reporter: Ray Gauss II
 Fix For: 1.2

 Attachments: IPTC-metadata-def-patch.diff, 
 iptc-dublincore-aliased-patch.diff, metadata-remove-iptc-patch.diff


 All of the IPTC XMP specification should be defined in tika-core and should 
 not be reliant on the Drew Noakes library as it is incomplete in its support 
 of the standard and the properties are not defined in proper namespaces or 
 prefixed.


buildbot failure in ASF Buildbot on tika-trunk

2012-03-07 Thread buildbot

The Buildbot has detected a new failure on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/751

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: isis_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1297992
Blamelist: mattmann

BUILD FAILED: failed svn

sincerely,
 -The Buildbot

[jira] [Created] (TIKA-869) IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name

2012-03-07 Thread Ken Krugler (Created) (JIRA)

IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
-

 Key: TIKA-869
 URL: https://issues.apache.org/jira/browse/TIKA-869
 Project: Tika
  Issue Type: Bug
Reporter: Ken Krugler
Assignee: Ken Krugler


Currently IdentityHtmlMapper.mapSafeElement(String name) just returns name 
as-is. This makes the XHTMLContentHandler think that it hasn't received a 
body tag, since it assumes input is lower-cased. So you get output that looks 
like:

<body><BODY/></body></html>

The solution is a trivial change to lower-case the incoming name, the same as 
what the mapSafeAttribute() method is already doing.
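
A sketch of the change described above, written as a standalone class so it runs without Tika on the classpath. LowerCasingMapper is a hypothetical stand-in for IdentityHtmlMapper, and the use of Locale.ENGLISH is an assumption, chosen so that names such as TITLE lower-case the same way under every default locale (a Turkish default locale would otherwise map I to a dotless i).

```java
import java.util.Locale;

final class LowerCasingMapper {
    // Hypothetical stand-in for the proposed mapSafeElement behaviour:
    // return the incoming element name lower-cased instead of as-is.
    static String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("BODY"));
        System.out.println(mapSafeElement("Title"));
    }
}
```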


[jira] [Updated] (TIKA-869) IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name

2012-03-07 Thread Ken Krugler (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Krugler updated TIKA-869:
-

Attachment: TIKA-869.patch

 IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
 -

 Key: TIKA-869
 URL: https://issues.apache.org/jira/browse/TIKA-869
 Project: Tika
  Issue Type: Bug
Reporter: Ken Krugler
Assignee: Ken Krugler
 Attachments: TIKA-869.patch




[jira] [Created] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Shay Banon (Created) (JIRA)

Allow to use call parseToString with a additional parameter of MaxStringLength, 
so it can be changed per call
-

 Key: TIKA-870
 URL: https://issues.apache.org/jira/browse/TIKA-870
 Project: Tika
  Issue Type: Improvement
Reporter: Shay Banon


It would be great to be able to call parseToString with an additional parameter 
for the maxStringLength, instead of having to set it on the Tika instance. This 
allows setting it per parse call. Sample code:

{code}
public String parseToString(InputStream stream, Metadata metadata, int maxStringLength)
        throws IOException, TikaException {
    // Handler that stops accepting text once maxStringLength characters are written
    WriteOutContentHandler handler =
            new WriteOutContentHandler(maxStringLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(
                stream, new BodyContentHandler(handler), metadata, context);
    } catch (SAXException e) {
        if (!handler.isWriteLimitReached(e)) {
            // This should never happen with BodyContentHandler...
            throw new TikaException("Unexpected SAX processing failure", e);
        }
    } finally {
        stream.close();
    }
    return handler.toString();
}
{code}


[jira] [Assigned] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Michael McCandless (Assigned) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned TIKA-870:
---

Assignee: Michael McCandless

 Allow to use call parseToString with a additional parameter of 
 MaxStringLength, so it can be changed per call
 -

 Key: TIKA-870
 URL: https://issues.apache.org/jira/browse/TIKA-870
 Project: Tika
  Issue Type: Improvement
Reporter: Shay Banon
Assignee: Michael McCandless



[jira] [Commented] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Michael McCandless (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224643#comment-13224643
 ] 

Michael McCandless commented on TIKA-870:
-

I think this makes sense.


[jira] [Updated] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Michael McCandless (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-870:


Attachment: TIKA-870.patch

Patch, with the sample code plus a test case.

The test case failed at first! I.e., the returned string was over the specified 
limit... I dug in and discovered that WriteOutContentHandler wasn't 
overriding/counting ignorableWhitespace, so I added that override and now the 
test passes.

I think it's ready...
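
The whitespace problem described above can be illustrated with a minimal sketch. This is not the code from TIKA-870.patch and not Tika's WriteOutContentHandler; LimitedTextHandler and its methods are hypothetical names, built only on the JDK's org.xml.sax classes. The assumption is that both characters() and ignorableWhitespace() events end up in the output, so both have to be counted against the limit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical write-limited SAX handler (an illustration only, not Tika's
 * WriteOutContentHandler). It collects text up to a fixed limit.
 */
public class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int limit; // assumed to be non-negative
    private boolean limitReached = false;

    public LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        append(ch, start, length);
    }

    // If this override were missing, whitespace events would not be counted
    // against the limit, which is the kind of gap the comment describes.
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        append(ch, start, length);
    }

    // Appends as much as still fits, then signals that the limit was hit.
    private void append(char[] ch, int start, int length) throws SAXException {
        int remaining = limit - text.length();
        if (length > remaining) {
            text.append(ch, start, remaining);
            limitReached = true;
            throw new SAXException("Write limit of " + limit + " characters reached");
        }
        text.append(ch, start, length);
    }

    public boolean isLimitReached() {
        return limitReached;
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(5);
        try {
            handler.characters("abc".toCharArray(), 0, 3);
            handler.ignorableWhitespace("    ".toCharArray(), 0, 4);
        } catch (SAXException e) {
            // Expected here: the whitespace event pushes the text past the limit.
        }
        System.out.println("length=" + handler.toString().length()
                + " limitReached=" + handler.isLimitReached());
    }
}
```

Routing both callbacks through one private append method keeps the counting in a single place, so a newly overridden event type cannot bypass the limit by accident.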


[VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Mattmann, Chris A (388J)

Hi Folks,

A candidate for the Tika 1.1 release is available at:

  http://people.apache.org/~mattmann/apache-tika-1.1/rc1/

The release candidate is a zip archive of the sources in:

   http://svn.apache.org/repos/asf/tika/tags/1.1/

The SHA1 checksum of the archive is d3185bb22fa3c7318488838989aff0cc9ee025df.
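
For anyone who wants to check the archive before voting, here is a minimal sketch of computing a SHA-1 digest with the JDK's MessageDigest. The class name Sha1Check and the command-line arguments are assumptions for illustration; the local file name of the downloaded archive is not given in this message and has to be supplied by the reader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for comparing a file's SHA-1 digest with a published value. */
public class Sha1Check {

    /** Returns the lowercase hex SHA-1 digest of the given file. */
    public static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            // Stream the file so large archives do not need to fit in memory.
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (supplied by the reader)
        // args[1]: expected checksum, e.g. the value quoted in the vote email
        if (args.length != 2) {
            System.err.println("usage: java Sha1Check <archive> <expected-sha1>");
            return;
        }
        String actual = sha1Hex(Paths.get(args[0]));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK"
                : "checksum MISMATCH: " + actual);
    }
}
```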

Please vote on releasing this package as Apache Tika 1.1.
The vote is open for at least the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

   [ ] +1 Release this package as Apache Tika 1.1
   [ ] -1 Do not release this package because...

Thanks!

Cheers,
Chris

P.S. Here's my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects

2012-03-07 Thread Ray Gauss II (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-859:
--

Attachment: dublincore-prefixed-and-updated-references-parsers-patch
dublincore-prefixed-and-updated-references-core-patch

Patches for core and parsers that deprecate the existing DublinCore String 
metadata names and add prefixed metadata Property objects, as the last patch 
here did. They also update all references to the now-deprecated metadata names 
to their Property counterparts and add a few convenience methods in Metadata 
for working with Property objects as keys.

 DublinCore Metadata Keys Should be Prefixed and Property Objects
 

 Key: TIKA-859
 URL: https://issues.apache.org/jira/browse/TIKA-859
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
 Fix For: 1.2

 Attachments: dublincore-prefixed-and-updated-references-core-patch, 
 dublincore-prefixed-and-updated-references-parsers-patch


 To help avoid collisions between key names in the interfaces Metadata 
 implements, and to allow for a more precise definition of DublinCore, the 
 keys should be defined as Property objects whose object name and name 
 attribute contain a prefix, with the existing String keys deprecated, e.g.
 {code:title=DublinCore.java}
 String SUBJECT = "subject";
 {code}
 would become:
 {code:title=DublinCore.java}
 @Deprecated
 String SUBJECT = "subject";
 Property DC_SUBJECT = Property.internalTextBag(
         PREFIX_DC + PREFIX_DELIMITER + "subject");
 {code}
 Since the use of the simpler key definition is desired eventually, at some 
 point in the future, perhaps 2.0, these prefixed definitions could themselves 
 be deprecated and the move made back to the simpler names.
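
The collision the proposal is meant to avoid can be sketched with plain maps. This is an illustration only and does not use Tika's Metadata or Property classes; the values "dc" and ":" for PREFIX_DC and PREFIX_DELIMITER are assumptions (the description names the constants but does not show their values), and the "office" prefix is a hypothetical second schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical demonstration of bare versus prefixed metadata keys. */
public class PrefixedKeys {
    // Assumed values; the proposal names these constants but does not show them.
    static final String PREFIX_DC = "dc";
    static final String PREFIX_OFFICE = "office"; // hypothetical second schema
    static final String PREFIX_DELIMITER = ":";

    /** Builds a key such as "dc:subject" from a prefix and a bare name. */
    public static String prefixed(String prefix, String name) {
        return prefix + PREFIX_DELIMITER + name;
    }

    /** Stores a "subject" value from two schemas, with or without prefixes. */
    public static Map<String, String> store(boolean usePrefixes) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String dcKey = usePrefixes ? prefixed(PREFIX_DC, "subject") : "subject";
        String officeKey = usePrefixes ? prefixed(PREFIX_OFFICE, "subject") : "subject";
        metadata.put(dcKey, "value from DublinCore");
        // With bare keys this second put overwrites the first entry.
        metadata.put(officeKey, "value from another schema");
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println("bare keys:     " + store(false));
        System.out.println("prefixed keys: " + store(true));
    }
}
```

With bare keys the second value silently replaces the first, while the prefixed keys keep both entries.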


[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects

2012-03-07 Thread Ray Gauss II (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-859:
--

Attachment: (was: dublincore-prefixed-patch.diff)


Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Zabrane Mickael

Hi guys,

Congrats on the v1.1 rc1.

Compiles fine for me (OSX Lion 10.7.3 + OSX Snow Leopard 10.8.6). All tests 
passed.

+1

Regards,
Zabrane


Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Ken Krugler

Hi Chris,

On Mar 7, 2012, at 1:35pm, Mattmann, Chris A (388J) wrote:

 Hi Folks,
 
 A candidate for the Tika 1.1 release is available at:
 
  http://people.apache.org/~mattmann/apache-tika-1.1/rc1/

I'm curious why you've got just the tika-app-1.1.jar (plus release sources), 
and not any of the other artifacts?

I was hoping to grab the jars, do a manual mvn install onto my Mac, and then 
try them out with some web crawling code.

I can of course build from source, but it seems like that adds another 
potential delta between the artifacts that get released and what I'm testing.

Thanks,

-- Ken


 
 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-07 Thread Mattmann, Chris A (388J)

Hey Ken,

Sorry about that! Forgot to include the link to the staged Maven2 repo, here:

https://repository.apache.org/content/repositories/orgapachetika-066/

There ya go.

Cheers,
Chris



