[jira] [Updated] (TIKA-593) Tika network server

2012-03-27 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-593:
---

Attachment: TIKA-593.Mattmann.032612.patch.2.txt

- ok tests passing, mostly. Will finish tomorrow morning!

 Tika network server
 ---

 Key: TIKA-593
 URL: https://issues.apache.org/jira/browse/TIKA-593
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
 Fix For: 1.2

 Attachments: TIKA-593.Mattmann.032612.patch.2.txt, 
 TIKA-593.Mattmann.032612.patch.txt, TIKA-593_pom.diff


 It would be cool to be able to run Tika as a network service that accepts a 
 binary document as input and produces the extracted content (as XHTML, text, 
 or just metadata) as output. A bit like TIKA-169, but without the dependency 
 on a servlet container.
 I'd like to be able to set up and run such a server like this:
 $ java -jar tika-app.jar --port 1234
 We should also add a NetworkParser class that acts as a local client for such 
 a service. This way a lightweight client could use the full set of Tika 
 parsing functionality even with just the tika-core jar within its classpath.
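
A minimal sketch of the kind of container-free service described above, using only the JDK's built-in HTTP server. This is not the code in the attached patches: the class name, the /extract path and the extractor function (a stand-in for a real Tika parse call) are all assumptions made for illustration. Port 1234 is taken from the example command line.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical stand-alone extraction server; names are illustrative only. */
final class ExtractionServerSketch {

    /**
     * Starts an HTTP server on the given port (0 picks a free port).
     * The extractor stands in for a real Tika parse call.
     */
    static HttpServer start(int port, Function<byte[], String> extractor) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/extract", exchange -> {
            try {
                // The request body is the binary document to process.
                byte[] document = exchange.getRequestBody().readAllBytes();
                byte[] text = extractor.apply(document).getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
                exchange.sendResponseHeaders(200, text.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(text);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 1234;
        // Placeholder extractor: reports the size instead of parsing the document.
        start(port, bytes -> "received " + bytes.length + " bytes\n");
    }
}
```

A client would send the document as the body of a PUT or POST request to /extract and read the extracted text from the response.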

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-593) Tika network server

2012-03-27 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-593:
---

Attachment: TIKA-593.Mattmann.032712.patch.2.txt

 Tika network server
 ---

 Key: TIKA-593
 URL: https://issues.apache.org/jira/browse/TIKA-593
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
 Fix For: 1.2

 Attachments: TIKA-593.Mattmann.032612.patch.2.txt, 
 TIKA-593.Mattmann.032612.patch.txt, TIKA-593.Mattmann.032712.patch.2.txt, 
 TIKA-593.Mattmann.032712.patch.txt, TIKA-593_pom.diff


 It would be cool to be able to run Tika as a network service that accepts a 
 binary document as input and produces the extracted content (as XHTML, text, 
 or just metadata) as output. A bit like TIKA-169, but without the dependency 
 on a servlet container.
 I'd like to be able to set up and run such a server like this:
 $ java -jar tika-app.jar --port 1234
 We should also add a NetworkParser class that acts as a local client for such 
 a service. This way a lightweight client could use the full set of Tika 
 parsing functionality even with just the tika-core jar within its classpath.





[jira] [Updated] (TIKA-593) Tika network server

2012-03-26 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-593:
---

Attachment: TIKA-593.Mattmann.032612.patch.txt

- Max FYI my current progress. I'm trying to get the unit tests rewritten but 
they are failing right now. Check out MetadataResource to see. The cool part is 
that we reduce a bunch of the Maven dependencies with CXF and we are eating our 
own dog food. I will go to the CXF lists tomorrow with my question about the 
failing unit tests.

 Tika network server
 ---

 Key: TIKA-593
 URL: https://issues.apache.org/jira/browse/TIKA-593
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
 Fix For: 1.2

 Attachments: TIKA-593.Mattmann.032612.patch.txt, TIKA-593_pom.diff


 It would be cool to be able to run Tika as a network service that accepts a 
 binary document as input and produces the extracted content (as XHTML, text, 
 or just metadata) as output. A bit like TIKA-169, but without the dependency 
 on a servlet container.
 I'd like to be able to set up and run such a server like this:
 $ java -jar tika-app.jar --port 1234
 We should also add a NetworkParser class that acts as a local client for such 
 a service. This way a lightweight client could use the full set of Tika 
 parsing functionality even with just the tika-core jar within its classpath.





[jira] [Updated] (TIKA-874) Identify FITS (Flexible Image Transport System) files

2012-03-12 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-874:
---

Affects Version/s: (was: 1.2)
   (was: 1.1)
Fix Version/s: 1.2

- update fix version, no affects version since new feature.

 Identify FITS (Flexible Image Transport System) files
 -

 Key: TIKA-874
 URL: https://issues.apache.org/jira/browse/TIKA-874
 Project: Tika
  Issue Type: Improvement
  Components: mime
Reporter: Peter May
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.2

 Attachments: fits_support.patch


 Tika does not have a defined signature for application/fits files.  I have 
 created a patch (based on file(1) magic) to address identification of such 
 files, including a simple unit test.
 This patch only handles identification, not parsing of FITS files.
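
For readers unfamiliar with magic-based identification, here is a minimal sketch of such a check. It assumes the signature is the "SIMPLE  =" keyword that opens a FITS primary header; the attached patch may define the signature differently, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical magic check for FITS; not the code from fits_support.patch. */
final class FitsMagicSketch {
    // A FITS primary header starts with the keyword SIMPLE padded to 8 characters, then "=".
    private static final byte[] MAGIC = "SIMPLE  =".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first bytes of the file match the assumed FITS signature. */
    static boolean looksLikeFits(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```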





[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-817:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 (PPT/PPTX) Missing date/time in text content.
 -

 Key: TIKA-817
 URL: https://issues.apache.org/jira/browse/TIKA-817
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.2


 Missing date/time text in text content for PPT and PPTX files.
 The date and time are missing from the text content.  This occurs when one 
 chooses the following with MS-PowerPoint 2010:
 1) Insert
 2) Date & Time
 3) Update automatically
 4) save to PPT or PPTX





[jira] [Updated] (TIKA-861) Parse links in PDF

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-861:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Parse links in PDF
 --

 Key: TIKA-861
 URL: https://issues.apache.org/jira/browse/TIKA-861
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
Reporter: Sasha Goodman
Priority: Minor
  Labels: links, pdfbox
 Fix For: 1.2

   Original Estimate: 4h
  Remaining Estimate: 4h

 Currently the XHTML doesn't contain links, although PDFBox parses them. I'm 
 new to Tika and haven't done Java for 6 years, but someone more experienced 
 could probably do this in a few hours. 
 The PDF2XHTML class loops through the annotations of each page. 
 See: 
 {code:java}
 for (Object o : page.getAnnotations()) {  // PDF2XHTML, line 136
 {code}
 I found some code for dealing with links in annotations:
 http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
 It involves checking the class. 
 {code:java}
 if (annotation instanceof PDAnnotationLink) {
     PDAnnotationLink link = (PDAnnotationLink) annotation;
     // the link target would be written to the XHTML output here
 }
 {code}
 I hope this helps someone.





[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-868:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 TXT parser does not honour the specified encoding
 -

 Key: TIKA-868
 URL: https://issues.apache.org/jira/browse/TIKA-868
 Project: Tika
  Issue Type: Bug
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.2


 With input text "Indanyl", the encoding is recognized as IBM500, even when 
 UTF-8 is specified explicitly.
 I would argue that detection should only be used when the declared 
 information is incorrect (saving time and avoiding wrong detection), as 
 proposed by Ken Krugler in TIKA-539.
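
The proposal above can be sketched as follows: trust the declared encoding whenever the bytes decode cleanly with it, and only fall back to detection otherwise. This is an illustration of the idea, not Tika's implementation; the class name and the detector parameter (a stand-in for whatever charset detector is in use) are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.function.Function;

/** Hypothetical helper: prefer the declared charset, detect only when it cannot be right. */
final class DeclaredEncodingFirstSketch {

    static Charset choose(byte[] data, Charset declared, Function<byte[], Charset> detector) {
        if (declared != null) {
            try {
                // Strict decoding: any malformed or unmappable input raises an exception.
                declared.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return declared; // the declared encoding is consistent with the bytes
            } catch (CharacterCodingException e) {
                // The declaration is demonstrably wrong; fall through to detection.
            }
        }
        return detector.apply(data);
    }
}
```

With this approach a short ASCII input such as "Indanyl" declared as UTF-8 never reaches the detector. The trade-off is that single-byte encodings accept almost any byte sequence, so a wrong declaration of that kind would go unnoticed.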





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.2

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  I.e., each startElement(foo) is
 matched by the closing endElement(foo).
 I only did a basic nesting test, plus checking that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
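
The kind of check described above can be sketched with a stack of open element names. This is an independent illustration, not the code committed to SafeContentHandler; the class name is hypothetical, and the messages are modelled on the failure messages quoted in this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that asserts SAX start/end events are properly nested. */
final class TagBalanceCheckSketch extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Extra rule from the description: a p may not appear anywhere inside another p.
        if ("p".equals(qName) && open.contains("p")) {
            throw new AssertionError("p inside p");
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new AssertionError("mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() {
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements: " + open);
        }
    }
}
```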
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
 java.lang.AssertionError: p inside p
 

[jira] [Updated] (TIKA-816) (XLS/XLSX) Improperly formatted date/time in text content.

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-816:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 (XLS/XLSX) Improperly formatted date/time in text content.
 --

 Key: TIKA-816
 URL: https://issues.apache.org/jira/browse/TIKA-816
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.0
 Environment: Win7-64 + java version 1.6.0_26
Reporter: Albert L.
 Fix For: 1.2


 Improperly formatted text content for XLS and XLSX files.
 The date and time are rendered as floating-point numbers rather than as 
 date/time values.  This occurs for cells whose content is =now() or 
 =today().
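
For context, the floating-point values are Excel serial dates: the integer part counts days and the fraction is the time of day. A minimal sketch of the conversion follows, assuming the default 1900 date system and serials after February 1900. The class name is hypothetical and this is not how Tika or POI implement it.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

/** Hypothetical converter from an Excel serial number to a date/time value. */
final class ExcelSerialDateSketch {
    // Reference day for the 1900 date system; only valid for dates after February 1900.
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    static LocalDateTime toDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        // The fractional part is the fraction of a 24-hour day.
        long millis = Math.round((serial - wholeDays) * 86_400_000L);
        return EPOCH.plusDays(wholeDays).plus(millis, ChronoUnit.MILLIS);
    }
}
```

For example, the serial 40975.5 corresponds to noon on 2012-03-07.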





[jira] [Updated] (TIKA-605) Tika GDAL parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Tika GDAL parser
 

 Key: TIKA-605
 URL: https://issues.apache.org/jira/browse/TIKA-605
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gdal, integration, tika
 Fix For: 1.2

 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
 TIKA-605.Mattmann.092511.patch.txt


 Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
 around GDAL. See here: 
 http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.2


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-758:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.2


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.





[jira] [Updated] (TIKA-776) ExifTool Embedder

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.2

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
 issue TIKA-774 and TIKA-775.
 In the tika-parsers an ExiftoolExternalEmbedder is added which extends 
 ExternalEmbedder to programmatically create an Embedder which calls the 
 ExifTool command line to embed tika metadata into a file stream and an 
 ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and 
 XMP fields then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-820) Locator is unset for HTML parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-820:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Locator is unset for HTML parser
 

 Key: TIKA-820
 URL: https://issues.apache.org/jira/browse/TIKA-820
 Project: Tika
  Issue Type: Bug
  Components: general, parser
Affects Versions: 1.0
Reporter: Daniel Bonniot de Ruisselet
  Labels: patch
 Fix For: 1.2

 Attachments: text-locator.patch


 The HtmlParser does not call setDocumentLocator(Locator locator) on the 
 user's content handler.
 Patch and unit test attached.
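
To illustrate what a user's content handler expects, here is a sketch of a handler that relies on the locator. It uses the JDK's own SAX parser rather than HtmlParser, so it only shows the contract, not the fix in the attached patch; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records line numbers via the document locator. */
final class LocatorAwareHandlerSketch extends DefaultHandler {
    private Locator locator;      // stays null if the parser never supplies one
    int lastElementLine = -1;     // -1 means no position information was available

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            lastElementLine = locator.getLineNumber();
        }
    }

    /** Parses the XML with the JDK SAX parser and returns the line of the last start tag. */
    static int lineOfLastElement(String xml) throws Exception {
        LocatorAwareHandlerSketch handler = new LocatorAwareHandlerSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.lastElementLine;
    }
}
```

A parser that never calls setDocumentLocator leaves lastElementLine at -1, which is the situation this issue reports for HTML input.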





[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Automatic line break insertion (BR element) instead of '\n' in 
 XHTMLContentHandler
 --

 Key: TIKA-754
 URL: https://issues.apache.org/jira/browse/TIKA-754
 Project: Tika
  Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
 Fix For: 1.2

 Attachments: TIKA-754.poc.patch


 As seen with some parsers (PDF, PPT), some text blocks still contain line 
 breaks ('\n') in the output XHTML. 
 A global fix for this could be located in XHTMLContentHandler.characters(...):
 when a '\n' char is encountered while analyzing the given char array, insert a 
 BR element instead.
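
A rough sketch of that idea, written as a stand-alone SAX filter rather than as a change to XHTMLContentHandler. It is not the attached proof-of-concept patch; the class name is hypothetical and the XHTML namespace is assumed for the br element.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that turns '\n' in character data into br elements. */
final class NewlineToBrSketch extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    NewlineToBrSketch(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int segmentStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                // Flush the text before the newline, then emit an empty br element.
                if (i > segmentStart) {
                    super.characters(ch, segmentStart, i - segmentStart);
                }
                super.startElement(XHTML, "br", "br", new AttributesImpl());
                super.endElement(XHTML, "br", "br");
                segmentStart = i + 1;
            }
        }
        // Flush whatever follows the last newline.
        if (end > segmentStart) {
            super.characters(ch, segmentStart, end - segmentStart);
        }
    }
}
```

One open design question with this approach is whitespace-sensitive content such as pre blocks, where replacing newlines may not be wanted.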





[jira] [Updated] (TIKA-775) Embed Capabilities

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-775:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Embed Capabilities
 --

 Key: TIKA-775
 URL: https://issues.apache.org/jira/browse/TIKA-775
 Project: Tika
  Issue Type: Improvement
  Components: general, metadata
Affects Versions: 1.0
 Environment: The default ExternalEmbedder requires that sed be 
 installed.
Reporter: Ray Gauss II
  Labels: embed, patch
 Fix For: 1.2

 Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt


 This patch defines and implements the concept of embedding tika metadata into 
 a file stream, the reverse of extraction.
 In the tika-core project an interface defining an Embedder and a generic sed 
 ExternalEmbedder implementation meant to be extended or configured are added. 
  These classes are essentially a reverse flow of the existing Parser and 
 ExternalParser classes.
 In the tika-parsers project an ExternalEmbedderTest unit test is added which 
 uses the default ExternalEmbedder (calls sed) to embed a value placed in 
 Metadata.DESCRIPTION then verify the operation by parsing the resulting 
 stream.





[jira] [Updated] (TIKA-593) Tika network server

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-593:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Tika network server
 ---

 Key: TIKA-593
 URL: https://issues.apache.org/jira/browse/TIKA-593
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
 Fix For: 1.2

 Attachments: TIKA-593_pom.diff


 It would be cool to be able to run Tika as a network service that accepts a 
 binary document as input and produces the extracted content (as XHTML, text, 
 or just metadata) as output. A bit like TIKA-169, but without the dependency 
 on a servlet container.
 I'd like to be able to set up and run such a server like this:
 $ java -jar tika-app.jar --port 1234
 We should also add a NetworkParser class that acts as a local client for such 
 a service. This way a lightweight client could use the full set of Tika 
 parsing functionality even with just the tika-core jar within its classpath.





[jira] [Updated] (TIKA-859) DublinCore Metadata Keys Should be Prefixed and Property Objects

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-859:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 DublinCore Metadata Keys Should be Prefixed and Property Objects
 

 Key: TIKA-859
 URL: https://issues.apache.org/jira/browse/TIKA-859
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.1
Reporter: Ray Gauss II
 Fix For: 1.2

 Attachments: dublincore-prefixed-patch.diff


 To help avoid collisions between key names in the interfaces that Metadata 
 implements, and to allow a more precise definition of DublinCore, the keys 
 should be defined as Property objects whose object name and name attribute 
 both carry a prefix, with the existing String keys deprecated, i.e.
 {code:title=DublinCore.java}
 String SUBJECT = "subject";
 {code}
 would become:
 {code:title=DublinCore.java}
 @Deprecated
 String SUBJECT = "subject";
 Property DC_SUBJECT = Property.internalTextBag(
     PREFIX_DC + PREFIX_DELIMITER + "subject");
 {code}
 Since the use of the simpler key definition is desired eventually, at some 
 point in the future, perhaps 2.0, these prefixed definitions could themselves 
 be deprecated and the move made back to the simpler names.





[jira] [Updated] (TIKA-774) ExifTool Parser

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 ExifTool Parser
 ---

 Key: TIKA-774
 URL: https://issues.apache.org/jira/browse/TIKA-774
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.0
 Environment: Requires ExifTool to be installed 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: features, newbie, patch,
 Fix For: 1.2

 Attachments: testJPEG_IPTC_EXT.jpg, 
 tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt


 Adds an external parser that calls ExifTool to extract extended metadata 
 fields from images and other content types.
 In the core project:
 An ExifTool interface is added which contains Property objects that define 
 the metadata fields available.
 An additional Property constructor is added for the internalTextBag type.
 In the parsers project:
 An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
 on the command line and mapping the response to tika metadata fields.  This 
 extractor could be called instead of or in addition to the existing 
 ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
 JpegParser but those have not been changed at this time.
 An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
 An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
 metadata fields to existing tika and Drew Noakes metadata fields if enabled.
 An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
 implementations in XML files.
 An ExifToolParserTest is added which tests several expected XMP and IPTC 
 metadata values in testJPEG_IPTC_EXT.jpg.





[jira] [Updated] (TIKA-842) IPTC Properties Should be Defined Completely and Independently of the Drew Library

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-842:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 IPTC Properties Should be Defined Completely and Independently of the Drew 
 Library
 --

 Key: TIKA-842
 URL: https://issues.apache.org/jira/browse/TIKA-842
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.0
Reporter: Ray Gauss II
 Fix For: 1.2

 Attachments: IPTC-metadata-def-patch.diff, 
 iptc-dublincore-aliased-patch.diff, metadata-remove-iptc-patch.diff


 All of the IPTC XMP specification should be defined in tika-core and should 
 not be reliant on the Drew Noakes library as it is incomplete in its support 
 of the standard and the properties are not defined in proper namespaces or 
 prefixed.





[jira] [Updated] (TIKA-862) JPSS HDF5 files not being detected appropriately

2012-02-16 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-862:
---

  Component/s: parser
Affects Version/s: 1.0

- classify and identify version (I think)

 JPSS HDF5 files not being detected appropriately
 

 Key: TIKA-862
 URL: https://issues.apache.org/jira/browse/TIKA-862
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0
Reporter: Richard Yu
Assignee: Chris A. Mattmann

 As commented in TIKA-614, JPSS HDF 5 files are not being properly detected by 
 Tika. See this:
 from [~minfing]:
 {quote}
 We were trying to extract metadata from our h5 file (i.e. with JPSS 
 extension). We ran the following command line:
 {noformat}
 [ryu@localhost hdf5extractor]$ java -jar tika-app-1.0.jar -m \
  /usr/local/staging/products/h5/SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
 Content-Encoding: windows-1252
 Content-Length: 22187952
 Content-Type: text/plain
 resourceName: 
 SVM13_npp_d20120122_t1659139_e1700381_b01225_c20120123000312144174_noaa_ops.h5
 [ryu@localhost hdf5extractor]$
 {noformat}
 We noticed that the content type is text/plain and that there are only 4 lines 
 of output (we expected a lot of metadata).
 Let me know if more information is needed. Thanks!
 Richard
 {quote}
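
 For context, the HDF5 format identifies files by a fixed 8-byte signature 
 rather than by file extension. The following is a minimal sketch of such a 
 check, not Tika's detector: the class and method names are hypothetical, and 
 it only looks at offset 0, although the format also allows the signature at 
 offsets 512, 1024, 2048 and so on. Whether a missing signature match is the 
 cause of the text/plain result reported above is not established here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks for the HDF5 format signature.
class Hdf5SignatureSketch {

    // The 8-byte HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'
    static final byte[] SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    // Returns true if the given header bytes begin with the HDF5 signature.
    // Only offset 0 is checked in this sketch.
    static boolean startsWithHdf5Signature(byte[] header) {
        return header.length >= SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(header, SIGNATURE.length), SIGNATURE);
    }

    public static void main(String[] args) {
        byte[] hdf5Like = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n', 0, 0};
        byte[] plainText = "just some text".getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithHdf5Signature(hdf5Like));   // true
        System.out.println(startsWithHdf5Signature(plainText));  // false
    }
}
```

 Reading the first 8 bytes of the .h5 file in question and passing them to 
 this check would show whether the signature sits at the start of the file.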





[jira] [Updated] (TIKA-605) Tika GDAL parser

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Tika GDAL parser
 

 Key: TIKA-605
 URL: https://issues.apache.org/jira/browse/TIKA-605
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gdal, integration, tika
 Fix For: 1.1

 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
 TIKA-605.Mattmann.092511.patch.txt


 Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
 around GDAL. See here: 
 http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Automatic line break insertion (BR element) instead of '\n' in 
 XHTMLContentHandler
 --

 Key: TIKA-754
 URL: https://issues.apache.org/jira/browse/TIKA-754
 Project: Tika
  Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
 Fix For: 1.1

 Attachments: TIKA-754.poc.patch


 As seen with some parsers (PDF, PPT), some text blocks still contain line 
 breaks ('\n') in the output XHTML. 
 A global fix for this could be located in XHTMLContentHandler.characters(...).
 While scanning the given char array, it would insert a BR element in place of 
 each '\n' char it encounters.
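
 The idea above can be sketched as a small SAX decorator. This is an 
 illustration only, not the attached patch: BrInsertingHandler, render and 
 LineBreakSketch are hypothetical names, it uses only the JDK's org.xml.sax 
 types rather than XHTMLContentHandler itself, and it does not treat "\r\n" 
 specially.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code.
class LineBreakSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Forwards character data to a delegate, emitting an empty br element
    // in place of every '\n' found in the char array.
    static class BrInsertingHandler extends DefaultHandler {
        private final ContentHandler out;

        BrInsertingHandler(ContentHandler out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int segmentStart = start;
            int end = start + length;
            for (int i = start; i < end; i++) {
                if (ch[i] == '\n') {
                    // Flush the text before the newline, then emit <br/>.
                    if (i > segmentStart) {
                        out.characters(ch, segmentStart, i - segmentStart);
                    }
                    out.startElement(XHTML, "br", "br", new AttributesImpl());
                    out.endElement(XHTML, "br", "br");
                    segmentStart = i + 1;
                }
            }
            // Flush whatever follows the last newline.
            if (end > segmentStart) {
                out.characters(ch, segmentStart, end - segmentStart);
            }
        }
    }

    // Runs text through the decorator and renders the events as a string.
    static String render(String text) {
        StringBuilder sb = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                sb.append('<').append(qName).append("/>");
            }

            @Override
            public void characters(char[] ch, int s, int l) {
                sb.append(ch, s, l);
            }
        };
        try {
            new BrInsertingHandler(recorder).characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("first line\nsecond line")); // first line<br/>second line
    }
}
```

 Splitting inside characters(...) keeps the change in one place, so individual 
 parsers would not need to know about it.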





[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-757:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Address TODOs when we upgrade to next POI release (3.8 beta 5)
 --

 Key: TIKA-757
 URL: https://issues.apache.org/jira/browse/TIKA-757
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.1


 I'm opening a blanket issue to remind us all to address the TODOs in the 
 sources for when we upgrade to the next POI.
 I think this (a single blanket issue) is better than keeping separate issues 
 open even though they are technically fixed?
 For example, I've committed TIKA-753 (speedups for embedded office docs), yet 
 it included some TODOs for further speedups possible once we upgrade POI.  
 Rather than keeping TIKA-753 (and others like it) open, I think we should 
 resolve them and let this issue cover all the TODOs.





[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-758:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.1


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.1

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  I.e., each startElement(foo) is
 matched by the closing endElement(foo).
 I only did a basic nesting test, plus a check that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
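
 The kind of check described above can be sketched with a stack of open 
 element names. This is an illustration under assumptions, not the code in 
 SafeContentHandler: TagBalanceSketch and check are hypothetical names, and 
 events are passed as plain strings ("+name" for a start, "-name" for an end) 
 to keep the example self-contained. The two mismatch messages mirror the 
 assertion text quoted in this issue's failure output.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Tika code.
class TagBalanceSketch {

    // Walks a sequence of start ("+name") and end ("-name") events and
    // returns "ok" or a description of the first problem found.
    static String check(String... events) {
        Deque<String> open = new ArrayDeque<>();
        for (String event : events) {
            String name = event.substring(1);
            if (event.charAt(0) == '+') {
                // Extra rule from the description: p must not nest inside p.
                if (name.equals("p") && open.contains("p")) {
                    return "p nested inside p";
                }
                open.push(name);
            } else {
                if (open.isEmpty()) {
                    return "end tag=" + name + " with no startElement";
                }
                String top = open.pop();
                if (!top.equals(name)) {
                    return "mismatched elements open=" + top + " close=" + name;
                }
            }
        }
        return open.isEmpty() ? "ok" : "unclosed element " + open.peek();
    }

    public static void main(String[] args) {
        // Closing a table without closing its row first.
        System.out.println(check("+table", "+tr", "-table"));
        // prints: mismatched elements open=tr close=table
        System.out.println(check("+body", "+p", "-p", "-body"));
        // prints: ok
    }
}
```

 Checking that tags only appear in valid parents would need a table of 
 allowed parent/child pairs on top of this stack.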
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   ERROR!
 

[jira] [Updated] (TIKA-565) Improved OSGi bundling

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-565:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Improved OSGi bundling
 --

 Key: TIKA-565
 URL: https://issues.apache.org/jira/browse/TIKA-565
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Jukka Zitting
 Fix For: 1.1

 Attachments: core-bundle-fix.diff


 I'd like to add proper integration tests for tika-bundle and expose the Tika 
 facade object as a service so other bundles could access it easily like this:
 @Reference
 private Tika tika;
 It would also be nice to allow other OSGi bundles to expose their Parser 
 implementations as pluggable services and have the Tika bundle automatically 
 pick up and use them along with all the embedded parsers it contains.
