[jira] [Updated] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

2011-10-01 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-735: Attachment: embeddedText.odp ODP document that leads to above text output from TikaCLI -x.

[jira] [Updated] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-01 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-736: Attachment: TIKA-736.patch testMasterFooter.odp Patch with failing test case.

[jira] [Updated] (TIKA-711) Word parser doesn't extract optional hyphen correctly

2011-10-02 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-711: Attachment: TIKA-711.patch OK, after digging I found out that in fact POI's AbstractWordConv

[jira] [Updated] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-721: Attachment: TIKA-721.patch Attached patch, using three simple heuristics: First, I compute t

[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert nor space around page marker

2011-10-04 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-742: Attachment: 86.pdf PDF doc showing the issue (unfortunately not committable).

[jira] [Updated] (TIKA-742) PDF2XHTML fails to insert nor space around page marker

2011-10-04 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-742: Attachment: TIKA-742.patch Patch.

[jira] [Updated] (TIKA-748) RTF parser fails to extract the body

2011-10-09 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-748: Attachment: TIKA-748.patch Patch.

[jira] [Updated] (TIKA-751) Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc

2011-10-12 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-751: Attachment: TIKA-751.patch Patch.

[jira] [Updated] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-14 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-753: Attachment: TIKA-753.patch Patch.

[jira] [Updated] (TIKA-738) Tika fails to extract text from PDF annotations

2011-10-18 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-738: Attachment: TIKA-738.patch Patch, extracting text from annotations; I added an option to PDFP

[jira] [Updated] (TIKA-724) PDF text sometimes has extra space between letters

2011-10-19 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-724: Attachment: TIKA-724.patch Patch.

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-10-20 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-705: Fix Version/s: 1.0

[jira] [Updated] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-26 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-736: Attachment: TIKA-736.patch This turned out to be fairly simple to fix, so I worked out a patc

[jira] [Updated] (TIKA-767) Enable controlling of PDFBOX's setSuppressDuplicateOverlappingText from PDFParser

2011-11-01 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-767: Attachment: TIKA-767.patch Patch w/ test; I added PDFParser.get/setSuppressDuplicateOverlappi

[jira] [Updated] (TIKA-775) Embed Capabilities

2011-11-09 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-775: Fix Version/s: (was: 1.0) 1.1

[jira] [Updated] (TIKA-774) ExifTool Parser

2011-11-09 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-774: Fix Version/s: (was: 1.0) 1.1

[jira] [Updated] (TIKA-776) ExifTool Embedder

2011-11-09 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-776: Fix Version/s: (was: 1.0) 1.1

[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

2011-11-15 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-612: Attachment: TIKA-612.patch Patch, just adding setSortByPosition to PDFParser. I think this i

[jira] [Updated] (TIKA-738) Tika fails to extract text from PDF annotations

2011-11-26 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-738: Attachment: TIKA-738.patch Patch, fixing the excess tag.

[jira] [Updated] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-08 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-801: Attachment: TIKA-801.patch bq. See the org.apache.tika.sax.EmbeddedContentHandler class. Ex

[jira] [Updated] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Michael McCandless (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-870: Attachment: TIKA-870.patch Patch, with the sample code plus a test case. The test case faile