[jira] [Updated] (TIKA-423) Parse docx and output to text file missing words

2011-10-07 Thread Jukka Zitting (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting updated TIKA-423: Affects Version/s: 0.8, 0.9, 0.10 This is still a problem

[jira] [Updated] (TIKA-410) textbox content extaction for word documents

2011-10-07 Thread Jukka Zitting (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting updated TIKA-410: Affects Version/s: 0.10 This is still an issue with Tika 0.10 and the latest trunk.

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-07 Thread Anirban Mitra (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122614#comment-13122614 ] Anirban Mitra commented on TIKA-734: Thanks. I will let you know soon. -- Anirban

[jira] [Commented] (TIKA-272) Expose characters offsets information while parsing text-based inputs.

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122625#comment-13122625 ] Jukka Zitting commented on TIKA-272: See PDFBOX-577 for some related work in PDFBox.

[jira] [Resolved] (TIKA-123) Structured MS Office parsing

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-123. Resolution: Duplicate Much of this was already implemented recently in other issues, so resolving as

[jira] [Resolved] (TIKA-429) Error parsing DTD

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-429. Resolution: Fixed Fix Version/s: 1.0 Assignee: Jukka Zitting Looks like there's no

[jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122639#comment-13122639 ] Jukka Zitting commented on TIKA-513: Is there a DjVu parser we could use?

[jira] [Resolved] (TIKA-554) ParseUtils.getStringContent needs an option to set the write limit that can be passed into the BodyContentHandler

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-554. Resolution: Won't Fix Assignee: Jukka Zitting Resolving as Won't Fix since the ParseUtils

[jira] [Resolved] (TIKA-581) Parser fails on files that parsed with v0.7

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-581. Resolution: Fixed Fix Version/s: 1.0 Assignee: Jukka Zitting This was already fixed.

[jira] [Resolved] (TIKA-576) OutofMemory issues while building Tika

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-576. Resolution: Won't Fix Resolving as Won't Fix since this is a rare enough problem and the workaround

[jira] [Resolved] (TIKA-509) Container contents extraction

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-509. Resolution: Fixed Fix Version/s: 1.0 Resolving as fixed as discussed above.

[jira] [Resolved] (TIKA-685) Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1a8402c

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-685. Resolution: Duplicate Works with latest Tika, so resolving as a duplicate of some of the other

Re: [jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Oleg Tikhonov
There is one (GPL) I've been playing with: http://javadjvu.foxtrottechnologies.com/ However, in order to extract text/context from images, we have to find a suitable OCR implementation. On Fri, Oct 7, 2011 at 11:02 AM, Jukka Zitting (Commented) (JIRA) j...@apache.org wrote: [

[jira] [Commented] (TIKA-682) Creative Suite formats are not supported

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122934#comment-13122934 ] Nick Burch commented on TIKA-682: ImageParser currently claims to support image/x-psd,

[jira] [Created] (TIKA-748) RTF parser fails to extract the body

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
RTF parser fails to extract the body Key: TIKA-748 URL: https://issues.apache.org/jira/browse/TIKA-748 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10

[jira] [Updated] (TIKA-748) RTF parser fails to extract the body

2011-10-07 Thread Andrzej Bialecki (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated TIKA-748: Attachment: test.rtf RTF parser fails to extract the body

[jira] [Resolved] (TIKA-541) Use commons-cli in lieu of writing our own option parser

2011-10-07 Thread Jukka Zitting (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-541. Resolution: Won't Fix I don't see much benefit to using commons-cli in our case, so resolving as

[jira] [Commented] (TIKA-682) Creative Suite formats are not supported

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123181#comment-13123181 ] Nick Burch commented on TIKA-682: I've added a basic metadata extracting parser in

Build failed in Jenkins: Tika-trunk » Apache Tika parsers #674

2011-10-07 Thread Apache Jenkins Server
See https://builds.apache.org/job/Tika-trunk/org.apache.tika$tika-parsers/674/changes Changes: [nick] TIKA-682 Add a basic PSD metadata extracting Parser -- ignoring exception during new ExecutedMojo null [PMD] Skipping maven reporter: there is already a

Build failed in Jenkins: Tika-trunk #674

2011-10-07 Thread Apache Jenkins Server
See https://builds.apache.org/job/Tika-trunk/674/changes Changes: [nick] TIKA-682 Add a basic PSD metadata extracting Parser [nick] TIKA-749 Add EndianUtils, which provides a way to read small and big endian numbers from streams, based on the version in POI [nick] TIKA-682 Add mime magic

[jira] [Resolved] (TIKA-749) Avoid using POI's LittleEndian in non-POI parsers

2011-10-07 Thread Nick Burch (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-749. Resolution: Fixed Avoid using POI's LittleEndian in non-POI parsers

[jira] [Commented] (TIKA-749) Avoid using POI's LittleEndian in non-POI parsers

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123200#comment-13123200 ] Nick Burch commented on TIKA-749: Done in r1180243. Avoid using POI's

Jenkins build is back to normal : Tika-trunk » Apache Tika parsers #675

2011-10-07 Thread Apache Jenkins Server
See https://builds.apache.org/job/Tika-trunk/org.apache.tika$tika-parsers/675/changes

Jenkins build is back to normal : Tika-trunk #675

2011-10-07 Thread Apache Jenkins Server
See https://builds.apache.org/job/Tika-trunk/675/changes