Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/

2011-09-05 Thread Jukka Zitting
Hi, On Mon, Sep 5, 2011 at 12:30 PM, wrote: > Embedded file extraction is broken for some OOXML files > (bug introduced few commits ago) That was me in revision 1164578 for TIKA-704. :-( > -            if (root.hasEntry("CONTENTS")) { > -                stream = TikaInputStream.get( > -      

[jira] [Commented] (TIKA-698) "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003

2011-09-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097150#comment-13097150 ] Michael McCandless commented on TIKA-698: Hmm, I think we should replace invalid ch

[jira] [Created] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Michael McCandless (JIRA)
Valid OOXML PPT file hits InvalidFormatException thrown in POI. Key: TIKA-705 URL: https://issues.apache.org/jira/browse/TIKA-705 Project: Tika Issue Type: Bug Reporte

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-705: Attachment: testPPT_various.pptx PPTX file showing the exception. > Valid OOXML PPT file hit

[jira] [Updated] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-705: Component/s: parser Affects Version/s: 0.9 Fix Version/s: 1.0 > Valid OOXML

[jira] [Commented] (TIKA-698) "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003

2011-09-05 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097160#comment-13097160 ] Jukka Zitting commented on TIKA-698: bq. Hmm, I think we should replace invalid chars wi

Build failed in Jenkins: Tika-trunk #614

2011-09-05 Thread Apache Jenkins Server
See -- Started by an SCM change Building remotely on solaris2 Checking out a fresh workspace because /zonestorage/hudson doesn't exist Cleaning workspace /z

Re: Build failed in Jenkins: Tika-trunk #614

2011-09-05 Thread Jukka Zitting
Hi, On Mon, Sep 5, 2011 at 5:01 PM, Apache Jenkins Server wrote: > ERROR: Failed to check out http://svn.apache.org/repos/asf/tika/trunk > org.tmatesoft.svn.core.SVNException: svn: Cannot create new file > '/zonestorage/hudson: > Per

[jira] [Commented] (TIKA-698) "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003

2011-09-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097191#comment-13097191 ] Michael McCandless commented on TIKA-698: OK, thanks, I'll do that! > "Invalid UTF

Jenkins build is back to normal : Tika-trunk #615

2011-09-05 Thread Apache Jenkins Server
See

Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/

2011-09-05 Thread Maxim Valyanskiy
Hello! On 05.09.2011, at 16:23, Jukka Zitting wrote: > That was me in revision 1164578 for TIKA-704. :-( > >> -if (root.hasEntry("CONTENTS")) { >> -stream = TikaInputStream.get( >> -fs.createDocumentInputStream("CONTENTS")); > > This was my a

[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

2011-09-05 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097212#comment-13097212 ] Jukka Zitting commented on TIKA-704: See also revisions 1165230 and 1165259 for followup

Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/

2011-09-05 Thread Jukka Zitting
Hi, 2011/9/5 Maxim Valyanskiy: > On 05.09.2011, at 16:23, Jukka Zitting wrote: >> This was my attempt at properly handling the embedded PDF in >> TestWithPdf.docx. It was included in an OLE object with the PDF >> document as its "CONTENTS" entry. I restored this functionality with >> some more

Re: svn commit: r1165230 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/microsoft/ooxml/ test/java/org/apache/tika/parser/microsoft/ test/resources/test-documents/

2011-09-05 Thread Nick Burch
On Mon, 5 Sep 2011, Jukka Zitting wrote: Hm, that is strange - current version of OfficeParser.POIFSDocumentType.detectType() thinks that "CONTENTS" part identifies POI filesystem as MS Works document. Maybe this is not right. I think we have some MS Works test files that do contain the "CONT

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097217#comment-13097217 ] Nick Burch commented on TIKA-705: Looks to be a problem with a reference to part of a slide

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097261#comment-13097261 ] Michael McCandless commented on TIKA-705: Thanks for looking at this Nick! So, is

[jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI

2011-09-05 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097319#comment-13097319 ] Nick Burch commented on TIKA-705: I'll need to read the spec to be sure, but I have a feeli