[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

2011-09-01 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095644#comment-13095644 ] Uwe Schindler commented on TIKA-683: XML SAX Handling does not validate the element name

[jira] [Updated] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2011-09-01 Thread Curt Arnold (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Curt Arnold updated TIKA-207: - Attachment: TIKA-207.patch Replaces earlier patch which could throw a NullPointerException when rendering

[jira] [Updated] (TIKA-683) RTF Parser issues with non european characters

2011-09-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-683: Attachment: TIKA-683.patch Attached patch, with a first cut at using a simple (shallow) token

[jira] [Updated] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2011-09-01 Thread Curt Arnold (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Curt Arnold updated TIKA-207: - Attachment: TIKA-207.patch Refined fix to suppress deleted text in .doc files. Will follow up with test ca

[jira] [Resolved] (TIKA-687) Temporary file not removed after detection

2011-09-01 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-687. Resolution: Duplicate Assignee: Jukka Zitting Right, sorry for overlooking this issue! The prop

[jira] [Updated] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

2011-09-01 Thread Jeremy Anderson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Anderson updated TIKA-704: - Attachment: TestWithPdf.docx TestWithOutlook.docx recursiveUsage.txt

[jira] [Created] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

2011-09-01 Thread Jeremy Anderson (JIRA)
PDF and Outlook docs embedded in MS Word documents not parsed - Key: TIKA-704 URL: https://issues.apache.org/jira/browse/TIKA-704 Project: Tika Issue Type: Bug Components:

[jira] [Commented] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Paul Jakubik (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095422#comment-13095422 ] Paul Jakubik commented on TIKA-701: --- This is a very important fix. Will it be released soo

Re: svn commit: r1163970 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-core/src/main/java/org/apache/tika/io/ tika-core/src/main/java/org/apache/tika/parser/ tika-core/src/

2011-09-01 Thread Jukka Zitting
Hi, On Thu, Sep 1, 2011 at 5:08 PM, Michael McCandless wrote: > We might want to mark APIs like TemporaryResources "internal" in the > javadocs, i.e., that we reserve the right to suddenly change them and > they are just public so that the sub-packages in Tika can use them. The trouble is that we'l

Re: svn commit: r1163970 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-core/src/main/java/org/apache/tika/io/ tika-core/src/main/java/org/apache/tika/parser/ tika-core/src/

2011-09-01 Thread Mattmann, Chris A (388J)
On Sep 1, 2011, at 8:08 AM, Michael McCandless wrote: > OK thanks Jukka. > > We might want to mark APIs like TemporaryResources "internal" in the > javadocs, i.e., that we reserve the right to suddenly change them and > they are just public so that the sub-packages in Tika can use them. > In Lucene

[jira] [Commented] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095371#comment-13095371 ] Jukka Zitting commented on TIKA-701: The idea behind that logic is that if the stream we

[jira] [Commented] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095363#comment-13095363 ] Michael McCandless commented on TIKA-701: - These changes look great! I like that TI

[jira] [Commented] (TIKA-687) Temporary file not removed after detection

2011-09-01 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095359#comment-13095359 ] Michael McCandless commented on TIKA-687: - I think this may have been fixed by TIKA-

Re: [jira] [Resolved] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Mark Kerzner
Thank you, will try it soon :) Mark On Thu, Sep 1, 2011 at 10:32 AM, Jukka Zitting (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] > > Jukka Zitting resolved TIKA-701. > >

[jira] [Resolved] (TIKA-701) Fix problems with TemporaryFiles

2011-09-01 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-701. Resolution: Fixed Assignee: Jukka Zitting Fixed in a series of recent commits. To summarize, I

Re: [jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2011-09-01 Thread Mark Kerzner
>From this comment I see that one can tell whether this MS Word has "track changes" on, is that true? -- Thank you. Mark On Thu, Sep 1, 2011 at 10:24 AM, Curt Arnold (JIRA) wrote: > >[ > https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comm

[jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2011-09-01 Thread Curt Arnold (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095346#comment-13095346 ] Curt Arnold commented on TIKA-207: -- I also ran into this problem and at least the manifesta

Re: Resource management patterns (Was: Tika leaves files open)

2011-09-01 Thread Michael McCandless
On Thu, Sep 1, 2011 at 7:26 AM, Jukka Zitting wrote: > Hi, > > [update subject, move to dev@] > > On Thu, Sep 1, 2011 at 12:41 PM, Uwe Schindler wrote: >> With our internal Lucene IOUtils it's even simpler, see javadocs :-) > > Yep, Lucene's version is certainly better. > >> It's just a few line

Re: svn commit: r1163970 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-core/src/main/java/org/apache/tika/io/ tika-core/src/main/java/org/apache/tika/parser/ tika-core/src/

2011-09-01 Thread Michael McCandless
OK thanks Jukka. We might want to mark APIs like TemporaryResources "internal" in the javadocs, i.e., that we reserve the right to suddenly change them and they are just public so that the sub-packages in Tika can use them. In Lucene we added @lucene.internal javadoc tag for this (it expands into des

[jira] [Created] (TIKA-703) Drop deprecated methods/classes/interfaces

2011-09-01 Thread Jukka Zitting (JIRA)
Drop deprecated methods/classes/interfaces -- Key: TIKA-703 URL: https://issues.apache.org/jira/browse/TIKA-703 Project: Tika Issue Type: Improvement Reporter: Jukka Zitting Pri

Re: svn commit: r1163970 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-core/src/main/java/org/apache/tika/io/ tika-core/src/main/java/org/apache/tika/parser/ tika-core/src/

2011-09-01 Thread Jukka Zitting
Hi, On Thu, Sep 1, 2011 at 12:23 PM, Michael McCandless wrote: > Can we just remove (not deprecate) TemporaryFiles...? > (We are not at 1.0 release yet). Yes, I think we should do that. I didn't want to do this in the scope of TIKA-701 so I rather left a deprecated backwards-compatible class th

Resource management patterns (Was: Tika leaves files open)

2011-09-01 Thread Jukka Zitting
Hi, [update subject, move to dev@] On Thu, Sep 1, 2011 at 12:41 PM, Uwe Schindler wrote: > With our internal Lucene IOUtils it's even simpler, see javadocs :-) Yep, Lucene's version is certainly better. > It's just a few lines more code. It's still at least 7 lines of wrapper code compared t

Re: svn commit: r1163970 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-core/src/main/java/org/apache/tika/io/ tika-core/src/main/java/org/apache/tika/parser/ tika-core/src/

2011-09-01 Thread Michael McCandless
Can we just remove (not deprecate) TemporaryFiles...? (We are not at 1.0 release yet). Mike McCandless http://blog.mikemccandless.com On Thu, Sep 1, 2011 at 5:38 AM, wrote: > Author: jukka > Date: Thu Sep  1 09:38:04 2011 > New Revision: 1163970 > > URL: http://svn.apache.org/viewvc?rev=11639

[jira] [Created] (TIKA-702) Cannot compile Tika with Java 7 (ImageMetadataExtractor.java)

2011-09-01 Thread Michael McCandless (JIRA)
Cannot compile Tika with Java 7 (ImageMetadataExtractor.java) - Key: TIKA-702 URL: https://issues.apache.org/jira/browse/TIKA-702 Project: Tika Issue Type: Bug Reporter:

Re: svn commit: r1163336 - in /tika/trunk/tika-parsers/src/test: java/org/apache/tika/parser/rtf/ resources/test-documents/

2011-09-01 Thread Michael McCandless
On Tue, Aug 30, 2011 at 5:35 PM, Jukka Zitting wrote: > Hi, > > On Tue, Aug 30, 2011 at 9:07 PM,   wrote: >> +        assertContains("zażółć gęślą jaźń", content); >> +        assertContains("ZAŻÓŁĆ GĘŚLĄ JAŹŃ", content); > > I think it would be best if we used \u escapes for
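
For readers unfamiliar with the suggestion, here is a minimal sketch of what \u escapes could look like for the two test strings quoted above. It is not the code that was committed: the class name UnicodeEscapeExample is made up for illustration, the assertContains method is a hypothetical stand-in for the test helper of the same name, and the strings are assumed to be the Polish pangram "zażółć gęślą jaźń" and its upper-case form.

```java
import java.util.Locale;

public class UnicodeEscapeExample {
    // Polish pangram in lower case, spelled with Unicode escapes so that the
    // source file stays pure ASCII regardless of the compiler's encoding.
    static final String LOWER =
            "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144";

    // The same pangram in upper case, again using only ASCII in the source.
    static final String UPPER =
            "ZA\u017b\u00d3\u0141\u0106 G\u0118\u015aL\u0104 JA\u0179\u0143";

    // Hypothetical stand-in for the assertContains helper in the quoted diff.
    static void assertContains(String needle, String haystack) {
        if (!haystack.contains(needle)) {
            throw new AssertionError("'" + needle + "' not found");
        }
    }

    public static void main(String[] args) {
        // Assumed sample content; a real test would use the parser's output.
        String content = "prefix " + LOWER + " / " + UPPER + " suffix";
        assertContains(LOWER, content);
        assertContains(UPPER, content);
        // Sanity check that the two constants are case variants of each other.
        System.out.println(LOWER.toUpperCase(Locale.ROOT).equals(UPPER));
    }
}
```

Because the escapes are resolved by the compiler, the strings survive mail clients, diffs and editors that mishandle the file's encoding, which is the kind of damage that makes non-ASCII literals fragile in test sources.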