[jira] [Commented] (TIKA-877) Embedded document not extracted (regression)

2012-03-20 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233380#comment-13233380 ] Michael McCandless commented on TIKA-877: - I'm also surprised this change broke

[jira] [Commented] (TIKA-870) Allow to use call parseToString with a additional parameter of MaxStringLength, so it can be changed per call

2012-03-07 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224643#comment-13224643 ] Michael McCandless commented on TIKA-870: - I think this makes sense.

[jira] [Commented] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-08 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165288#comment-13165288 ] Michael McCandless commented on TIKA-801: - Actually this isn't a problem of a

[jira] [Commented] (TIKA-801) ContentHandlerDecorator outputs invalid element

2011-12-05 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162781#comment-13162781 ] Michael McCandless commented on TIKA-801: - This is happening because the

[jira] [Commented] (TIKA-796) Tika breaks words of rotated text in PDF documents

2011-12-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160841#comment-13160841 ] Michael McCandless commented on TIKA-796: - This looks like a dup of TIKA-723? Note

[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

2011-11-26 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157602#comment-13157602 ] Michael McCandless commented on TIKA-723: - The sortByPosition option is tricky to

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

2011-11-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152041#comment-13152041 ] Michael McCandless commented on TIKA-782: - These changes look great! Cutover to

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

2011-11-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152119#comment-13152119 ] Michael McCandless commented on TIKA-782: - bq. I'll make the necessary changes.

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

2011-11-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152133#comment-13152133 ] Michael McCandless commented on TIKA-724: - Alas, no, I don't believe you can control

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

2011-11-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152246#comment-13152246 ] Michael McCandless commented on TIKA-782: - OK looks great Arjohn! Do you have an

[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

2011-11-11 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148695#comment-13148695 ] Michael McCandless commented on TIKA-612: - I agree, we probably shouldn't just

[jira] [Commented] (TIKA-714) Word art isn't extracted for various doc types

2011-11-06 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144979#comment-13144979 ] Michael McCandless commented on TIKA-714: - OK I dug into this one a bit. First off,

[jira] [Commented] (TIKA-529) IBM420 charset detection's isLamAlef is allocation-happy

2011-11-05 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144652#comment-13144652 ] Michael McCandless commented on TIKA-529: - This patch looks safe, and avoids crazy

[jira] [Commented] (TIKA-582) Lithuanian language identification

2011-10-27 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137056#comment-13137056 ] Michael McCandless commented on TIKA-582: - bq. Can you suggest what needs to be done

[jira] [Commented] (TIKA-582) Lithuanian language identification

2011-10-26 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135914#comment-13135914 ] Michael McCandless commented on TIKA-582: - Thanks Žygimantas! When testing Tika's

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-26 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136070#comment-13136070 ] Michael McCandless commented on TIKA-736: - bq. Can you also check that parsing

[jira] [Commented] (TIKA-738) Tika fails to extract text from PDF annotations

2011-10-18 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129889#comment-13129889 ] Michael McCandless commented on TIKA-738: - I opened PDFBOX-1143 to improve

[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128924#comment-13128924 ] Michael McCandless commented on TIKA-753: - OK I committed this; I'll leave it open

[jira] [Commented] (TIKA-748) RTF parser fails to extract the body

2011-10-10 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124355#comment-13124355 ] Michael McCandless commented on TIKA-748: - Thanks Andrzej! RTF

[jira] [Commented] (TIKA-748) RTF parser fails to extract the body

2011-10-09 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123804#comment-13123804 ] Michael McCandless commented on TIKA-748: - Hmm I think this doc is slightly

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-04 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1311#comment-1311 ] Michael McCandless commented on TIKA-733: - Thank you Jeremy! Keep the patches

[jira] [Commented] (TIKA-717) Comment/annotation is sometimes not extracted

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119237#comment-13119237 ] Michael McCandless commented on TIKA-717: - RTF and PPT are now extracting comments

[jira] [Commented] (TIKA-738) Tika fails to extract text from PDF annotations

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119269#comment-13119269 ] Michael McCandless commented on TIKA-738: - I moved the failing (but ignored) test

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119462#comment-13119462 ] Michael McCandless commented on TIKA-733: - Actually, I think we should just commit

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119035#comment-13119035 ] Michael McCandless commented on TIKA-721: - bq. I'd suggest we check for invalid

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ] Michael McCandless commented on TIKA-721: - {quote} bq. Finally, for the valid code

[jira] [Commented] (TIKA-737) Use (Incubating) ODFToolkit to improve ODF file format processing

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118834#comment-13118834 ] Michael McCandless commented on TIKA-737: - +1, sounds great! Use

[jira] [Commented] (TIKA-711) Word parser doesn't extract optional hyphen correctly

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118905#comment-13118905 ] Michael McCandless commented on TIKA-711: - Curiously, if I use POI's