[jira] [Commented] (TIKA-747) Ogg Vorbis and FLAC Parsers

2012-03-31 Thread Jan Høydahl (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243602#comment-13243602 ] Jan Høydahl commented on TIKA-747: -- Why is this issue listed as included in the Tik

[jira] [Commented] (TIKA-582) Lithuanian language identification

2011-10-27 Thread Žygimantas Medelis (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136878#comment-13136878 ] Žygimantas Medelis commented on TIKA-582: - > relax the test just for Lithua

[jira] [Commented] (TIKA-638) Language recognition - Failed trying to load language profile for language lt . Error: java.lang.IllegalArgumentException: Unable to add an ngram of incorrect length: 5 !

2012-01-18 Thread Jan Høydahl (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188452#comment-13188452 ] Jan Høydahl commented on TIKA-638: -- Ok, will someone close this bug as "not

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

2012-02-06 Thread Jan Høydahl (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201321#comment-13201321 ] Jan Høydahl commented on TIKA-856: -- The command to create a profile is: {code} java

[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

2012-02-10 Thread Jan Høydahl (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205370#comment-13205370 ] Jan Høydahl commented on TIKA-612: -- So how do we set a PDFBox option via ParseContex

[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

2012-02-10 Thread Jan Høydahl (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205394#comment-13205394 ] Jan Høydahl commented on TIKA-612: -- Hmm, that's kind of awkward to use from e.g.

[jira] [Commented] (TIKA-887) Tika fails to parse some MP3 tags correctly and produces null characters in value

2012-03-29 Thread Jens Hübel (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241152#comment-13241152 ] Jens Hübel commented on TIKA-887: - However there is a difference. It is no longer a

[jira] [Commented] (TIKA-697) Tika reports the content type of AR archives as "text/plain"

2011-11-07 Thread PNS (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299 ] PNS commented on TIKA-697: -- Detection of Unix AR archive types (see http://en.wikipedia.org/

[jira] [Commented] (TIKA-697) Tika reports the content type of AR archives as "text/plain"

2011-11-07 Thread PNS (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145494#comment-13145494 ] PNS commented on TIKA-697: -- Even better, but maybe we need to add "*.ar" as a glo

[jira] [Commented] (TIKA-593) Tika network server

2012-03-29 Thread Sergey Beryozkin (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241316#comment-13241316 ] Sergey Beryozkin commented on TIKA-593: --- Max, > ... That exception mapper w

[jira] [Commented] (TIKA-593) Tika network server

2012-03-29 Thread Sergey Beryozkin (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241331#comment-13241331 ] Sergey Beryozkin commented on TIKA-593: --- > I think that something is wrong in

[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Uwe Schindler (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242378#comment-13242378 ] Uwe Schindler commented on TIKA-888: Thanks Chris, we are already planning to re

[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Uwe Schindler (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242397#comment-13242397 ] Uwe Schindler commented on TIKA-888: {quote} Couldn't you take the Parser o

[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Uwe Schindler (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242399#comment-13242399 ] Uwe Schindler commented on TIKA-888: Another good idea would be to allow remova

[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242408#comment-13242408 ] Jukka Zitting commented on TIKA-888: bq. The question is: The parser is still liste

[jira] [Commented] (TIKA-888) NetCDF parser uses Java 6 JAR file and test/compilation fails with Java 1.5, although TIKA is Java 1.5

2012-03-30 Thread Uwe Schindler (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242430#comment-13242430 ] Uwe Schindler commented on TIKA-888: Maybe it does not produce ClassNotFound. Fo

[jira] [Commented] (TIKA-792) NoSuchMethodException "CTMarkupImpl.(org.apache.xmlbeans.SchemaType, boolean)" processing a OOXML document

2012-04-03 Thread Marek Slama (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245404#comment-13245404 ] Marek Slama commented on TIKA-792: -- I do not see this problem now as we upgrade

[jira] [Commented] (TIKA-700) Upgrade to POI 3.8 as available

2012-04-03 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245435#comment-13245435 ] Nick Burch commented on TIKA-700: - Upgraded to POI 3.8 Final in r130

[jira] [Commented] (TIKA-792) NoSuchMethodException "CTMarkupImpl.(org.apache.xmlbeans.SchemaType, boolean)" processing a OOXML document

2012-04-03 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245436#comment-13245436 ] Nick Burch commented on TIKA-792: - Thanks for the feedback Marek. As of r1309005 we

[jira] [Commented] (TIKA-593) Tika network server

2012-04-04 Thread Sergey Beryozkin (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246592#comment-13246592 ] Sergey Beryozkin commented on TIKA-593: --- Max, Chris, thanks!

[jira] [Commented] (TIKA-593) Tika network server

2012-04-04 Thread Markus Jelsma (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246615#comment-13246615 ] Markus Jelsma commented on TIKA-593: Great work! > Tika

[jira] [Commented] (TIKA-593) Tika network server

2012-04-05 Thread Maxim Valyanskiy (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247089#comment-13247089 ] Maxim Valyanskiy commented on TIKA-593: --- I updated documentation in

[jira] [Commented] (TIKA-890) Improve detection of Android Packages (APK)

2012-04-05 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247211#comment-13247211 ] Nick Burch commented on TIKA-890: - As of r1309854, APK files (along with WAR and EAR, w

[jira] [Commented] (TIKA-893) Tika-server bundle includes wrong META-INF/services/org.apache.tika.parser.Parser, doesn't work

2012-04-16 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254617#comment-13254617 ] Nick Burch commented on TIKA-893: - See my comment on TIKA-747 too, it affects tika-app

[jira] [Commented] (TIKA-897) UTF-8 encoded XML is detected as text/plain because of UTF-8 BOM

2012-04-20 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258241#comment-13258241 ] Nick Burch commented on TIKA-897: - We had support for detecting XML files that are A

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-09-28 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116331#comment-13116331 ] Michael McCandless commented on TIKA-733: - Hmm, it makes me a little nervous

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-09-28 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116467#comment-13116467 ] Nick Burch commented on TIKA-734: - Please re-test with a newer version of Tika (ide

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-09-28 Thread Anirban Mitra (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116488#comment-13116488 ] Anirban Mitra commented on TIKA-734: Thank you very much Nick. How long I need to

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-09-28 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116517#comment-13116517 ] Nick Burch commented on TIKA-734: - The 0.10 release vote is open for another few hours.

[jira] [Commented] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-29 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117311#comment-13117311 ] Nick Burch commented on TIKA-727: - Thanks for the patch, applied with a few tweak

[jira] [Commented] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-29 Thread Pablo Queixalos (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117339#comment-13117339 ] Pablo Queixalos commented on TIKA-727: -- I just realized that the concerned PPT fil

[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

2011-10-01 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118782#comment-13118782 ] Nick Burch commented on TIKA-735: - I think this is a Tika CLI issue, rather than a Pa

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-01 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118783#comment-13118783 ] Nick Burch commented on TIKA-736: - It's probably not worth putting too much work

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118825#comment-13118825 ] Michael McCandless commented on TIKA-736: - OK that makes sense; hopefully it

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-01 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118828#comment-13118828 ] Nick Burch commented on TIKA-736: - Looking at our current parser, we don't touch t

[jira] [Commented] (TIKA-737) Use (Incubating) ODFToolkit to improve ODF file format processing

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118834#comment-13118834 ] Michael McCandless commented on TIKA-737: - +1, sounds great!

[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118833#comment-13118833 ] Michael McCandless commented on TIKA-735: - Ahhh, I see. So it looks like

[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

2011-10-01 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118861#comment-13118861 ] Jukka Zitting commented on TIKA-735: A parser should always produce valid XHTML ou

[jira] [Commented] (TIKA-711) Word parser doesn't extract optional hyphen correctly

2011-10-01 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118905#comment-13118905 ] Michael McCandless commented on TIKA-711: - Curiously, if I use P

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119018#comment-13119018 ] Nick Burch commented on TIKA-721: - I'd suggest we check for invalid UTF-16 seque

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119035#comment-13119035 ] Michael McCandless commented on TIKA-721: - bq. I'd suggest we check for in

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119038#comment-13119038 ] Robert Muir commented on TIKA-721: -- {quote} Finally, for the valid code points, I c

[jira] [Commented] (TIKA-721) UTF16-LE not detected

2011-10-02 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119044#comment-13119044 ] Michael McCandless commented on TIKA-721: - {quote} bq. Finally, for the valid

[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

2011-10-02 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119060#comment-13119060 ] Robert Muir commented on TIKA-713: -- I created PDFBOX-1127 for this with some screens

[jira] [Commented] (TIKA-717) Comment/annotation is sometimes not extracted

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119237#comment-13119237 ] Michael McCandless commented on TIKA-717: - RTF and PPT are now extracting comm

[jira] [Commented] (TIKA-717) Comment/annotation is sometimes not extracted

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119238#comment-13119238 ] Michael McCandless commented on TIKA-717: - OK I opened TIKA-738 for PDF annotat

[jira] [Commented] (TIKA-738) Tika fails to extract text from PDF annotations

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119269#comment-13119269 ] Michael McCandless commented on TIKA-738: - I moved the failing (but ignored)

[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

2011-10-03 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119371#comment-13119371 ] Robert Muir commented on TIKA-713: -- This is now fixed in pdfbox's trunk.

[jira] [Commented] (TIKA-722) Arabic PDF doesn't extract correctly

2011-10-03 Thread Robert Muir (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119403#comment-13119403 ] Robert Muir commented on TIKA-722: -- Actually in this case the original TTF font (AxtM

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-03 Thread Jeremy Anderson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] Jeremy Anderson commented on TIKA-733: -- The problem is also present in the older

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119462#comment-13119462 ] Michael McCandless commented on TIKA-733: - Actually, I think we should just co

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119588#comment-13119588 ] Nick Burch commented on TIKA-739: - What version of Tika are you using? And if it isn

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread John Bartak (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119614#comment-13119614 ] John Bartak commented on TIKA-739: -- Not entirely sure what version I'm using.

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119639#comment-13119639 ] Nick Burch commented on TIKA-739: - Someone may chime in with the exact answer, in the

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread John Bartak (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119709#comment-13119709 ] John Bartak commented on TIKA-739: -- It's 0.8 :-( . Not sure how easy it will be

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread John Bartak (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119730#comment-13119730 ] John Bartak commented on TIKA-739: -- Just downloaded 0.10 and tried extracting the fil

[jira] [Commented] (TIKA-739) For certain DWG files, the Tika content parser outputs garbage

2011-10-03 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119752#comment-13119752 ] Michael McCandless commented on TIKA-739: - I opened SOLR-2807 to upgrade Sol

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-03 Thread Jeremy Anderson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119815#comment-13119815 ] Jeremy Anderson commented on TIKA-733: -- Cool beans!! Thanks for your attention t

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

2011-10-04 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1311#comment-1311 ] Michael McCandless commented on TIKA-733: - Thank you Jeremy! Keep the pat

[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-10-04 Thread Mark Kerzner (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120648#comment-13120648 ] Mark Kerzner commented on TIKA-623: --- Hi, everybody, I have forked Richard Johnson

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121032#comment-13121032 ] Jukka Zitting commented on TIKA-734: Tika 0.10 is now available. If the problem s

[jira] [Commented] (TIKA-741) "Zip bomb" (XML nesting) detection is too strict

2011-10-05 Thread Erik Hetzner (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121064#comment-13121064 ] Erik Hetzner commented on TIKA-741: --- 100 levels should probably do the trick. Th

[jira] [Commented] (TIKA-636) Taking very high heap space while parsing docx - Resulting in OOM in tha app

2011-10-05 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121157#comment-13121157 ] Jukka Zitting commented on TIKA-636: Do you still see this problem with Tika 0.10

[jira] [Commented] (TIKA-713) Tika can not parse all of the persian pdf files

2011-10-05 Thread Ahmad Ajiloo (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121376#comment-13121376 ] Ahmad Ajiloo commented on TIKA-713: --- Thanks a lot > Tika can no

[jira] [Commented] (TIKA-746) Support custom mime types

2011-10-06 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122257#comment-13122257 ] Nick Burch commented on TIKA-746: - First pass at solving this committed in r117

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-07 Thread Anirban Mitra (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122614#comment-13122614 ] Anirban Mitra commented on TIKA-734: Thanks. I will let you know soon. -- Ani

[jira] [Commented] (TIKA-381) HtmlParser should strip linefeeds out of links

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122617#comment-13122617 ] Jukka Zitting commented on TIKA-381: The relevant TagSoup scanner state transitions

[jira] [Commented] (TIKA-272) Expose characters offsets information while parsing text-based inputs.

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122625#comment-13122625 ] Jukka Zitting commented on TIKA-272: See PDFBOX-577 for some related work in PD

[jira] [Commented] (TIKA-513) Support of Deja Vu (DjVu) format

2011-10-07 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122639#comment-13122639 ] Jukka Zitting commented on TIKA-513: Is there a DjVu parser we could

[jira] [Commented] (TIKA-682) Creative Suite formats are not supported

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122934#comment-13122934 ] Nick Burch commented on TIKA-682: - ImageParser currently claims to support image/x

[jira] [Commented] (TIKA-682) Creative Suite formats are not supported

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123181#comment-13123181 ] Nick Burch commented on TIKA-682: - I've added a basic metadata extracting

[jira] [Commented] (TIKA-749) Avoid using POI's LittleEndian in non-POI parsers

2011-10-07 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123200#comment-13123200 ] Nick Burch commented on TIKA-749: - Done in r1180243. > Avoid usin

[jira] [Commented] (TIKA-748) RTF parser fails to extract the body

2011-10-09 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123804#comment-13123804 ] Michael McCandless commented on TIKA-748: - Hmm I think this doc is slig

[jira] [Commented] (TIKA-748) RTF parser fails to extract the body

2011-10-10 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124355#comment-13124355 ] Michael McCandless commented on TIKA-748: - Thanks Andrzej!

[jira] [Commented] (TIKA-748) RTF parser fails to extract the body

2011-10-10 Thread Andrzej Bialecki (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124425#comment-13124425 ] Andrzej Bialecki commented on TIKA-748: Thanks Michael!

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-11 Thread Anirban Mitra (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125242#comment-13125242 ] Anirban Mitra commented on TIKA-734: Memory issue is gone now but we are seeing a m

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-11 Thread Anirban Mitra (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125305#comment-13125305 ] Anirban Mitra commented on TIKA-734: Hello, Memory issue is gone with xlsx file

[jira] [Commented] (TIKA-734) Out of memory exception with Xlsx file less than 5 MB

2011-10-11 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125385#comment-13125385 ] Nick Burch commented on TIKA-734: - If you're using the AutoDetectParser, then you

[jira] [Commented] (TIKA-93) OCR support

2011-10-12 Thread Enrico Stahn (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125745#comment-13125745 ] Enrico Stahn commented on TIKA-93: -- You could use [docsplit|http://documentcloud.github

[jira] [Commented] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-10-13 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126938#comment-13126938 ] Jukka Zitting commented on TIKA-657: In revision 1183109 I increased the default

[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-14 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127758#comment-13127758 ] Michael McCandless commented on TIKA-753: - I noticed that when we parse an embe

[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-15 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128135#comment-13128135 ] Nick Burch commented on TIKA-753: - Patch looks fine to me I've added a static co

[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-15 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128145#comment-13128145 ] Michael McCandless commented on TIKA-753: - Awesome, thanks Nick! I'll a

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2011-10-15 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128227#comment-13128227 ] Michael McCandless commented on TIKA-712: - I tested the cur

[jira] [Commented] (TIKA-753) Improve performance when parsing embedded Office docs

2011-10-17 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128924#comment-13128924 ] Michael McCandless commented on TIKA-753: - OK I committed this; I'll leav

[jira] [Commented] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129637#comment-13129637 ] Jukka Zitting commented on TIKA-754: I don't think it's necessarily a goo

[jira] [Commented] (TIKA-755) Add getDetector() method to TikaConfig

2011-10-18 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129744#comment-13129744 ] Nick Burch commented on TIKA-755: - In r1185658 I've updated TikaConfig to

[jira] [Commented] (TIKA-738) Tika fails to extract text from PDF annotations

2011-10-18 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129889#comment-13129889 ] Michael McCandless commented on TIKA-738: - I opened PDFBOX-1143 to imp

[jira] [Commented] (TIKA-756) XMP output from Tika CLI

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129957#comment-13129957 ] Jukka Zitting commented on TIKA-756: Rough first version committed in revision 118

[jira] [Commented] (TIKA-755) Add getDetector() method to TikaConfig

2011-10-18 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129971#comment-13129971 ] Jukka Zitting commented on TIKA-755: Hmm, I looked at the interaction between Tika

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

2011-10-19 Thread Michael McCandless (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130514#comment-13130514 ] Michael McCandless commented on TIKA-724: - I dug into this one some more. Hand

[jira] [Commented] (TIKA-760) NPE XHTMLContentHandler in characters Method

2011-10-24 Thread Pablo Queixalos (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133880#comment-13133880 ] Pablo Queixalos commented on TIKA-760: -- Concerning the HSLFExtractor, this is alr

[jira] [Commented] (TIKA-760) NPE XHTMLContentHandler in characters Method

2011-10-24 Thread Torsten Krah (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133898#comment-13133898 ] Torsten Krah commented on TIKA-760: --- Yeah, but there are more calls to this method w

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-24 Thread Uwe Schindler (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133995#comment-13133995 ] Uwe Schindler commented on TIKA-736: The current ODF parser is very lightweight

[jira] [Commented] (TIKA-736) OpenOffice parser: master footer text isn't extracted

2011-10-24 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134003#comment-13134003 ] Nick Burch commented on TIKA-736: - Uwe - this might be best discussed on the "

[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134059#comment-13134059 ] Nick Burch commented on TIKA-761: - The version is maintained in the pom, and Maven

[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Ingo Renner (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134068#comment-13134068 ] Ingo Renner commented on TIKA-761: -- @Nick, the version number in the pom was quite obv

[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Ingo Renner (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134105#comment-13134105 ] Ingo Renner commented on TIKA-761: -- BTW, how about using -v (lowercase) to get the ver

[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Nick Burch (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134108#comment-13134108 ] Nick Burch commented on TIKA-761: - I think -v for verbose is more common, so I wouldn&

[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-24 Thread Jukka Zitting (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134113#comment-13134113 ] Jukka Zitting commented on TIKA-761: +1 Looks good. As a possible improvement, as
