[jira] Commented: (TIKA-608) IOException from tagsoup

2011-03-02 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001720#comment-13001720 ] Geoff Jarrad commented on TIKA-608: This is one of the known problems with tagsoup. It is

[jira] Updated: (TIKA-534) MetadataException: Unsupported component id error parsing jpg

2010-10-18 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoff Jarrad updated TIKA-534: Attachment: Billy Boys II.JPG While I'm complaining about the ImageMetadataExtractor, here is another err

[jira] Updated: (TIKA-534) MetadataException: Unsupported component id error parsing jpg

2010-10-18 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoff Jarrad updated TIKA-534: Attachment: dickson1.jpg seibicke.jpg Some images that browse okay but fail to parse.

[jira] Created: (TIKA-534) MetadataException: Unsupported component id error parsing jpg

2010-10-18 Thread Geoff Jarrad (JIRA)
MetadataException: Unsupported component id error parsing jpg Key: TIKA-534 URL: https://issues.apache.org/jira/browse/TIKA-534 Project: Tika Issue Type: Bug Components:

[jira] Commented: (TIKA-533) Mis-detection of zip-within-zip as application/vnd.apple.iwork, with no output by CLI app

2010-10-17 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921930#action_12921930 ] Geoff Jarrad commented on TIKA-533: No, using the ContainerAwareDetector doesn't seem to c

[jira] Updated: (TIKA-533) Mis-detection of zip-within-zip as application/vnd.apple.iwork, with no output by CLI app

2010-10-17 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoff Jarrad updated TIKA-533: Attachment: zip-within-zip.zip Simple zip file containing a zip of an HTML file, demonstrating the specif

[jira] Created: (TIKA-533) Mis-detection of zip-within-zip as application/vnd.apple.iwork, with no output by CLI app

2010-10-17 Thread Geoff Jarrad (JIRA)
Mis-detection of zip-within-zip as application/vnd.apple.iwork, with no output by CLI app Key: TIKA-533 URL: https://issues.apache.org/jira/browse/TIKA-533 Proj

[jira] Updated: (TIKA-526) OOXMLParser fails to extract text from within smart tags

2010-10-04 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoff Jarrad updated TIKA-526: Attachment: smarttag-snippet.docx An example .docx containing smart-tagged text that is not extracted by

[jira] Created: (TIKA-526) OOXMLParser fails to extract text from within smart tags

2010-10-04 Thread Geoff Jarrad (JIRA)
OOXMLParser fails to extract text from within smart tags Key: TIKA-526 URL: https://issues.apache.org/jira/browse/TIKA-526 Project: Tika Issue Type: Bug Components: parser

[jira] Created: (TIKA-525) Mismatched start and end elements in HtmlParser

2010-10-04 Thread Geoff Jarrad (JIRA)
Mismatched start and end elements in HtmlParser Key: TIKA-525 URL: https://issues.apache.org/jira/browse/TIKA-525 Project: Tika Issue Type: Bug Components: parser Affects Versions

[jira] Created: (TIKA-524) Unification of HTML output from Office, OOXML and Open Document parsers

2010-10-04 Thread Geoff Jarrad (JIRA)
Unification of HTML output from Office, OOXML and Open Document parsers Key: TIKA-524 URL: https://issues.apache.org/jira/browse/TIKA-524 Project: Tika Issue Type: Impro

[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

2010-09-30 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916738#action_12916738 ] Geoff Jarrad commented on TIKA-506: I didn't quite mean that Tika should output the same l

[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

2010-09-28 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915971#action_12915971 ] Geoff Jarrad commented on TIKA-506: Brilliant work, Nick! Thanks. The sample.doc runs thro

[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

2010-09-27 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoff Jarrad updated TIKA-506: Attachment: sample.doc The attached sample.doc Word document breaks the OfficeParser: java.util.NoSuchEl

[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

2010-09-27 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915557#action_12915557 ] Geoff Jarrad commented on TIKA-506: Good work on extracting more of .doc and .docx documen

[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

2010-09-26 Thread Geoff Jarrad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915114#action_12915114 ] Geoff Jarrad commented on TIKA-506: Tables in Word documents (.doc via OfficeParser) appea