[ https://issues.apache.org/jira/browse/TIKA-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123804#comment-13123804 ]
Michael McCandless commented on TIKA-748:
-----------------------------------------

Hmm, I think this doc is slightly malformed: it contains \* (followed by \cs7) in the middle of a group, whereas \* is supposed to always come immediately after a group start ({).

This is causing Tika to ignore all text in the group. But I think we can be robust here and only ignore the group's text when we see \* right after {; otherwise, we just ignore the stray \* itself.

> RTF parser fails to extract the body
> ------------------------------------
>
>                 Key: TIKA-748
>                 URL: https://issues.apache.org/jira/browse/TIKA-748
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki
>            Assignee: Michael McCandless
>         Attachments: TIKA-748.patch, test.rtf
>
>
> Using tika-app I'm getting the following result of parsing the attached document:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="subject" content="tests"/>
> <meta name="Content-Length" content="2235"/>
> <meta name="comment" content="StarWriter"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.rtf.RTFParser"/>
> <meta name="Content-Type" content="application/rtf"/>
> <meta name="resourceName" content="test.rtf"/>
> <title>test rft document</title>
> </head>
> <body/></html>
> {noformat}
> The expected result would be a non-empty body containing the text "The quick brown fox jumps over the lazy dog".
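
The lenient handling of \* proposed in the comment above can be illustrated with a small, self-contained sketch. This is not Tika's RTFParser code (Tika is written in Java) and not the attached patch; it is a toy extractor in Python, and the names extract_text, lenient and SAMPLE are made up for illustration. It assumes a heavily simplified RTF grammar: groups, control words with an optional numeric parameter, and a few control symbols. With lenient=False, any \* marks the current group as ignorable, which loosely mirrors the behavior described in the comment; with lenient=True, \* only has that effect directly after {, and a stray \* is skipped.

```python
def extract_text(rtf: str, lenient: bool = True) -> str:
    """Toy RTF text extractor (illustrative only, not Tika's implementation)."""
    out = []
    ignore_stack = [False]   # per open group: is its text being ignored?
    at_group_start = False   # True only until the first token after "{"
    i, n = 0, len(rtf)
    while i < n:
        ch = rtf[i]
        if ch == "{":
            # A nested group inherits the ignore state of its parent.
            ignore_stack.append(ignore_stack[-1])
            at_group_start = True
            i += 1
        elif ch == "}":
            if len(ignore_stack) > 1:
                ignore_stack.pop()
            at_group_start = False
            i += 1
        elif ch == "\\":
            i += 1
            if i >= n:
                break
            if rtf[i].isascii() and rtf[i].isalpha():
                # Control word: letters, optional signed number, optional space.
                j = i
                while j < n and rtf[j].isascii() and rtf[j].isalpha():
                    j += 1
                if j + 1 < n and rtf[j] == "-" and rtf[j + 1].isdigit():
                    j += 1
                while j < n and rtf[j].isdigit():
                    j += 1
                if j < n and rtf[j] == " ":
                    j += 1  # a single space only delimits the control word
                i = j
            else:
                sym = rtf[i]
                i += 1
                if sym == "*":
                    # Strict mode: any \* hides the group's remaining text.
                    # Lenient mode: only a \* right after "{" does; a
                    # misplaced \* is dropped and the text is kept.
                    if at_group_start or not lenient:
                        ignore_stack[-1] = True
                elif sym == "'":
                    i += 2  # skip the two hex digits of \'hh (not decoded here)
                elif sym in "{}\\" and not ignore_stack[-1]:
                    out.append(sym)  # escaped literal character
            at_group_start = False
        elif ch in "\r\n":
            i += 1  # raw line breaks are not text in RTF
        else:
            if not ignore_stack[-1]:
                out.append(ch)
            at_group_start = False
            i += 1
    return "".join(out)


# Made-up sample: a well-formed ignorable group, then a group with a stray \*.
SAMPLE = r"{\rtf1 {\*\generator Hypothetical;}{\f0 The quick\*\cs7  brown fox}}"

print(repr(extract_text(SAMPLE, lenient=False)))  # text after the stray \* is lost
print(repr(extract_text(SAMPLE, lenient=True)))   # text after the stray \* is kept
```

In both modes the {\*\generator ...} group is dropped, since there \* directly follows {. The two modes differ only for the \* that appears before \cs7 in the middle of a group.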