[ 
https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441
 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:06 PM:
---------------------------------------------------------------

(Sorry, I can't seem to get the post to maintain my newline characters :( )

The problem is also present in the older 0.9 release.


Looking at the document as you suggested, I can see that it is
corrupt/malformed in the sense that it contains more closing braces '}' than
opening braces '{'.
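
For anyone who wants to check their own files for the same imbalance, here is a minimal sketch of a brace counter. It is not part of Tika or of the attached patch; the class and method names are made up for illustration. It assumes that a backslash escapes the next character (so \{ and \} are not counted), and it does not handle raw binary data embedded with \bin.

```java
class RtfBraceBalance {
    /**
     * Returns (number of unescaped '{') minus (number of unescaped '}').
     * A negative result means the text has more closing than opening braces.
     */
    static int braceBalance(String rtf) {
        int balance = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i++; // skip the character after a backslash, e.g. \{ \} or \\
            } else if (c == '{') {
                balance++;
            } else if (c == '}') {
                balance--;
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        String wellFormed = "{\\rtf1 text\\par}";
        String malformed = "{\\rtf1 text\\par}\n 0\\li1800\\par\n}";
        System.out.println("well-formed balance: " + braceBalance(wellFormed));
        System.out.println("malformed balance:   " + braceBalance(malformed));
    }
}
```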


However, with that said, the text contained within the document still appears
to be extractable using the patch I submitted, which skips restoring the group
state once the list of group states is empty.


My knowledge of the RTF format is rather limited, but is there perhaps a better
compromise that would allow the parser to return the text it is able to get and
maybe log a warning when malformed RTF is encountered?
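
To make the idea concrete, here is a rough, standalone sketch of that compromise: ignore a '}' that has no matching '{', count it, and log a warning instead of throwing. This is not the actual TextExtractor code. The class LenientGroupStack, the String stand-in for the parser's group state, and the use of java.util.logging are all assumptions made for illustration.

```java
import java.util.LinkedList;
import java.util.logging.Logger;

class LenientGroupStack {
    private static final Logger LOG =
            Logger.getLogger(LenientGroupStack.class.getName());

    // Saved states of the enclosing groups. A String stands in for the
    // parser's real group-state object (an assumption for this sketch).
    private final LinkedList<String> groupStates = new LinkedList<>();
    private String current = "root";
    private int unmatchedCloses = 0;

    /** Called on '{': remember the current state and enter a new group. */
    void processGroupStart(String newState) {
        groupStates.add(current);
        current = newState;
    }

    /** Called on '}': restore the enclosing state, or warn if there is none. */
    void processGroupEnd() {
        if (groupStates.isEmpty()) {
            // Without this guard, removeLast() throws NoSuchElementException.
            unmatchedCloses++;
            LOG.warning("Malformed RTF: '}' without a matching '{'; ignoring it");
            return;
        }
        current = groupStates.removeLast();
    }

    String getCurrent() {
        return current;
    }

    int getUnmatchedCloses() {
        return unmatchedCloses;
    }

    public static void main(String[] args) {
        LenientGroupStack stack = new LenientGroupStack();
        stack.processGroupStart("body");
        stack.processGroupEnd(); // matches the '{' above
        stack.processGroupEnd(); // extra '}': logged and ignored
        System.out.println("unmatched closes: " + stack.getUnmatchedCloses());
    }
}
```

Keeping a count of the unmatched closes would let a caller decide afterwards whether to trust the extracted text or flag the document for review.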



About 20 files in my load set have hit this failure. I haven't had time to
investigate all of them yet to see whether they all fail because of the same
brace mismatch and, for the corrupt ones, to determine how much of the
extractable text is affected by the fix I submitted.



Note that both WordPad and MS Word are able to open these files without
issue, though that's to be expected. I suspect that they also just ignore
the final block in these cases. In fact, after opening the failed document and
resaving it in WordPad, the final partial block does indeed get truncated
from the file.



Looking closer at the file in a text editor, the culprit final extra block
appears to be a partial replication of the final valid ending block in the
file. Perhaps an appropriate fix for auto-handling these partially
corrupted RTFs is to:

* detect whether the file has more ending blocks than starting blocks, and
  when it does,
* check whether the final one is a partial replication of the prior one,
* and if so, just ignore the final one.
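
A rough sketch of how those steps might look follows. Everything in it is hypothetical: the class and method names are invented, and "partial replication" is interpreted loosely as "every non-empty line of the trailing text, other than a lone '}', also occurs somewhere inside the valid body". A real fix inside the parser would probably need a stricter definition.

```java
class TrailingBlockFilter {
    /**
     * Returns the index just past the '}' that closes the top-level group,
     * or -1 if no top-level group is ever closed.
     */
    static int endOfTopLevelGroup(String rtf) {
        int depth = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    return -1; // '}' before any '{'
                }
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
        }
        return -1;
    }

    /** True if every non-empty tail line (ignoring a lone '}') occurs in body. */
    static boolean isPartialReplica(String body, String tail) {
        boolean sawContent = false;
        for (String line : tail.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.equals("}")) {
                continue;
            }
            sawContent = true;
            if (!body.contains(trimmed)) {
                return false;
            }
        }
        return sawContent;
    }

    /** Drops the text after the top-level group if it looks like a replica. */
    static String dropReplicatedTail(String rtf) {
        int end = endOfTopLevelGroup(rtf);
        if (end < 0 || end == rtf.length()) {
            return rtf; // nothing trails the top-level group
        }
        String body = rtf.substring(0, end);
        String tail = rtf.substring(end);
        return isPartialReplica(body, tail) ? body : rtf;
    }
}
```

If the trailing text does not look like a copy of earlier content, the input is returned unchanged, so an unexpected kind of corruption is not silently discarded.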



Last lines of the corrupted file:
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                
                  
> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> ------------------------------------------------------------------
>
>                 Key: TIKA-733
>                 URL: https://issues.apache.org/jira/browse/TIKA-733
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Michael McCandless
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempts to perform a removeLast() on the 
> groupStates() list when the list is empty.  Added a check that skips this 
> logic when the list is empty, so the group state is not restored in that 
> case. Text extraction now completes without further downstream errors.
> Unable to include a sample file due to the sensitive nature of its contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
>       at java.util.LinkedList.remove(LinkedList.java:788)
>       at java.util.LinkedList.removeLast(LinkedList.java:144)
>       at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
>       at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
>       at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 45 more


        
