[
https://issues.apache.org/jira/browse/TIKA-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850936#comment-16850936
]
Tim Allison commented on TIKA-2883:
-----------------------------------
I have a local fix that works for all four issues. I'll push that once I get a
clean local build.
One item remains for future improvement: we're pretty much guessing where the
header ends based on whether we see {{par}} or other text-y kinds of things.
According to one RTF spec, this is what a header can look like, with ? marking
optional items.
{noformat}
<header> \rtf <charset> <deffont> \deff? <fonttbl> <filetbl>?
<colortbl>? <stylesheet>? <listtables>? <revtbl>? <rsidtable>? <generator>?
{noformat}
The obnoxious part is that there can be stuff in between those items, and I'm
hesitant to trust that real-world RTFs follow the spec and actually keep that
order, etc...
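For illustration only, here is a rough sketch of what an order-agnostic header
scan could look like. It is not the local fix mentioned above and not Tika's
parser code. The class name RtfHeaderSketch, the HEADER_TABLES and BODY_MARKERS
sets, and the rule "first body marker or first plain text at the top level ends
the header" are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Tika's RTF parser.
class RtfHeaderSketch {
    // Control words that open the header tables named in the grammar above.
    static final Set<String> HEADER_TABLES = new HashSet<>(Arrays.asList(
            "fonttbl", "filetbl", "colortbl", "stylesheet", "listtable",
            "listoverridetable", "revtbl", "rsidtbl", "generator"));
    // Assumed heuristic: these control words mean body content has started.
    static final Set<String> BODY_MARKERS =
            new HashSet<>(Arrays.asList("par", "pard", "sectd"));

    final List<String> tablesSeen = new ArrayList<>(); // in order encountered
    int headerEnd = -1;                                // -1 if never found

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static RtfHeaderSketch scan(String rtf) {
        RtfHeaderSketch r = new RtfHeaderSketch();
        int n = rtf.length();
        int depth = 0;              // current group nesting depth
        int tableDepth = -1;        // depth of the header table we are inside
        boolean groupStart = false; // true right after '{' (and an optional \*)
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{') {
                depth++;
                groupStart = true;
                i++;
            } else if (c == '}') {
                if (depth == tableDepth) {
                    tableDepth = -1; // the header table group just closed
                }
                depth--;
                groupStart = false;
                i++;
            } else if (c == '\\') {
                int start = i++;
                if (i < n && isAsciiLetter(rtf.charAt(i))) {
                    int w = i;
                    while (i < n && isAsciiLetter(rtf.charAt(i))) i++;
                    String word = rtf.substring(w, i);
                    // Optional numeric parameter, then one delimiter space.
                    if (i + 1 < n && rtf.charAt(i) == '-'
                            && Character.isDigit(rtf.charAt(i + 1))) i++;
                    while (i < n && Character.isDigit(rtf.charAt(i))) i++;
                    if (i < n && rtf.charAt(i) == ' ') i++;
                    if (tableDepth == -1) {
                        if (groupStart && depth == 2
                                && HEADER_TABLES.contains(word)) {
                            r.tablesSeen.add(word); // accepted in any order
                            tableDepth = depth;
                        } else if (BODY_MARKERS.contains(word)) {
                            r.headerEnd = start;
                            return r;
                        }
                    }
                    groupStart = false;
                } else if (i < n && rtf.charAt(i) == '*') {
                    i++; // "\*" prefix: the destination word follows
                } else {
                    // Other control symbol, e.g. \'hh or an escaped brace.
                    boolean hex = i < n && rtf.charAt(i) == '\'';
                    i = Math.min(n, i + (hex ? 3 : 1));
                    groupStart = false;
                }
            } else {
                if (c != '\r' && c != '\n') {
                    if (tableDepth == -1 && depth == 1) {
                        r.headerEnd = i; // plain text at top level: body
                        return r;
                    }
                    groupStart = false;
                }
                i++;
            }
        }
        return r;
    }
}
```

Calling RtfHeaderSketch.scan(rtfString) fills tablesSeen with whichever header
tables showed up, in whatever order, and sets headerEnd to the offset of the
first thing that looks like body content. It is still a guess, just one that
does not depend on the tables arriving in spec order.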
> Text not extracted from RTF files
> ---------------------------------
>
> Key: TIKA-2883
> URL: https://issues.apache.org/jira/browse/TIKA-2883
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20, 1.19.1, 1.21
> Reporter: Luis Filipe Nassif
> Assignee: Tim Allison
> Priority: Major
> Attachments: Message (5).rtf
>
>
> I have a number of RTF files (extracted from PST email bodies) whose text is
> not extracted currently. Sample file attached. [[email protected]], do you
> have any idea what is going on?