[ 
https://issues.apache.org/jira/browse/TIKA-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno updated TIKA-3238:
------------------------
          Flags: Important
    Description: 
Some RTF files created in LibreOffice Writer do not seem to be parsed 
correctly. The RTFParser appears to extract only a portion of the text (e.g. 
the title).

However, if the same file is opened in Microsoft Word on Windows and saved 
again as an RTF file, the parser is able to extract the full text.

An example file is attached to the ticket.

 

The following snippet shows how the parser is set up and called:
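{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.rtf.RTFParser;

// Declared in the original report but not used by this snippet.
private static final Set<MediaType> EXCLUDES =
        Collections.singleton(MediaType.application("x-tika-ooxml"));

// Restrict the auto-detecting parser to the RTF parser only.
private static final Parser[] PARSERS = new Parser[] {
        new RTFParser()
};

private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);

private static final Tika TIKA_INSTANCE =
        new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

// parseToString declares IOException and TikaException.
public String parse(InputStream content) throws IOException, TikaException {
    return TIKA_INSTANCE.parseToString(content);
}
{code}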
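
To compare the LibreOffice file with the copy re-saved from Word before it reaches the parser, a rough structural check may help. The sketch below is a hypothetical helper, not part of Tika. It only counts RTF group nesting ('{' and '}'), treating a backslash as escaping the next character. It does not identify the cause of the truncated output, and it ignores raw binary data (\bin), which could contain stray braces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper (not a Tika class): reports RTF group nesting.
final class RtfGroupScan {
    final int maxDepth;      // deepest group nesting seen
    final int finalDepth;    // 0 means every '{' was closed
    final boolean underflow; // true if a '}' appeared with no open group

    private RtfGroupScan(int maxDepth, int finalDepth, boolean underflow) {
        this.maxDepth = maxDepth;
        this.finalDepth = finalDepth;
        this.underflow = underflow;
    }

    static RtfGroupScan scan(String rtf) {
        int depth = 0;
        int max = 0;
        boolean underflow = false;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                // Skip the next character: it is either an escaped brace or
                // backslash, or the first character of a control word.
                i++;
            } else if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}') {
                if (depth == 0) {
                    underflow = true;
                } else {
                    depth--;
                }
            }
        }
        return new RtfGroupScan(max, depth, underflow);
    }

    boolean balanced() {
        return finalDepth == 0 && !underflow;
    }

    // Usage: java RtfGroupScan.java libreoffice.rtf word.rtf
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            // ISO-8859-1 maps each byte to one char, so no bytes are lost.
            String rtf = new String(Files.readAllBytes(Paths.get(path)),
                    StandardCharsets.ISO_8859_1);
            RtfGroupScan r = scan(rtf);
            System.out.println(path + ": balanced=" + r.balanced()
                    + ", maxDepth=" + r.maxDepth);
        }
    }
}
```

Running it on both files and comparing the reported depth is only a starting point for narrowing down where the two documents differ structurally.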

  was:
Some RTF files created in LibreOffice Writer do not seem to be parsed 
correctly. The RTFParser appears to extract only a portion of the text (e.g. 
the title).

However, if the same file is opened in Microsoft Word on Windows and saved 
again as an RTF file, the parser is able to extract the full text.

An example is attached to the ticket.
{code:java}
private static final Set<MediaType> EXCLUDES = 
Collections.singleton(MediaType.application("x-tika-ooxml"));

private static final Parser PARSERS[] = new Parser[] {
        new RTFParser()
};

private static final AutoDetectParser PARSER_INSTANCE = new 
AutoDetectParser(PARSERS);

private static final Tika TIKA_INSTANCE = new 
Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
{code}

         Labels: Parser parse-tika  (was: )

> RTFParser fails to generate full content of an RTF file that has been 
> generated in libreoffice
> ----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3238
>                 URL: https://issues.apache.org/jira/browse/TIKA-3238
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Bruno
>            Priority: Minor
>              Labels: Parser, parse-tika
>         Attachments: file-sample_1MB (1).rtf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
