[ https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Davies updated TIKA-2524:
-------------------------------
    Description: 
When we parse XPS files using the AutoDetectParser, we always get an empty string.
If we use DefaultDetector.detect(), it correctly detects the MediaType as
"application/vnd.ms-xpsdocument".

This page
https://tika.apache.org/1.16/formats.html
suggests, however, that XPS (application/vnd.ms-xpsdocument) is supported.

Our code:
    // Load the sample XPS file from the test resources.
    InputStream bis = new BufferedInputStream(
            this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps"));
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    TikaInputStream tikaStream = TikaInputStream.get(bis);
    parser.parse(tikaStream, handler, metadata);
    // parsedText is always the empty string for XPS input.
    String parsedText = handler.toString();

I will attach doc_xps.xps if I can.
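
To rule out a file without extractable text, the XPS package can be inspected independently of Tika. The following is a minimal sketch using only the JDK. It assumes that the file is a ZIP-based package and that page text is stored in the UnicodeString attribute of Glyphs elements inside ".fpage" parts. The names XpsTextProbe and extractText are hypothetical and not part of Tika, and this is not how Tika handles XPS.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of Tika.
public class XpsTextProbe {

    /**
     * Reads an XPS package as a ZIP stream and returns the UnicodeString
     * value of every Glyphs element found in ".fpage" parts, one per line.
     */
    public static String extractText(InputStream xps) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("Glyphs".equals(localName)) {
                    // Unprefixed attributes have an empty namespace URI.
                    String value = atts.getValue("", "UnicodeString");
                    if (value != null && !value.isEmpty()) {
                        text.append(value).append('\n');
                    }
                }
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().toLowerCase().endsWith(".fpage")) {
                    // Buffer the current entry so the XML parser cannot
                    // close the underlying ZIP stream.
                    byte[] page = zip.readAllBytes();
                    factory.newSAXParser()
                           .parse(new ByteArrayInputStream(page), handler);
                }
            }
        }
        return text.toString();
    }
}
```

If this returns text for doc_xps.xps while handler.toString() stays empty, the file itself contains text and the empty result comes from the parsing step rather than from the document.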

  was:
When we parse XPS files using the AutoParser we always get an empty string.
If we use DefaultDetector.detect() it correctly detects the MediaType as 
"application/vnd.ms-xpsdocument".

This page
https://tika.apache.org/1.16/formats.html
suggests that XPS (application/vnd.ms-xpsdocument) is supported however. 




> Apache Tika returns empty string when parsing text from XPS files
> -----------------------------------------------------------------
>
>                 Key: TIKA-2524
>                 URL: https://issues.apache.org/jira/browse/TIKA-2524
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Peter Davies
>              Labels: features
>
> When we parse XPS files using the AutoDetectParser, we always get an empty string.
> If we use DefaultDetector.detect(), it correctly detects the MediaType as
> "application/vnd.ms-xpsdocument".
> This page
> https://tika.apache.org/1.16/formats.html
> suggests, however, that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
>     InputStream bis = new BufferedInputStream(
>             this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps"));
>     Metadata metadata = new Metadata();
>     BodyContentHandler handler = new BodyContentHandler();
>     AutoDetectParser parser = new AutoDetectParser();
>     TikaInputStream tikaStream = TikaInputStream.get(bis);
>     parser.parse(tikaStream, handler, metadata);
>     String parsedText = handler.toString();
> I will attach doc_xps.xps if I can



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
