[ 
https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690971#comment-15690971
 ] 

Tim Allison commented on TIKA-2179:
-----------------------------------

How's this look:

{noformat}
0: cp:revision : 2
0: date : 2016-09-14T06:23:00Z
0: extended-properties:DocSecurity : 0
0: extended-properties:AppVersion : 14.0000
0: meta:word-count : 44
0: meta:paragraph-count : 1
0: dc:creator : Auti, Sneha (QAPM)
0: extended-properties:Company : Morgan Stanley
0: dcterms:created : 2016-09-14T06:16:00Z
0: meta:line-count : 2
0: dcterms:modified : 2016-09-14T06:23:00Z
0: Last-Modified : 2016-09-14T06:23:00Z
0: Last-Save-Date : 2016-09-14T06:23:00Z
0: meta:character-count : 257
0: meta:save-date : 2016-09-14T06:23:00Z
0: meta:character-count-with-spaces : 300
0: extended-properties:TotalTime : 7
0: modified : 2016-09-14T06:23:00Z
0: Content-Type : application/vnd.ms-word2006ml
0: X-Parsed-By : org.apache.tika.parser.DefaultParser
0: X-Parsed-By : org.apache.tika.parser.microsoft.ooxml.xwpf.Word2006MLParser
0: creator : Auti, Sneha (QAPM)
0: meta:author : Auti, Sneha (QAPM)
0: meta:creation-date : 2016-09-14T06:16:00Z
0: extended-properties:Application : Microsoft Office Word
0: Creation-Date : 2016-09-14T06:16:00Z
0: cp:lastModifiedBy : Auti, Sneha (QAPM)
0: extended-properties:Template : Normal.dotm
0: X-TIKA:parse_time_millis : 194
0: Author : Auti, Sneha (QAPM)
0: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.xwpf.Word2006MLParser" />
<meta name="Content-Type" content="application/vnd.ms-word2006ml" />
<title></title>
</head>
<body><p>It means that the guy that you are trading with was reported for a scam attempt. As the others mentioned, some of these BOFA could be false. </p>
<p>What's important is the current trade that you are doing. </p>
<p>If everything seems to be in order then there is nothing wrong with going through with the trade. </p>
<p />
</body></html>
0: meta:page-count : 1
{noformat}
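
The Content-Type above (application/vnd.ms-word2006ml) and the second X-Parsed-By entry suggest the file was handled as a Word 2006 flat XML package rather than as 2003 WordML. As a rough way to check which flavor a raw Word XML file is, here is a JDK-only sketch that looks at the namespace of the root element. The class name WordXmlSniffer, the method classify, and the returned labels are made up for illustration and are not part of Tika. The sketch assumes the root element alone is enough to tell the two formats apart, which may not hold for every file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a Tika class: classifies raw Word XML by its root element.
class WordXmlSniffer {
    // Namespace of the 2003 WordML root element (w:wordDocument).
    static final String WORDML_2003_NS =
            "http://schemas.microsoft.com/office/word/2003/wordml";
    // Namespace of the 2006 flat OPC root element (pkg:package).
    static final String PKG_2006_NS =
            "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Returns an illustrative label: "wordml-2003", "flat-opc-2006" or "unknown".
    static String classify(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // Skip comments and processing instructions until the root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String ns = reader.getNamespaceURI();
                    String local = reader.getLocalName();
                    if (WORDML_2003_NS.equals(ns) && "wordDocument".equals(local)) {
                        return "wordml-2003";
                    }
                    if (PKG_2006_NS.equals(ns) && "package".equals(local)) {
                        return "flat-opc-2006";
                    }
                    return "unknown";
                }
            }
            return "unknown";
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample standing in for a real file.
        String sample = "<?xml version=\"1.0\"?>"
                + "<pkg:package xmlns:pkg=\"" + PKG_2006_NS + "\"/>";
        System.out.println(classify(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Note that a flat-opc-2006 result only says the file is a 2006 flat package. Whether it holds a Word document still depends on the parts inside it.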

> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
>                 Key: TIKA-2179
>                 URL: https://issues.apache.org/jira/browse/TIKA-2179
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: OSX, java 8
>            Reporter: Sean Story
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: File5.xml
>
>
> h3. Problem
> I have a sample Word XML file (attached as File5.xml) that neither 
> OOXMLParser nor OfficeParser can parse. OOXMLParser yields an exception 
> that was {{Caused by: 
> org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
> data appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}, and OfficeParser yields an exception like 
> {{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
> appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}.
> I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the 
> source, built it, and updated my Tika version to 1.14. However, when parsing 
> with WordMLParser, the output text content I get is the empty string {{""}}, 
> but I'm expecting something more like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam 
> attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going 
> through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
> h3. Replication
> You can replicate this with the Spock test below:
> {noformat}
>     def "display error with WordMLParser"(){
>         setup:
>         File input = new File("/Users/sstory/Downloads/File5.xml") // modify for your path
>         Parser parser = new WordMLParser()
>         //Parser parser = new OOXMLParser()
>         //Parser parser = new OfficeParser()
>         org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
>         Metadata metadata = new Metadata()
>         ParseContext context = new ParseContext()
>         
>         when:
>         parser.parse(input.newInputStream(), textHandler, metadata, context)
>         String result = textHandler.toString()
>         then:
>         !result.isEmpty()
>         result.contains("the guy that you are trading with")
>         result.contains("BOFA")
>     }
> {noformat}
