[jira] [Comment Edited] (TIKA-2179) WordMLParser fails to parse a word xml file

Tim Allison (JIRA) Wed, 23 Nov 2016 11:45:18 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691150#comment-15691150
 ]


Tim Allison edited comment on TIKA-2179 at 11/23/16 7:44 PM:
-------------------------------------------------------------

I committed a reasonable first pass at this.

Still left on the list for further work on other tickets (so that I don't 
forget):

1. Convert to a double pass...read the extra stuff first, then parse the main 
document.xml.  The list info comes after the document, and on a single pass, 
that and several other things fall by the wayside.  Hyperlinks happen to work, 
but that's only because those rels happen to come before document.xml in the 
test doc.
2. Add macro extraction from the ole.bin
3. Make inline image markup consistent with xwpf
4. Figure out how to handle the chart data
5. Include proper div markings for non main document content, footers, headers, 
etc.
6. We are skipping "alternateContent" {{Fallback}} in favor of {{Choice}}.  
At least with the chart in the test file, this is not the right choice.  Which 
should we pick?
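
To make the double-pass idea from the first item concrete, here is a minimal 
Java sketch. It is not the code in the commit. The class and method names 
(TwoPassSketch, readRelationships, readBody) and the SAMPLE document are made 
up for illustration, and it simply re-parses an in-memory string, whereas a 
real parser would presumably have to buffer or spool the stream to read it 
twice. In SAMPLE the rels part deliberately comes after document.xml, which is 
the ordering that a single pass cannot handle.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names come from Tika or POI.
final class TwoPassSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static final String REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships";

    // Made-up flat package: the rels part comes AFTER the main document part.
    static final String SAMPLE =
        "<pkg:package xmlns:pkg='http://schemas.microsoft.com/office/2006/xmlPackage'>"
        + "<pkg:part pkg:name='/word/document.xml'><pkg:xmlData>"
        + "<w:document xmlns:w='" + W_NS + "' xmlns:r='" + R_NS + "'><w:body><w:p>"
        + "<w:hyperlink r:id='rId1'><w:r><w:t>Tika</w:t></w:r></w:hyperlink>"
        + "</w:p></w:body></w:document></pkg:xmlData></pkg:part>"
        + "<pkg:part pkg:name='/word/_rels/document.xml.rels'><pkg:xmlData>"
        + "<Relationships xmlns='" + REL_NS + "'>"
        + "<Relationship Id='rId1' Target='https://tika.apache.org/'/>"
        + "</Relationships></pkg:xmlData></pkg:part></pkg:package>";

    // Pass one: collect relationship Id -> Target and ignore everything else.
    static Map<String, String> readRelationships(String xml) {
        Map<String, String> targets = new HashMap<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (REL_NS.equals(uri) && "Relationship".equals(local)) {
                    targets.put(atts.getValue("", "Id"), atts.getValue("", "Target"));
                }
            }
        });
        return targets;
    }

    // Pass two: emit <w:t> text, resolving <w:hyperlink r:id> against the map.
    static String readBody(String xml, Map<String, String> targets) {
        StringBuilder out = new StringBuilder();
        parse(xml, new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = true;
                } else if ("hyperlink".equals(local)) {
                    String target = targets.get(atts.getValue(R_NS, "id"));
                    // "?" marks a link whose relationship was not known in time.
                    out.append('[').append(target == null ? "?" : target).append("] ");
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    // Both passes in order: auxiliary parts first, then the main document.
    static String extract(String xml) {
        return readBody(xml, readRelationships(xml));
    }

    // Checked exceptions are wrapped to keep the sketch short.
    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```

Calling readBody with an empty map stands in for the single-pass case: the 
link target is still unknown when the hyperlink is reached, so it falls by the 
wayside. The same pattern would apply to the list info.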

What this has that our current docx extractor doesn't at the moment:
1) no beans, purely read only <wild_speculation>should have better memory 
footprint</wild_speculation> (see also TIKA-1321)
2) ability to choose whether or not to extract deleted text (TIKA-2036)
3) ability to handle glossary document content  (TIKA-2163)
4) <wild_speculation>I _think_ this should be immune to the rare unicode bugs 
that we've seen with DOM...I need to test this (see 
TIKA-1961)</wild_speculation>
5) <wild_speculation>we're not likely to miss content because we're grabbing 
<w:t> wherever they are (TIKA-1317 and friends).  </wild_speculation>
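
As a rough illustration of points 2 and 5, here is a minimal SAX sketch in 
Java. It is an assumption about the general approach, not the code in the 
commit; the class name WtTextSketch and the includeDeleted flag are made up 
for this example. It appends the characters of every <w:t> it sees, at 
whatever depth, and only includes <w:delText> (where Word keeps the text of 
tracked deletions) when asked to.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; none of these names come from Tika or POI.
final class WtTextSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of all <w:t> elements, one line per <w:p>.
    // Deleted text (<w:delText>) is included only if includeDeleted is true.
    static String extract(String xml, boolean includeDeleted) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local) || (includeDeleted && "delText".equals(local))) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local) || "delText".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    out.append('\n'); // paragraph boundary
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("XML parse failed", e);
        }
        return out.toString();
    }
}
```

Because the handler never tracks which container it is in, text inside table 
cells or other nested structures is picked up the same way as body text. The 
cost is that it knows nothing about where the text came from.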

On the down side...this re-invents several helper classes from POI and Tika 
(!), which I really, really regret.

Open questions:
1. [~gagravarr] and fellow devs, how does this look 
[commit|https://git-wip-us.apache.org/repos/asf?p=tika.git;a=commit;h=81fad8c97e60a3de7d926dc4ce10cbd235549583]?
  Anything crazy that ought to be fixed, including the mime-type?
2. Is there any way to move most of this into POI?  The current OPCPackage and 
the rest of the code appears to be tightly tied to ZipPackage and beans.  I 
could add this stuff as a standalone streaming/readonly xwpf set of objects, 
but do we want that in POI?
3.  What do you think of converting our current docx processing to these 
classes?  I don't think it would take much to rework a bit to pull the related 
bits from the zip and then process the document.xml as we're currently doing. 


was (Author: talli...@mitre.org):
I committed a reasonable first pass at this.

Still left on the list for further work on other tickets (so that I don't 
forget):

1. Convert to a double pass...read the extra stuff first, then parse the main 
document.xml.  The list info comes after the document, and on a single pass, 
that and several other things fall by the wayside.  Hyperlinks happen to work, 
but that's only because those rels happen to come before document.xml in the 
test doc.
2. Add macro extraction from the ole.bin
3. Make inline image markup consistent with xwpf
4. Figure out how to handle the chart data
5. Include proper div markings for non main document content, footers, headers, 
etc.
6. We are skipping "alternateContent" {{Fallback}} in favor of {{Choice}}.  
At least with the chart in the test file, this is not the right choice.  Which 
should we pick?

What this has that our current docx extractor doesn't at the moment:
1) no beans, purely read only <wild_speculation>should have better memory 
footprint</wild_speculation>
2) ability to choose whether or not to extract deleted text (TIKA-2036)
3) ability to handle glossary document content 
4) <wild_speculation>I _think_ this should be immune to the rare unicode bugs 
that we've seen with DOM...I need to test this (see 
TIKA-1961)</wild_speculation>
5) <wild_speculation>we're not likely to miss content because we're grabbing 
<w:t> wherever they are (TIKA-1317 and friends).  </wild_speculation>

On the down side...this re-invents several helper classes from POI, which I 
really, really regret.

Open questions:
1. [~gagravarr] and fellow devs, how does this 
look[commit|https://git-wip-us.apache.org/repos/asf?p=tika.git;a=commit;h=81fad8c97e60a3de7d926dc4ce10cbd235549583]?
  Anything crazy that ought to be fixed, including the mime-type?
2. Is there any way to move most of this into POI?  The current OPCPackage and 
the rest of the code appears to be tightly tied to ZipPackage and beans.  I 
could add this stuff as a standalone streaming/readonly xwpf set of objects, 
but do we want that in POI?
3.  What do you think of converting our current docx processing to these 
classes?  I don't think it would take much to rework a bit to pull the related 
bits from the zip and then process the document.xml as we're currently doing. 

> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
>                 Key: TIKA-2179
>                 URL: https://issues.apache.org/jira/browse/TIKA-2179
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: OSX, java 8
>            Reporter: Sean Story
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: File5.xml
>
>
> h3. Problem
> I have a sample word xml file (attached as File5.xml) that can be parsed by 
> neither OOXMLParser (yields an exception that was {{Caused by: 
> org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied 
> data appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}) nor by OfficeParser (yields an exception like: 
> {{org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data 
> appears to be a raw XML file. Formats such as Office 2003 XML are not 
> supported}}).
> I found TIKA-1958, which mentioned the new WordMLParser, so I downloaded the 
> source, built it, and updated my Tika version to 1.14. However, when parsing 
> with WordMLParser, the output text content I get is the empty string {{""}}, 
> but I'm expecting something more like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam 
> attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going 
> through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
> h3. Replication
> You can replicate with the below Spock test
> {noformat}
>     def "display error with WordMLParser"(){
>         setup:
>         File input = new File("/Users/sstory/Downloads/File5.xml") // modify for your path
>         Parser parser = new WordMLParser()
>         //Parser parser = new OOXMLParser()
>         //Parser parser = new OfficeParser()
>         org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
>         Metadata metadata = new Metadata()
>         ParseContext context = new ParseContext()
>         
>         when:
>         parser.parse(input.newInputStream(), textHandler, metadata, context)
>         String result = textHandler.toString()
>         then:
>         !result.isEmpty()
>         result.contains("the guy that you are trading with")
>         result.contains("BOFA")
>     }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
