[ 
https://issues.apache.org/jira/browse/TIKA-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283116#comment-14283116
 ] 

Uwe Schindler commented on TIKA-1523:
-------------------------------------

Yes. It extracts just the metadata. So I think this is an issue with this old 
version of Word.

In fact, when you open the file in Word, it of course shows the real pages and 
recalculates the count, but initially it also shows "1". Here, the metadata 
as saved in the file is simply "1" or maybe nothing (see below). POI does not 
"reflow" the layout to calculate that information.

This is why the metadata is only updated by the word processing program on 
opening and editing the file. If you instruct Word 2010 to open the file "read 
only" (which it does because it's downloaded from the internet), it shows "" in 
the page column. See the 2nd screenshot. So this is clearly a bug in this file, 
not TIKA's or POI's issue.

> metadata extractor gets the wrong number of pages of some documents Microsoft 
> Word 9.0
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-1523
>                 URL: https://issues.apache.org/jira/browse/TIKA-1523
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.7
>         Environment: Ubuntu
>            Reporter: Yamileydis Veranes
>            Assignee: Konstantin Gribov
>         Attachments: Sigmund Freud.doc, screenshot-1.png
>
>
> When I extract the metadata from a Microsoft Word 9.0 document that has 10 
> pages, the extractor reports that it has only 1 page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
