[jira] [Commented] (TIKA-1030) Page extraction for Word,Excel Documents

Nick Burch (JIRA) Fri, 23 Nov 2012 08:17:06 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503239#comment-13503239
 ]


Nick Burch commented on TIKA-1030:
----------------------------------

Excel isn't a page-based format, so there's no page information to return.

Generally, Word doesn't store page information in the file; it normally 
computes it on the fly based on the page / printer / font settings. There may 
be some paging information in the file, at the very least forced page breaks. 
We could look at adding those in, but you won't get the same thing as in a PDF, 
as the file format isn't laid out the same way. (PDF is a page-based format; 
Word is more similar to something like HTML, in terms of text with styling.)
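
To illustrate what "forced page breaks" would give you, here is a rough sketch 
that splits the text of a .docx file wherever an explicit page break 
(<w:br w:type="page"/>) appears in word/document.xml. It is not Tika or POI 
code. The function names are made up for this example, it uses only the Python 
standard library, and it assumes the newer .docx (OOXML) format rather than 
the legacy binary .doc format. Pages that Word creates by reflowing text are 
not detected at all, which is exactly the limitation described above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def split_on_forced_breaks(document_xml):
    """Split document text at explicit page breaks (hypothetical helper).

    Only <w:br w:type="page"/> elements start a new chunk; page boundaries
    that Word computes at layout time are invisible here.
    """
    root = ET.fromstring(document_xml)
    pages = [[]]
    for elem in root.iter():  # document order
        if elem.tag == W + "p" and pages[-1]:
            # Separate paragraphs within the same chunk with a newline.
            pages[-1].append("\n")
        elif elem.tag == W + "t" and elem.text:
            pages[-1].append(elem.text)
        elif elem.tag == W + "br" and elem.get(W + "type") == "page":
            pages.append([])
    return ["".join(parts).strip() for parts in pages]


def forced_pages_in_docx(path):
    """Read word/document.xml from a .docx and split it (hypothetical helper)."""
    with zipfile.ZipFile(path) as package:
        return split_on_forced_breaks(package.read("word/document.xml"))
```

A document with no forced breaks comes back as a single chunk, so for most 
real files this yields far fewer "pages" than Word would show on screen.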
                
> Page extraction for Word,Excel Documents
> ----------------------------------------
>
>                 Key: TIKA-1030
>                 URL: https://issues.apache.org/jira/browse/TIKA-1030
>             Project: Tika
>          Issue Type: Improvement
>         Environment: For use with Solr
>            Reporter: David vandendriessche
>              Labels: solr_cell, tika
>
> I would like to extract pages from Word docs and Excel sheets. 
> Reason: I'm using Solr to search files and give page hit results. For this I 
> used PDFBox for page extraction. Now I would like to upload other doctypes, 
> but I can't seem to find paging support for them.

