[ 
https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916738#action_12916738
 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

I didn't quite mean that Tika should output the same level of HTML as 
OpenOffice.org, merely that it would be nice if Tika's OfficeParser, 
OpenDocumentParser and OOXMLParser could output consistent HTML for the same 
document content represented in different formats. Currently there are 
idiosyncratic differences that mean my code analyses the HTML output from each 
format slightly differently.

As for the colour fonts, my boss is willing to let me have some time to work on 
it if it proves feasible. Do you have any explicit pointers to the Word and POI 
specs that might help me? I'm not quite sure where to start looking.

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 0.8
>
>         Attachments: sample.doc, tika-word11.patch, tika-word12.patch, 
> tika-word6.patch, tika-word9.patch
>
>
> There are several parts of the word documents (.doc and .docx) that we don't 
> currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body is used on a 
> paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
