[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689131#comment-17689131 ]
David Avant commented on TIKA-3970:
-----------------------------------

Sadly, I am not aware of any free, non-Microsoft viewers. But I have not spent much time searching.

> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: lyrics.docx, lyrics.one, lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to OneNote.
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However, in the actual
> displayed text, it occurs only once.
> The original text in this document is only about 12K characters, but the
> extracted text from Tika is over 300K.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)