[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692989#comment-17692989 ]
Nicholas DiPiazza commented on TIKA-3970:
-----------------------------------------

On a Windows PC, I log into [https://account.microsoft.com/services/microsoft365/details#install] and then click where it says "Install Office".

Eventually you should have a copy of Office installed on your machine.

Then you should be able to open all of these files:

tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote1.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote3.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote4.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2007OrEarlier.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNote2016.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteEmbeddedWordDoc.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testOneNoteFromOffice365-2.one
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/test-tika-3970-dupetext.one
> Certain OneNote documents produce duplicate text
> ------------------------------------------------
>
>                 Key: TIKA-3970
>                 URL: https://issues.apache.org/jira/browse/TIKA-3970
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.7.0
>            Reporter: David Avant
>            Priority: Minor
>         Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png,
>                      lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx,
>                      lyrics.one, lyrics.txt
>
>
> Extracting text from certain OneNote documents produces more text than is
> actually in the document. In this case, the OneNote document was created
> by opening a Word document and "printing" it to the OneNote.
>
> To reproduce the issue, open the attached "lyrics.one" using the Tika App
> version 2.7.0 and view the plain text. Look for the phrase "Sunday
> Morning" and observe that there are 14 occurrences. However in the actual
> displayed text, it occurs only once.
>
> The original text in this document is only about 12K characters, but the
> extracted text from tika is over 300K.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
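
The reproduction step in the quoted description (counting how often "Sunday Morning" appears in the extracted text) could be scripted instead of checked by eye. The following is only a minimal sketch under stated assumptions: it does not call Tika, it expects the plain text to have been saved to a file beforehand, and the class name `PhraseCount` and its method are hypothetical helpers, not part of Tika.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: counts how often a phrase
// appears in text that has already been extracted and saved to disk.
final class PhraseCount {

    /** Counts non-overlapping, case-sensitive occurrences of phrase in text. */
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            throw new IllegalArgumentException("phrase must not be empty");
        }
        int count = 0;
        int from = 0;
        // indexOf returns -1 once no further match exists
        while ((from = text.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length(); // skip past this match
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a file holding the extracted plain text (assumed UTF-8)
        // args[1]: phrase to look for, e.g. "Sunday Morning"
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countOccurrences(text, args[1]));
    }
}
```

If the report is accurate, running this against text extracted from "lyrics.one" with Tika 2.7.0 should print 14 for "Sunday Morning", whereas the displayed document contains the phrase once. A phrase that wraps across a line break in the extracted text would not be matched by this exact-string search.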