[ https://issues.apache.org/jira/browse/TIKA-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767677#comment-17767677 ]
David Xie commented on TIKA-3828:
---------------------------------

Wanted to bump this thread; still observing this issue on Tika versions 2.7.0, 2.8.0, and 2.9.0.

The issue still occurs on docs with only one section if there are multiple pages, especially if there are large amounts of text. Adding new pages and text results in different parts of the body getting dropped.

All the docs I'm seeing this behavior for fall under the `isLegacyOrAlternativePackaging` category.

> OneNote Parser - Parsed Files are Missing Parts of the Content
> --------------------------------------------------------------
>
>                 Key: TIKA-3828
>                 URL: https://issues.apache.org/jira/browse/TIKA-3828
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1, 1.28.4
>            Reporter: Gordon Vidal
>            Priority: Major
>         Attachments: TestSection1 (1).one, TikaParserErrorScreenshot.png
>
>
> OneNote files that I receive from Sharepoint Online are currently not parsed
> correctly. See the attached screenshot and OneNote section file.
> I have been able to consistently reproduce this issue doing the following:
> * Create a OneNote Document with multiple sections.
> * Edit the OneNote Document using the option "Open in Desktop App" and make
> changes in different sections, saving between edits. I have used both OneNote
> 2016 (Version 1808) and OneNote 2021 (Version 2108).
> * Download a section of the OneNote Document using the Sharepoint Online
> REST API.
> I will be investigating this issue myself as well. The Tika codebase is quite
> new to me so any information about the status of this bug, the potential
> cause and any plans to fix it would be very welcome.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)