This is an automated email from the ASF dual-hosted git repository.
tallison pushed a change to branch TIKA-4692-improve-ooxml-sax-parsers
in repository https://gitbox.apache.org/repos/asf/tika.git
from 04fb507c06 improve sax ooxml - docx and pptx tests - WIP
add 991a6297b4 TIKA-4327: update aws
add 1318754622 TIKA-4327: update kotlin
add 720e083421 TIKA-4327: put kotlin version variable in parent
add ee96f8834a TIKA-4327: update sqlite
add 48c9e93734 TIKA-4563 -- on main: cherry-pick updates from branch_3x found during regression tests and the release process (#2699)
add 5ce15f2b39 Merge branch 'main' into TIKA-4692-improve-ooxml-sax-parsers
add 81b2f29c13 refactor based on fresh commoncrawl - WIP
add 1f6b3d04db refactor based on fresh commoncrawl - WIP
add 88307771f4 refactor based on fresh commoncrawl - WIP
add 3ebf8fd77b string index out of bounds exception
add 1d87374184 checkpoint - wip
add 18cd618ad0 checkpoint - wip
add 7117454ca1 checkpoint - wip
No new revisions were added by this update.
Summary of changes:
CHANGES.txt | 4 +
tika-bom/pom.xml | 61 ++++++
.../main/java/org/apache/tika/metadata/Office.java | 4 +
.../org/apache/tika/sax/XHTMLContentHandler.java | 27 +++
.../main/java/org/apache/tika/utils/DateUtils.java | 3 +
.../org/apache/tika/mime/tika-mimetypes.xml | 6 +-
.../org/apache/tika/eval/app/ExtractComparer.java | 57 +++++-
.../src/main/resources/comparison-reports-tags.xml | 25 +++
.../src/main/resources/comparison-reports.xml | 26 +++
tika-parent/pom.xml | 16 +-
.../parser/microsoft/AbstractPOIFSExtractor.java | 2 +-
...attingTagManager.java => InlineTagManager.java} | 98 ++++++++--
.../microsoft/ooxml/OOXMLTikaBodyPartHandler.java | 126 ++++++++++---
.../ooxml/OOXMLWordAndPowerPointTextHandler.java | 66 ++++---
.../microsoft/ooxml/ParagraphProperties.java | 9 +
.../ooxml/SXSLFPowerPointExtractorDecorator.java | 207 ++++++++++++++++-----
.../ooxml/SXWPFWordExtractorDecorator.java | 57 ++++--
.../ooxml/XSSFExcelExtractorDecorator.java | 23 ++-
.../microsoft/ooxml/XWPFBodyContentsHandler.java | 11 ++
.../parser/microsoft/ooxml/OOXMLDocxSAXTest.java | 2 +-
.../parser/microsoft/ooxml/OOXMLPptxSAXTest.java | 2 +-
.../resources/test-documents/testWORD_2006ml.docx | Bin 165566 -> 151733 bytes
.../java/org/apache/tika/parser/pkg/ZipParser.java | 4 +
tika-pipes/tika-pipes-config-store-ignite/pom.xml | 2 +-
.../tika-pipes-microsoft-graph/pom.xml | 1 -
25 files changed, 672 insertions(+), 167 deletions(-)
rename tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/{FormattingTagManager.java => InlineTagManager.java} (61%)