https://bz.apache.org/bugzilla/show_bug.cgi?id=60471
Bug ID: 60471
Summary: Not loading AlternateContent in XWPF
Product: POI
Version: 3.16-dev
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: XWPF
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: ---
Created attachment 34522
--> https://bz.apache.org/bugzilla/attachment.cgi?id=34522&action=edit
triggering file based on testWORD_2006ml.docx in Tika
XWPFDocument's onDocumentLoad() looks for paragraphs, tables and sdts at the
main level of the body. As we saw with Bug 54849 (SDTs), there can be other
intervening structures between the body and text-containing elements.
I recently noticed that AlternateContent elements can also appear at the body
level, and we should probably add those to our document model.
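To illustrate the gap, here is a minimal JDK-only sketch. It is not POI code: the class name, method names and the heavily simplified document.xml string are all made up for this example. It lists the direct children of w:body, then collects text only from the w:p, w:tbl and w:sdt children, roughly the way a body-level scan would.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class BodyChildrenSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Hypothetical, heavily simplified document.xml (not the attached file).
    static final String SAMPLE =
        "<w:document xmlns:w='" + W + "' xmlns:mc='" + MC + "'><w:body>"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires='wps'><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Choice>"
        + "<mc:Fallback><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Fallback>"
        + "</mc:AlternateContent>"
        + "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    // Parses the XML with namespace support and returns the w:body element.
    static Element body(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return (Element) doc.getElementsByTagNameNS(W, "body").item(0);
    }

    // Qualified names of the element children directly under w:body.
    static List<String> bodyChildNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        for (Node n = body(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    // Text of only those direct children that are w:p, w:tbl or w:sdt.
    static String textFromKnownChildren(String xml) throws Exception {
        Set<String> known = Set.of("p", "tbl", "sdt");
        StringBuilder sb = new StringBuilder();
        for (Node n = body(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && W.equals(n.getNamespaceURI())
                    && known.contains(n.getLocalName())) {
                sb.append(n.getTextContent());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bodyChildNames(SAMPLE));
        System.out.println(textFromKnownChildren(SAMPLE));
    }
}
```

With this sample, the title text lives only inside mc:AlternateContent, so a scan limited to the three known element types never sees it.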
To create this test file, I added a title page via Word's default "add a title
page" function.
In the SAX parser that I added to Tika, I chose to extract text from the
Fallback section, on the theory that it would have the more easily parseable
content. If we're modeling read/write in our DOM/XWPFDocument, we'll probably
want to point to both Fallback and Choice?
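A rough sketch of the Fallback-only idea, using just the JDK's SAX parser. This is not the Tika implementation: the class name, the sample XML and the one-newline-per-paragraph convention are assumptions made for illustration. The handler skips everything under mc:Choice, so text that appears in both Choice and Fallback is emitted once.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class FallbackTextSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Hypothetical, heavily simplified document.xml (not the attached file).
    static final String SAMPLE =
        "<w:document xmlns:w='" + W + "' xmlns:mc='" + MC + "'><w:body>"
        + "<mc:AlternateContent>"
        + "<mc:Choice Requires='wps'><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Choice>"
        + "<mc:Fallback><w:p><w:r><w:t>MyDocumentTitle</w:t></w:r></w:p></mc:Fallback>"
        + "</mc:AlternateContent>"
        + "<w:p><w:r><w:t>engaging abstract</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    // Returns the w:t text outside of mc:Choice, one line per w:p.
    static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        StringBuilder out = new StringBuilder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            int choiceDepth = 0;    // greater than 0 while inside any mc:Choice
            boolean inText = false; // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (MC.equals(uri) && "Choice".equals(local)) {
                    choiceDepth++;
                } else if (W.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (MC.equals(uri) && "Choice".equals(local)) {
                    choiceDepth--;
                } else if (W.equals(uri) && "t".equals(local)) {
                    inText = false;
                } else if (W.equals(uri) && "p".equals(local) && choiceDepth == 0) {
                    out.append('\n'); // end of a paragraph we actually emitted
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText && choiceDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(SAMPLE));
    }
}
```

A read/write model would need to keep both branches. This sketch only covers the read-only extraction case and throws the Choice branch away.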
Unit test:
    public void testAlternateContent() throws IOException {
        XWPFDocument doc =
            XWPFTestDataSamples.openSampleDocument("testAlternateContent.docx");
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
        String txt = extractor.getText();
        // Each title-page string should be extracted exactly once.
        assertContainsSpecificCount("engaging abstract", txt, 1);
        assertContainsSpecificCount("MyDocumentTitle", txt, 1);
        assertContainsSpecificCount("MyDocumentSubtitle", txt, 1);
    }

    // Asserts that needle occurs exactly expectedCount times in haystack.
    private void assertContainsSpecificCount(String needle, String haystack,
            int expectedCount) {
        int index = haystack.indexOf(needle);
        int found = 0;
        while (index > -1) {
            found++;
            index = haystack.indexOf(needle, index + 1);
        }
        assertEquals(expectedCount, found);
    }