https://issues.apache.org/bugzilla/show_bug.cgi?id=50428
Summary: Need a way to avoid OutOfMemoryError's in RawDataBlockList
Product: POI
Version: unspecified
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: critical
Priority: P2
Component: POIFS
AssignedTo: [email protected]
ReportedBy: [email protected]
We are processing very large MS Office files with the heap size tightly
limited to 100MB. This causes OutOfMemoryErrors in RawDataBlockList:
java.lang.OutOfMemoryError: KERNEL-10 : Java heap space
    at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:68)
    at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:53)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:155)
RawDataBlockList loads all the blocks up to the end of the file. Is there any
way to limit this, perhaps with an optional "sliding window" of blocks that
gets repopulated on demand?
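To make the "sliding window" idea concrete, here is a minimal sketch of a bounded, least-recently-used cache of fixed-size blocks that are read from the file only when requested. This is not POI code: `BlockWindow`, `getBlock`, and `cachedBlockCount` are hypothetical names, and the sketch assumes blocks are numbered from the start of the file and that the file is available through a `FileChannel` rather than a plain `InputStream`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded block cache; illustrative only, not part of POI. */
final class BlockWindow {
    private final FileChannel channel;
    private final int blockSize;
    private final Map<Integer, byte[]> cache;

    BlockWindow(FileChannel channel, int blockSize, final int maxBlocks) {
        this.channel = channel;
        this.blockSize = blockSize;
        // Access-ordered map: once the window is full, the least recently
        // used block is dropped whenever a new one is added.
        this.cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns the block at the given index, reading it from disk on a miss. */
    byte[] getBlock(int index) throws IOException {
        byte[] block = cache.get(index);
        if (block == null) {
            block = readBlock(index);
            cache.put(index, block);
        }
        return block;
    }

    /** Number of blocks currently held in memory. */
    int cachedBlockCount() {
        return cache.size();
    }

    private byte[] readBlock(int index) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        long start = (long) index * blockSize;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, start + buf.position());
            if (n < 0) {
                break; // end of file reached
            }
        }
        if (buf.position() == 0) {
            throw new IOException("Block " + index + " is past the end of the file");
        }
        return buf.array(); // a short final block stays zero-padded
    }
}
```

With a 512-byte block size and a window of, say, 1024 blocks, the cache itself would hold about 512KB of block data regardless of file size, at the cost of re-reading evicted blocks.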
As a quicker fix, it would be sufficient to have a way to ascertain whether a
given Office file is Excel, Word, or PPT. Currently, once we know from the
magic bytes that a file is an Office doc, we try to read the 'application
name' within the POI fs:
    public boolean isRecognized(DocumentPayload payload) {
        String application = null;
        try {
            application = getApplicationName(payload.getContentStream(),
                    payload.getDocId());
        } catch (Exception ex) {
            // Non-fatal: fall through and report the document as unrecognized.
            log.warn(TextExtractionError.ERROR, ex,
                    "NON-FATAL error (proceeding with text extraction). "
                    + "Failed to determine application for document. Payload: %s.",
                    payload);
        }
        if (application == null) {
            return false;
        }
        String lower = application.toLowerCase();
        return lower.contains(EXCEL) && lower.contains(MICROSOFT);
    }
Where
    protected String getApplicationName(InputStream is, String docId)
            throws IOException {
        String application = null;
        try {
            POIFSFileSystem filesystem = new POIFSFileSystem(is);
            // First, try to extract the application name from the metadata.
            SummaryInformation si = null;
            PropertySet ps2 = getPropertySet(filesystem,
                    SummaryInformation.DEFAULT_STREAM_NAME, docId);
            if (ps2 instanceof SummaryInformation) {
                si = (SummaryInformation) ps2;
            }
            application = (si == null) ? null :
                    StringUtils.trim(si.getApplicationName());
            // Unfortunately, the app name may not be present in the document
            // metadata. If that is the case, see if the file system has an
            // entry by which we can tell that the document matches the type.
            if (StringUtils.isEmpty(application) &&
                    hasDistinguishedEntry(filesystem)) {
                application = getDefaultApplicationName();
            }
        } finally {
            is.close();
        }
        return application;
    }
And 'hasDistinguishedEntry' is as follows, e.g. for Excel:
    protected boolean hasDistinguishedEntry(POIFSFileSystem filesystem) {
        boolean hasIt = true;
        // See if the Workbook entry is there
        try {
            filesystem.getRoot().getEntry("Workbook");
        } catch (FileNotFoundException fe) {
            // Try the upper case form
            try {
                filesystem.getRoot().getEntry("WORKBOOK");
            } catch (FileNotFoundException wfe) {
                // Try Book
                try {
                    filesystem.getRoot().getEntry("Book");
                } catch (FileNotFoundException wfee) {
                    hasIt = false;
                }
            }
        }
        return hasIt;
    }
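The same entry-name test could cover all three types in one pass. The sketch below is illustrative only: `OfficeTypeGuess`, `Kind`, and `fromRootEntryNames` are hypothetical names, and it assumes the root entry names have already been collected into a set. The Word and PowerPoint stream names ("WordDocument", "PowerPoint Document") are the conventional ones for those formats and do not come from our code above. Note that this does not reduce memory use by itself, since listing the root entries still requires opening the filesystem.

```java
import java.util.Set;

/** Hypothetical helper; maps root entry names to a document type. */
final class OfficeTypeGuess {
    enum Kind { EXCEL, WORD, POWERPOINT, UNKNOWN }

    static Kind fromRootEntryNames(Set<String> names) {
        // Same three Excel names that hasDistinguishedEntry tries in turn.
        if (names.contains("Workbook") || names.contains("WORKBOOK")
                || names.contains("Book")) {
            return Kind.EXCEL;
        }
        // Assumed conventional stream names for Word and PowerPoint.
        if (names.contains("WordDocument")) {
            return Kind.WORD;
        }
        if (names.contains("PowerPoint Document")) {
            return Kind.POWERPOINT;
        }
        return Kind.UNKNOWN;
    }
}
```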
If we can avoid doing all this, the OutOfMemoryError issue becomes less
significant. Otherwise we need a way to curtail memory consumption on the
block list side while still having access to properties and entries.
Any advice or recommendations?
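For context, the magic-byte check mentioned earlier is cheap and needs no POI filesystem at all. A minimal sketch, assuming the stream supports mark/reset (`Ole2Sniffer` and `looksLikeOle2` are hypothetical names, not our production code): it compares the first eight bytes against the OLE2 compound document signature. It only identifies the container format and cannot distinguish Excel, Word, and PPT, which is why the heavier lookup above is needed.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; checks only for the OLE2 container signature. */
final class Ole2Sniffer {
    // OLE2 compound document header signature: D0 CF 11 E0 A1 B1 1A E1.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Peeks at the first eight bytes, then rewinds the stream. */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        try {
            for (byte expected : OLE2_MAGIC) {
                int b = in.read();
                if (b < 0 || (byte) b != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset(); // leave the stream positioned where it started
        }
    }
}
```

A plain `InputStream` can be wrapped in a `BufferedInputStream` to obtain mark/reset support.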