Programmer-yyds opened a new issue, #881:
URL: https://github.com/apache/poi/issues/881
**Description**
When reading embedded images from an XLSX file that contains many large
images, calling `XSSFPicture.getPictureData()` immediately loads each image
in full into memory (`byte[]`).
This quickly exhausts heap space and leads to an `OutOfMemoryError`.
In multi-threaded batch reading the problem is worse, because memory usage
grows linearly with the number of threads.
------
**Use Case**
In real-world business scenarios, we often need to associate image data
with other columns in the same row. For example:
- Column A: Product ID
- Column B: Product Name
- Column C: Product Image
When parsing, we need to accurately locate the image using its **row number
and column number**, and combine it with the text or numeric columns in the
same row to form a complete business record.
Therefore, during image parsing, it is necessary to map the image data
along with its **row and column indices**, rather than simply returning a flat
collection of images.
This structure makes it easier to align with other columns’ data,
especially in batch reading or multi-threaded processing.
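
The row-and-column mapping described above could be sketched as a small index keyed by a single cell coordinate instead of nested maps. This is only an illustration: `CellRef` and `ImageIndex` are hypothetical names, not part of Apache POI, and the value type is left generic so it can hold whatever image handle the caller prefers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Cell coordinate used as a single map key (hypothetical helper, not a POI type). */
record CellRef(int row, int col) {}

/** Position-indexed lookup of image handles; V is whatever handle the caller stores. */
final class ImageIndex<V> {
    private final Map<CellRef, V> byCell = new HashMap<>();

    /** Registers the image anchored at the given row and column. */
    void put(int row, int col, V image) {
        byCell.put(new CellRef(row, col), image);
    }

    /** Returns the image anchored at the given cell, if any. */
    Optional<V> at(int row, int col) {
        return Optional.ofNullable(byCell.get(new CellRef(row, col)));
    }

    int size() {
        return byCell.size();
    }
}
```

While iterating the text and numeric cells of a row, a caller could then ask `index.at(row, imageColumn)` to attach the matching image to the same business record.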
------
**Environment**
- Java 21
- Apache POI 5.4.0
- Test file: An XLSX file containing 1000 images, each 1 MB in size
------
**Steps to Reproduce**
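```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.poi.ooxml.POIXMLDocumentPart;
import org.apache.poi.ss.usermodel.PictureData;
import org.apache.poi.xssf.usermodel.XSSFClientAnchor;
import org.apache.poi.xssf.usermodel.XSSFDrawing;
import org.apache.poi.xssf.usermodel.XSSFPicture;
import org.apache.poi.xssf.usermodel.XSSFShape;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class PictureReader {

    /** Maps 1-based row -> 1-based column -> picture data for every picture on the sheet. */
    public static Map<Integer, Map<Integer, PictureData>> getPictures(XSSFSheet sheet) {
        Map<Integer, Map<Integer, PictureData>> rowToDataMap = new HashMap<>();
        for (POIXMLDocumentPart part : sheet.getRelations()) {
            if (!(part instanceof XSSFDrawing drawing)) {
                continue;
            }
            for (XSSFShape shape : drawing.getShapes()) {
                // A drawing can also hold text boxes, connectors, etc.; skip anything that is not a picture.
                if (!(shape instanceof XSSFPicture picture)) {
                    continue;
                }
                // getClientAnchor() returns the anchor stored in the drawing; getPreferredSize()
                // instead computes an anchor from the image's native dimensions.
                XSSFClientAnchor anchor = picture.getClientAnchor();
                int row = anchor.getRow1() + 1;
                int col = anchor.getCol1() + 1;
                // Problem: This loads the entire image into memory as byte[]
                PictureData pictureData = picture.getPictureData();
                if (pictureData != null) {
                    rowToDataMap
                            .computeIfAbsent(row, r -> new HashMap<>())
                            .put(col, pictureData);
                }
            }
        }
        return rowToDataMap;
    }

    public static void main(String[] args) throws IOException {
        try (XSSFWorkbook workbook =
                new XSSFWorkbook(Files.newInputStream(Paths.get("large_images.xlsx")))) {
            XSSFSheet sheet = workbook.getSheetAt(0);
            Map<Integer, Map<Integer, PictureData>> pictures = getPictures(sheet);
            System.out.println("Rows with pictures: " + pictures.size());
        }
    }
}
```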
------
**Expected Behavior**
- Provide a **lazy loading** mechanism so that the image data is not loaded
into memory until explicitly requested
- Provide an API that returns an `InputStream` instead of a `byte[]`
- Allow users to skip image parsing and only retrieve positional metadata
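
To make the request concrete, here is a minimal sketch of what a lazily loaded picture handle might look like. It is not an existing POI API: `StreamOpener` and `LazyPicture` are hypothetical names used only for illustration. The handle carries the positional metadata and defers any read until `openStream()` is called.

```java
import java.io.IOException;
import java.io.InputStream;

/** Opens a fresh stream over the image bytes on demand (hypothetical interface). */
@FunctionalInterface
interface StreamOpener {
    InputStream open() throws IOException;
}

/** Positional metadata plus a deferred way to read the bytes; nothing is read at construction. */
record LazyPicture(int row, int col, StreamOpener opener) {

    /** Opens the image data only when the caller explicitly asks for it. */
    InputStream openStream() throws IOException {
        return opener.open();
    }
}
```

With today's POI, one possible opener is `() -> pictureData.getPackagePart().getInputStream()`, assuming `pictureData` is an `XSSFPictureData`. Whether that actually keeps the bytes off the heap depends on how the package was opened, so treat it as an assumption to verify rather than a guaranteed fix.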
------
**Actual Behavior**
- `getPictureData()` immediately loads the entire image into a `byte[]`,
which can easily cause OOM for large files or when reading in multiple threads
------
**Possible Solutions**
- **Lazy loading**: Only load image data when explicitly requested by the
user
- **Streaming read**: Return `InputStream` instead of `byte[]`
- **Skip mode**: Allow ignoring image parsing when opening the XLSX file
- **Temporary file storage**: Write image data to temp files to reduce
memory pressure
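
As a rough illustration of the streaming and temp-file ideas, the sketch below bypasses POI and uses only `java.util.zip`. An XLSX file is a ZIP archive, and images are conventionally stored under `xl/media/`; that prefix is an assumption here, since a producer may choose other part names. `MediaSpiller` is a hypothetical helper, not part of POI. It does not recover row and column positions, which would still have to come from the drawing parts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class MediaSpiller {

    /** Streams every entry under xl/media/ into targetDir and returns the written paths. */
    static List<Path> spillMedia(Path xlsx, Path targetDir) throws IOException {
        List<Path> written = new ArrayList<>();
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("xl/media/")) {
                    continue;
                }
                // Keep only the file name so an entry cannot be written outside targetDir.
                Path out = targetDir.resolve(Path.of(name).getFileName().toString());
                try (InputStream in = zip.getInputStream(entry)) {
                    // Files.copy streams through a small buffer instead of building one byte[] per image.
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
                written.add(out);
            }
        }
        return written;
    }
}
```

Because each image is copied through a buffer, peak heap use stays roughly constant regardless of image size, at the cost of extra disk I/O.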