public static ZipArchiveThresholdInputStream openZipStream(InputStream stream) throws IOException {
    // Peek at the first few bytes to sanity check
    InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
    verifyZipHeader(checkedStream);

    // Open as a proper zip stream
    return new ZipArchiveThresholdInputStream(new ZipArchiveInputStream(checkedStream));
}

When calling parseToString, the code above runs *one time*: it constructs a ZipArchiveInputStream *one time*, and data is put into org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close is called *zero times*. Shouldn't we close it? Please help me. Thank you

On Sun, Mar 19, 2023 at 10:41 PM Tim Allison <talli...@apache.org> wrote:
>
> it called the ZipArchiveInputStream constructor three times (two for
> the media type, one for the parse), but only two times calling
> java.util.zip.Inflater#end()?
>
> Wait, are you calling close on your BufferedInputStream?
>
> On Sat, Mar 18, 2023 at 9:36 PM Darren <ppgf...@gmail.com> wrote:
> >
> > Thank you for your reply on the weekend, Tim!
> >
> > In my program, both methods (detect and parseToString) are used one
> > after the other to get the media type and the plain text.
> > We test millions of file samples every day, and have noticed that the
> > Java heap stays normal while off-heap memory keeps increasing until the
> > Java process is killed by the Linux oom-killer.
> > In my program, I don't use off-heap memory through native code myself.
> >
> > Last night I tested only the detect method to get the media type, and
> > everything seemed normal. Later I will test parseToString.
> >
> > I will also try your suggestion and test the program again. Thanks!
> >
> >
> > On Sat, Mar 18, 2023 at 7:46 PM Tim Allison <talli...@apache.org> wrote:
> >>
> >> Do you get the off-heap problem only on parseToString and not on detect?
> >>
> >> Not part of your question, but I'd recommend using
> >> TikaInputStream.get(file, metadata). It is far more efficient for
> >> zip-based files as well as PDFs and other parsers that require random
> >> access.
> >>
> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <ppgf...@gmail.com> wrote:
> >> >
> >> > Firstly, thank you for Tika, it is a great project!
> >> >
> >> > Recently I ran Tika (version 2.7.0) to extract text from documents,
> >> > and I found that Java off-heap memory kept increasing until memory
> >> > usage hit 100% and the process was killed by the oom-killer.
> >> >
> >> > Then I used pmap and dumped the data from memory (excluding the Java
> >> > heap), and found content like this:
> >> >
> >> > [Content_Types].xmlPK
> >> > _rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK
> >> > word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK
> >> > word/header2.xmlPK word/header3.xmlPK word/footer3.xmlPK
> >> > word/header1.xmlPK word/footer1.xmlPK
> >> > word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK
> >> > word/media/image3.pngPK word/media/image1.jpegPK
> >> > word/media/image2.jpegPK word/theme/theme1.xmlPK word/settings.xmlPK
> >> > customXml/itemProps2.xmlPK customXml/item2.xmlPK docProps/custom.xmlPK
> >> > customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK
> >> > customXml/itemProps1.xmlPK
> >> >
> >> > These are Office document entry names. Why are they in off-heap
> >> > memory? So I suspect that parsing Office documents causes a memory
> >> > leak.
> >> >
> >> > Another piece of information: when I debug the code on my own Mac,
> >> > using an xlsx file as the input sample,
> >> > tika.detect calls the ZipArchiveInputStream constructor twice, with
> >> > the same number of calls to java.util.zip.Inflater#end();
> >> > but tika.parseToString calls the ZipArchiveInputStream constructor
> >> > three times (two for the media type, one for the parse), with only
> >> > two calls to java.util.zip.Inflater#end().
> >> >
> >> > Is that what causes the off-heap memory leak, since Inflater uses
> >> > native code?
> >> >
> >> > Looking forward to your reply! Thank you very much!
> >> >
> >> > my test code:
> >> >
> >> > public static void extractByFacade(File file) throws Exception {
> >> >     Tika tika = new Tika();
> >> >     tika.setMaxStringLength(240);
> >> >     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> >> >
> >> >     final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
> >> >     final String mediaType = tika.detect(buffer, file.getName());
> >> >     // System.out.println("mediaType->" + mediaType);
> >> >
> >> >     final String content = tika.parseToString(buffer);
> >> >     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
> >> >     // System.out.println(content + " " + content.length());
> >> > }
> >> >
> >
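To make the off-heap mechanism discussed in this thread concrete: java.util.zip.Inflater wraps native zlib state that is released deterministically only by Inflater#end(), which is what ZipArchiveInputStream#close() ultimately triggers. Here is a minimal, plain-JDK sketch (no Tika involved; the class name InflaterEndDemo is my own) of that life cycle:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflaterEndDemo {
    public static void main(String[] args) throws Exception {
        byte[] original = "off-heap memory demo".getBytes(StandardCharsets.UTF_8);

        // Compress with Deflater; it also allocates native zlib state.
        Deflater deflater = new Deflater();
        deflater.setInput(original);
        deflater.finish();
        byte[] compressed = new byte[256];
        int compressedLen = deflater.deflate(compressed);
        deflater.end(); // frees the native deflate state deterministically

        // Decompress with Inflater.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLen);
        byte[] restored = new byte[256];
        int restoredLen = inflater.inflate(restored);

        // Without this call, the native inflate state lingers until the
        // Inflater object is garbage-collected -- exactly the kind of
        // off-heap growth described above when close()/end() is never
        // reached on one of the ZipArchiveInputStream instances.
        inflater.end();

        System.out.println(new String(restored, 0, restoredLen, StandardCharsets.UTF_8));
        // prints "off-heap memory demo"
    }
}
```

On the test code itself, note that the BufferedInputStream is never closed; per Tim's question, wrapping it in try-with-resources would be the first thing to check, though whether that alone releases the Inflater allocated during parseToString depends on whether the internal ZipArchiveInputStream also gets closed, which is what this thread is probing.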