Re: tika-offheap-memory-leak

Tim Allison Tue, 21 Mar 2023 08:26:43 -0700

Doh.  You're right.  I should have read our documentation on parseToString():


<strong>NOTE:</strong> Unlike most other Tika methods that take an
* {@link InputStream}, this method will close the given stream for
* you as a convenience.

I thought that you had cleared up the slow-building OOM, which was caused
by a colleague using JNI?  Are you still having problems, or are you just
curious about the 3 openings and 2 closings?  Let me break out the
debugger and take a look.
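
For anyone who wants to check the close behavior without a debugger, here is a minimal sketch that wraps the stream in a counter. It uses only the JDK; `readAllAndClose` and `peek` are hypothetical stand-ins (not Tika APIs) for a method that closes the stream for you, as the Javadoc above says parseToString() does, and for one that leaves it open.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CloseTrackingDemo {

    /** Wraps a stream and counts how many times close() is called on it. */
    static final class CloseTrackingInputStream extends FilterInputStream {
        private int closeCount = 0;

        CloseTrackingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closeCount++;
            super.close();
        }

        int closeCount() {
            return closeCount;
        }
    }

    /** Hypothetical stand-in for a callee that closes the caller's stream. */
    static int readAllAndClose(InputStream stream) throws IOException {
        try (InputStream in = stream) {
            int total = 0;
            while (in.read() != -1) {
                total++;
            }
            return total;
        }
    }

    /** Hypothetical stand-in for a callee that peeks and leaves the stream open. */
    static int peek(InputStream stream, int limit) throws IOException {
        stream.mark(limit);
        int n = 0;
        while (n < limit && stream.read() != -1) {
            n++;
        }
        stream.reset(); // rewind so the next consumer sees the full stream
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "PK fake zip bytes".getBytes(StandardCharsets.US_ASCII);
        CloseTrackingInputStream tracked =
                new CloseTrackingInputStream(new ByteArrayInputStream(data));

        peek(tracked, 4);
        System.out.println("closes after peek: " + tracked.closeCount());

        readAllAndClose(tracked);
        System.out.println("closes after readAllAndClose: " + tracked.closeCount());
    }
}
```

Passing a wrapper like this to the real calls would show which of them closes the outer stream; it says nothing about streams that a library opens internally, which is what the debugger is for.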

On Tue, Mar 21, 2023 at 7:49 AM Darren <ppgf...@gmail.com> wrote:
>
>
> public static ZipArchiveThresholdInputStream openZipStream(InputStream stream) throws IOException {
>     // Peek at the first few bytes to sanity check
>     InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
>     verifyZipHeader(checkedStream);
>
>     // Open as a proper zip stream
>     return new ZipArchiveThresholdInputStream(new ZipArchiveInputStream(checkedStream));
> }
>
>
> When parseToString is called, the code above runs once. It constructs one 
> ZipArchiveInputStream, which puts data into 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close is 
> called zero times.
>
> Shouldn't we close it? Please help me. Thank you.
>
>
> On Sun, Mar 19, 2023 at 10:41 PM Tim Allison <talli...@apache.org> wrote:
>>
>> > it called the ZipArchiveInputStream constructor three times (two for media 
>> > type, one for parse), but java.util.zip.Inflater#end() only twice?
>>
>> Wait, are you calling close on your BufferedInputStream?
>>
>> On Sat, Mar 18, 2023 at 9:36 PM Darren <ppgf...@gmail.com> wrote:
>> >
>> > Thank you for replying on the weekend, Tim!
>> >
>> > In my program, both methods (detect and parseToString) are called one after 
>> > another to get the media type and the plain text.
>> > We test millions of sample files every day, and we notice that the Java heap 
>> > stays normal while off-heap memory keeps growing until the Java process is 
>> > killed by the Linux oom-killer.
>> > My own code does not use off-heap memory through native code.
>> >
>> > Last night I tested only the detect method to get the media type, and 
>> > everything seemed normal. I will test parseToString later.
>> >
>> > I will also try your suggestion and test the program again. Thanks!
>> >
>> >
>> > On Sat, Mar 18, 2023 at 7:46 PM Tim Allison <talli...@apache.org> wrote:
>> >>
>> >> Do you get the off heap problem only on parseToString and not on detect?
>> >>
>> >> Not part of your question, but I'd recommend using
>> >> TikaInputStream.get(file, metadata).  It is far more efficient for
>> >> zip-based files as well as PDFs and other parsers that require random
>> >> access.
>> >>
>> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <ppgf...@gmail.com> wrote:
>> >> >
>> >> > First of all, thank you for Tika; it is a great project!
>> >> >
>> >> > Recently I ran Tika (version 2.7.0) to extract text from documents and 
>> >> > found that Java off-heap memory keeps growing until memory usage reaches 
>> >> > 100%, and then the process is killed by the oom-killer.
>> >> >
>> >> > I then used pmap to dump the data from memory (excluding the Java heap) 
>> >> > and found content like this:
>> >> >
>> >> > [Content_Types].xmlPK
>> >> >
>> >> > _rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK 
>> >> > word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK 
>> >> > word/header2.xmlPK word/header3.xmlPK word/footer3.xmlPK 
>> >> > word/header1.xmlPK
>> >> >
>> >> > word/footer1.xmlPK
>> >> >
>> >> > word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK 
>> >> > word/media/image3.pngPK word/media/image1.jpegPK 
>> >> > word/media/image2.jpegPK word/theme/theme1.xmlPK word/settings.xmlPK
>> >> >
>> >> > customXml/itemProps2.xmlPK
>> >> >
>> >> > customXml/item2.xmlPK docProps/custom.xmlPK t?92 
>> >> > customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK 
>> >> > customXml/itemProps1.xmlPK
>> >> >
>> >> >
>> >> >
>> >> > This is Office document content, so why is it off-heap? I suspect that 
>> >> > parsing Office documents causes a memory leak.
>> >> >
>> >> > Some more information: when I debug the code on my own Mac, using an 
>> >> > xlsx file as the input sample,
>> >> > calling tika.detect invokes the ZipArchiveInputStream constructor twice 
>> >> > and java.util.zip.Inflater#end() the same number of times;
>> >> > but calling tika.parseToString invokes the ZipArchiveInputStream 
>> >> > constructor three times (two for media type, one for parse), yet 
>> >> > java.util.zip.Inflater#end() only twice.
>> >> >
>> >> > Is the off-heap memory leak caused by this, given that Inflater uses 
>> >> > native code?
>> >> >
>> >> > I look forward to your reply. Thank you very much!
>> >> >
>> >> > my test code:
>> >> >
>> >> >     public static void extractByFacade(File file) throws Exception {
>> >> >         Tika tika = new Tika();
>> >> >         tika.setMaxStringLength(240);
>> >> >         org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
>> >> >
>> >> >         final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
>> >> >         final String mediaType = tika.detect(buffer, file.getName());
>> >> > //        System.out.println("mediaType->" + mediaType);
>> >> >
>> >> >         final String content = tika.parseToString(buffer);
>> >> > //        System.out.println("extractByFacade>>>>>>>>>>>>>>");
>> >> > //        System.out.println(content + "  " + content.length());
>> >> >     }
>> >> >
>> >> >
