Re: Memory Leak in 7.3 to 7.4

2018-08-06 Thread Tim Allison
+1 to Shawn's and Erick's points about isolating Tika in a separate JVM.

Yes, please do let us know: u...@tika.apache.org  We might be able to
help out, and you, in turn, can help the community figure out what's
going on; see e.g.: https://issues.apache.org/jira/browse/TIKA-2703
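
If you want to isolate Tika in its own JVM without writing a full client
right away, Tika's ForkParser is one option; it runs the actual parsing in a
forked JVM. A rough, untested sketch (the class name and single-file argument
handling are just illustrative):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParseSketch {
    public static void main(String[] args) throws Exception {
        // Parsing runs in a forked JVM, so an OOM or crash in Tika
        // does not take down the process that started it.
        ForkParser parser = new ForkParser(
                ForkParseSketch.class.getClassLoader(), new AutoDetectParser());
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println("Extracted " + handler.toString().length()
                    + " characters from " + args[0]);
        } finally {
            parser.close(); // shuts down the forked JVM(s)
        }
    }
}
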
On Sun, Aug 5, 2018 at 1:22 PM Shawn Heisey  wrote:
>
> On 8/2/2018 5:30 AM, Thomas Scheffler wrote:
> > my final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade
> > just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
> > usage after about 85 % of the indexing process and a manual trigger of the
> > garbage collector is about 60-70 MB (That low!!!)
> >
> > My problem now is that we have several setups that trigger this reliably,
> > but there is no simple test case that "fails" if Tika 1.17 or 1.18 is used.
> > I also do not know whether the error is inside Tika or inside the glue code
> > that makes Tika usable in Solr.
>
> If downgrading Tika fixes the issue, then it doesn't seem (to me) very
> likely that Solr's glue code for ERH has a problem. If it's not Solr's
> code that has the problem, there will be nothing we can do about it
> other than change the Tika library included with Solr.
>
> Before filing an issue, you should discuss this with the Tika project on
> their mailing list.  They'll want to make sure that they can fix the
> problem in a future version.  It might not be an actual memory leak ...
> it could just be that one of the documents you're trying to index is one
> that Tika requires a huge amount of memory to handle.  But it could be a
> memory leak.
>
> If you know which document is being worked on when it runs out of
> memory, can you try not including that document in your indexing, to see
> if it still has a problem?
>
> Please note that it is strongly recommended that you do not use the
> Extracting Request Handler in production.  Tika is prone to many
> problems, and those problems will generally affect Solr if Tika is being
> run inside Solr.  Because of this, it is recommended that you write a
> separate program using Tika that handles extracting information from
> documents and sending that data to Solr.  If that program crashes, Solr
> remains operational.
>
> There is already an issue to upgrade Tika to the latest version in Solr,
> but you've said that you tried 1.18 already with no change to the
> problem.  So whatever the problem is, it will need to be solved in 1.19
> or later.
>
> Thanks,
> Shawn
>


Re: Memory Leak in 7.3 to 7.4

2018-08-05 Thread Shawn Heisey

On 8/2/2018 5:30 AM, Thomas Scheffler wrote:

my final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade
just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
usage after about 85 % of the indexing process and a manual trigger of the
garbage collector is about 60-70 MB (That low!!!)

My problem now is that we have several setups that trigger this reliably, but
there is no simple test case that "fails" if Tika 1.17 or 1.18 is used. I also
do not know whether the error is inside Tika or inside the glue code that makes
Tika usable in Solr.


If downgrading Tika fixes the issue, then it doesn't seem (to me) very 
likely that Solr's glue code for ERH has a problem. If it's not Solr's 
code that has the problem, there will be nothing we can do about it 
other than change the Tika library included with Solr.


Before filing an issue, you should discuss this with the Tika project on 
their mailing list.  They'll want to make sure that they can fix the 
problem in a future version.  It might not be an actual memory leak ... 
it could just be that one of the documents you're trying to index is one 
that Tika requires a huge amount of memory to handle.  But it could be a 
memory leak.


If you know which document is being worked on when it runs out of 
memory, can you try not including that document in your indexing, to see 
if it still has a problem?


Please note that it is strongly recommended that you do not use the 
Extracting Request Handler in production.  Tika is prone to many 
problems, and those problems will generally affect Solr if Tika is being 
run inside Solr.  Because of this, it is recommended that you write a 
separate program using Tika that handles extracting information from 
documents and sending that data to Solr.  If that program crashes, Solr 
remains operational.
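
As a rough sketch of what such a separate program could look like (untested;
the Solr URL and the field names "id", "content" and "content_type" are
placeholders that would have to match your own collection and schema):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractAndIndex {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
             Stream<Path> walk = Files.walk(Paths.get(args[0]))) {
            List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path p : files) {
                try (InputStream in = Files.newInputStream(p)) {
                    // Extraction happens here, in the client JVM, not inside Solr.
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    Metadata meta = new Metadata();
                    parser.parse(in, handler, meta, new ParseContext());

                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", p.toString());
                    doc.addField("content", handler.toString());
                    doc.addField("content_type", meta.get(Metadata.CONTENT_TYPE));
                    solr.add(doc);
                } catch (Exception e) {
                    // If Tika chokes on one document, only this client notices; Solr keeps running.
                    System.err.println("Skipping " + p + ": " + e);
                }
            }
            solr.commit();
        }
    }
}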


There is already an issue to upgrade Tika to the latest version in Solr, 
but you've said that you tried 1.18 already with no change to the 
problem.  So whatever the problem is, it will need to be solved in 1.19 
or later.


Thanks,
Shawn



Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Vincenzo D'Amore
Does this script also save a memory dump of the JVM?

Ciao,
Vincenzo

--
mobile: 3498513251
skype: free.dev

> On 2 Aug 2018, at 17:53, Erick Erickson  wrote:
> 
> Thomas:
> 
> You've obviously done a lot of work to track this, but maybe you can
> do even more ;).
> 
> Here's a link to a program that uses Tika to parse docs _on the client_:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> If you take out all the DB and Solr parts, you're left with something
> that just parses docs with Tika. My idea here is to feed it your docs
> and see whether the noticeable memory differences between the versions
> of Tika show up there as well. If they do, a heap dump would help the
> Tika folks enormously in tracking this down.
> 
> And if there's no memory creep, that points toward the glue code in Solr.
> 
> I also have to add that this kind of thing is one of the reasons we
> generally recommend that production systems do not use
> ExtractingRequestHandler. There are other reasons outlined in the link
> above.
> 
> Best,
> Erick
> 
> On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
>  wrote:
>> Hi,
>> 
>> my final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade
>> just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
>> usage after about 85 % of the indexing process and a manual trigger of the
>> garbage collector is about 60-70 MB (That low!!!)
>>
>> My problem now is that we have several setups that trigger this reliably,
>> but there is no simple test case that "fails" if Tika 1.17 or 1.18 is used.
>> I also do not know whether the error is inside Tika or inside the glue code
>> that makes Tika usable in Solr.
>> 
>> Should I file an issue for this?
>> 
>> kind regards,
>> 
>> Thomas
>> 
>> 
>>> On 02.08.2018 at 12:06, Thomas Scheffler wrote:
>>> 
>>> Hi,
>>> 
>>> we noticed a memory leak in a rather small setup: 40,000 metadata documents
>>> with nearly as many files that have "literal.*" fields with them. While
>>> 7.2.1 brought some Tika issues (due to a beta version), the real problems
>>> started to appear with version 7.3.0 and are currently unresolved in 7.4.0.
>>> Memory consumption is through the roof: where previously a 512 MB heap was
>>> enough, now 6 GB isn't enough to index all files.
>>> I am now at a point where I can track this down to the libraries in
>>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
>>> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
>>> I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
>>> problem. I will next try to downgrade these single libraries back to 2.0.6
>>> and 1.16 to see if they are the source of the memory leak.
>>>
>>> In the meantime, I would like to know if anybody else has experienced the
>>> same problems.
>>> 
>>> kind regards,
>>> 
>>> Thomas
>> 
>> 


Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Erick Erickson
Thomas:

You've obviously done a lot of work to track this, but maybe you can
do even more ;).

Here's a link to a program that uses Tika to parse docs _on the client_:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

If you take out all the DB and Solr parts, you're left with something
that just parses docs with Tika. My idea here is to feed it your docs
and see whether the noticeable memory differences between the versions
of Tika show up there as well. If they do, a heap dump would help the
Tika folks enormously in tracking this down.

And if there's no memory creep, that points toward the glue code in Solr.
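
A bare-bones, untested sketch of that Tika-only loop (it simply walks a
directory passed as the first argument; no Solr or DB involved):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaOnlyParse {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
            files.filter(Files::isRegularFile).forEach(p -> {
                // Fresh handler and metadata per document; only Tika is exercised here.
                try (InputStream in = Files.newInputStream(p)) {
                    parser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
                } catch (Exception e) {
                    System.err.println("Failed on " + p + ": " + e);
                }
            });
        }
        // Compare this number (and a heap dump) between runs with Tika 1.16, 1.17 and 1.18.
        long usedMb = (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
        System.out.println("Done. Approximate used heap: " + usedMb + " MB");
    }
}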

I also have to add that this kind of thing is one of the reasons we
generally recommend that production systems do not use
ExtractingRequestHandler. There are other reasons outlined in the link
above.

Best,
Erick

On Thu, Aug 2, 2018 at 4:30 AM, Thomas Scheffler
 wrote:
> Hi,
>
> my final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade
> just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
> usage after about 85 % of the indexing process and a manual trigger of the
> garbage collector is about 60-70 MB (That low!!!)
>
> My problem now is that we have several setups that trigger this reliably,
> but there is no simple test case that "fails" if Tika 1.17 or 1.18 is used.
> I also do not know whether the error is inside Tika or inside the glue code
> that makes Tika usable in Solr.
>
> Should I file an issue for this?
>
> kind regards,
>
> Thomas
>
>
>> On 02.08.2018 at 12:06, Thomas Scheffler wrote:
>>
>> Hi,
>>
>> we noticed a memory leak in a rather small setup: 40,000 metadata documents
>> with nearly as many files that have "literal.*" fields with them. While
>> 7.2.1 brought some Tika issues (due to a beta version), the real problems
>> started to appear with version 7.3.0 and are currently unresolved in 7.4.0.
>> Memory consumption is through the roof: where previously a 512 MB heap was
>> enough, now 6 GB isn't enough to index all files.
>> I am now at a point where I can track this down to the libraries in
>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
>> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
>> I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
>> problem. I will next try to downgrade these single libraries back to 2.0.6
>> and 1.16 to see if they are the source of the memory leak.
>>
>> In the meantime, I would like to know if anybody else has experienced the
>> same problems.
>>
>> kind regards,
>>
>> Thomas
>
>


Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

my final verdict is that the upgrade to Tika 1.17 is the cause. If I downgrade
just the Tika libraries back to 1.16 and keep the rest of Solr 7.4.0, the heap
usage after about 85 % of the indexing process and a manual trigger of the
garbage collector is about 60-70 MB (That low!!!)

My problem now is that we have several setups that trigger this reliably, but
there is no simple test case that "fails" if Tika 1.17 or 1.18 is used. I also
do not know whether the error is inside Tika or inside the glue code that makes
Tika usable in Solr.

Should I file an issue for this?

kind regards,

Thomas


> On 02.08.2018 at 12:06, Thomas Scheffler wrote:
> 
> Hi,
> 
> we noticed a memory leak in a rather small setup: 40,000 metadata documents
> with nearly as many files that have "literal.*" fields with them. While
> 7.2.1 brought some Tika issues (due to a beta version), the real problems
> started to appear with version 7.3.0 and are currently unresolved in 7.4.0.
> Memory consumption is through the roof: where previously a 512 MB heap was
> enough, now 6 GB isn't enough to index all files.
> I am now at a point where I can track this down to the libraries in
> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
> I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
> problem. I will next try to downgrade these single libraries back to 2.0.6
> and 1.16 to see if they are the source of the memory leak.
>
> In the meantime, I would like to know if anybody else has experienced the
> same problems.
> 
> kind regards,
> 
> Thomas






Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

Solr ships with a script that handles OOM errors and produces a log file for
each occurrence, with content like this:

Running OOM killer script for process 9015 for Solr on port 28080
Killed process 9015

This script works ;-)

kind regards

Thomas



> On 02.08.2018 at 12:28, Vincenzo D'Amore wrote:
> 
> It's not clear whether you experienced an OOM error.
> 
> On Thu, Aug 2, 2018 at 12:06 PM Thomas Scheffler <
> thomas.scheff...@uni-jena.de> wrote:
> 
>> Hi,
>> 
>> we noticed a memory leak in a rather small setup: 40,000 metadata
>> documents with nearly as many files that have "literal.*" fields with them.
>> While 7.2.1 brought some Tika issues (due to a beta version), the real
>> problems started to appear with version 7.3.0 and are currently
>> unresolved in 7.4.0. Memory consumption is through the roof: where previously
>> a 512 MB heap was enough, now 6 GB isn't enough to index all files.
>> I am now at a point where I can track this down to the libraries in
>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
>> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
>> I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
>> problem. I will next try to downgrade these single libraries back to 2.0.6
>> and 1.16 to see if they are the source of the memory leak.
>> 
>> In the meantime, I would like to know if anybody else has experienced the
>> same problems.
>> 
>> kind regards,
>> 
>> Thomas
>> 
> 
> 
> --
> Vincenzo D'Amore






Re: Memory Leak in 7.3 to 7.4

2018-08-02 Thread Vincenzo D'Amore
It's not clear whether you experienced an OOM error.

In the meantime, if you haven't already added them, these options can be useful:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/store/solr-logs/dump.hprof

This is my GC_TUNE config for a 32 GB server with 16 GB reserved for the JVM
(-Xms16G -Xmx16G):

export GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/store/solr-logs/dump.hprof \
"


On Thu, Aug 2, 2018 at 12:06 PM Thomas Scheffler <
thomas.scheff...@uni-jena.de> wrote:

> Hi,
>
> we noticed a memory leak in a rather small setup: 40,000 metadata
> documents with nearly as many files that have "literal.*" fields with them.
> While 7.2.1 brought some Tika issues (due to a beta version), the real
> problems started to appear with version 7.3.0 and are currently
> unresolved in 7.4.0. Memory consumption is through the roof: where previously
> a 512 MB heap was enough, now 6 GB isn't enough to index all files.
> I am now at a point where I can track this down to the libraries in
> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
> shipped with 7.2.1, the problem disappears. As most files are PDF documents,
> I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
> problem. I will next try to downgrade these single libraries back to 2.0.6
> and 1.16 to see if they are the source of the memory leak.
>
> In the meantime, I would like to know if anybody else has experienced the
> same problems.
>
> kind regards,
>
> Thomas
>


-- 
Vincenzo D'Amore


Memory Leak in 7.3 to 7.4

2018-08-02 Thread Thomas Scheffler
Hi,

we noticed a memory leak in a rather small setup: 40,000 metadata documents
with nearly as many files that have "literal.*" fields with them. While 7.2.1
brought some Tika issues (due to a beta version), the real problems started to
appear with version 7.3.0 and are currently unresolved in 7.4.0. Memory
consumption is through the roof: where previously a 512 MB heap was enough, now
6 GB isn't enough to index all files.
I am now at a point where I can track this down to the libraries in
solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries
shipped with 7.2.1, the problem disappears. As most files are PDF documents, I
tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the
problem. I will next try to downgrade these single libraries back to 2.0.6 and
1.16 to see if they are the source of the memory leak.

In the meantime, I would like to know if anybody else has experienced the same
problems.

kind regards,

Thomas

