Re: issue for indexing large blobs into databases

2019-05-07 Thread Karl Wright
The thread dump is not helpful.  I see no traces for any threads that
identify code from MCF at all.  That's new and has nothing to do with MCF;
it's probably a new Java "feature".

The OutOfMemoryError should, I've confirmed, be rethrown.  The Worker
Thread code does in fact catch the error and terminate the process:

>>
catch (OutOfMemoryError e)
{
  System.err.println("agents process ran out of memory - shutting down");
  e.printStackTrace(System.err);
  System.exit(-200);
}
<<

So that is what you should see happening.
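To make that guarantee hold even when the error surfaces on a thread MCF does not own (a third-party library thread, say), one common pattern is a JVM-wide default uncaught-exception handler. The sketch below illustrates the pattern only; it is not MCF's actual wiring, and `OomGuard` is a hypothetical name:

```java
// Sketch: shut the JVM down when an OutOfMemoryError escapes any thread.
// Illustrative pattern only; not MCF's actual implementation.
public class OomGuard {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (error instanceof OutOfMemoryError) {
                System.err.println("process ran out of memory - shutting down");
                error.printStackTrace(System.err);
                // halt() rather than exit(): shutdown hooks may themselves
                // need memory we no longer have.
                Runtime.getRuntime().halt(-200);
            }
        });
    }
}
```

The handler only fires for exceptions that actually escape a thread's run method, which is exactly the caveat in this thread: code that silently swallows the error defeats it.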

Are you providing the complete stack trace from your out-of-memory
error?  That would help confirm the picture.  It had to have been
printed somewhere; it goes to standard error, wherever that happens to be
directed in your setup.
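One quick way to check is to make sure the agents process's stderr is actually captured to a file, then search it for the marker. The script and file names below are illustrative, not MCF-prescribed, and the grep step is simulated against a stand-in log:

```shell
# When launching the agents process, capture stderr, e.g.:
#   ./start-agents.sh 2>> logs/agents-stderr.log
# Then search the capture for the OOM marker. Simulated with a stand-in file:
printf 'INFO agents starting\njava.lang.OutOfMemoryError: Java heap space\n' > agents-stderr.log
grep -n 'OutOfMemoryError' agents-stderr.log
# → 2:java.lang.OutOfMemoryError: Java heap space
```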

Karl


On Tue, May 7, 2019 at 6:42 AM Karl Wright  wrote:

> Hi Olivier,
>
> Any out-of-memory exception that makes it to the top level should cause
> the agents process to shut itself down.
>
> The reason for this is simple: in a multithread environment, with lots of
> third-party jars, an out-of-memory in one thread is usually indicative of
> an out-of-memory in other threads as well, and we cannot count on that
> third-party software being rigorous about letting an OutOfMemoryError
> be thrown to the top level.  So we tend to get corrupted results that are
> hard to recover from.
>
> In this case, the exception is being thrown within the JDBCConnector, so
> it ought to rethrow the OutOfMemoryError.  Let me confirm that is what
> the code indicates.
>
> Karl
>
>
> On Tue, May 7, 2019 at 6:14 AM Olivier Tavard <
> olivier.tav...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> Thank you for your help.
>> Indeed I found an OOM error in the logs of the MCF agent process:
>> java.lang.OutOfMemoryError: Java heap space
>> at com.mysql.jdbc.Buffer.&lt;init&gt;(Buffer.java:58)
>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1441)
>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
>> at
>> com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
>> at
>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
>> at
>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
>> at
>> org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)
>>
>>
>> But the MCF agents process was not killed; it is still active. Why is the
>> process still active in this case? Since we know that we can get an OOM,
>> couldn't we find an elegant way to notify users of this problem from the
>> MCF admin UI, rather than just leaving the process endlessly active?
>>
>> As for the OOM, the workaround we decided to use is to increase the
>> Xms and Xmx Java memory values in options.env.unix.
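For reference, the heap settings live as one JVM option per line in options.env.unix. A sketch of the relevant lines (values illustrative; the exact file contents vary by MCF version and deployment example):

```
-Xms1024m
-Xmx1024m
```

Note that raising the heap only moves the threshold at which a 192 MB blob buffered in memory causes trouble; it does not remove the underlying buffering.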
>> I also attached the thread dump to this mail.
>>
>> Thanks,
>> Best regards,
>>
>> Olivier
>>
>> On 6 May 2019 at 12:55, Karl Wright  wrote:
>>
>> It sounds like there might be an out-of-memory situation, although if
>> that is the case I would expect that the MCF agents process would just shut
>> itself down.
>>
>> If the process is hung, it should be easy to get a thread dump.  That
>> would be the first step.
>>
>> Some thoughts as to what might be happening: there might be an
>> out-of-memory condition that is being silently eaten somewhere.  MCF relies
>> on its connectors to use streams rather than load documents into memory,
>> BUT we're at the mercy of the JDBC drivers and Tika.  So another experiment
>> would be to try to crawl your documents without the tika extractor (just
>> send them to the null output connector) and see if that succeeds.  If it
>> does, then maybe try the external Tika Service connector instead and see
>> what happens then.
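On the JDBC-driver side specifically: the `readSingleRowSet` frame in stack traces like the one above reflects MySQL Connector/J's default behavior of buffering each result set fully in memory. The driver can be asked to stream rows instead, and blob columns can be drained through a stream rather than materialized. The sketch below shows the pattern under those assumptions; the column name `data` matches Olivier's test schema, and `drain` is a hypothetical helper, not MCF code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobStreamSketch {
    /** Drain a blob stream in fixed-size chunks, returning the byte count. */
    public static long drain(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n; // never hold the whole blob in memory
        }
        return total;
    }

    /** How a caller might read LONGBLOB rows without buffering whole result sets.
     *  NOTE: Connector/J streaming requires the statement to be created with
     *  ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. */
    public static void streamRows(PreparedStatement ps) throws Exception {
        ps.setFetchSize(Integer.MIN_VALUE); // MySQL-specific row-streaming hint
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                try (InputStream blob = rs.getBinaryStream("data")) {
                    System.out.println("id=" + rs.getInt("id")
                            + " bytes=" + drain(blob));
                }
            }
        }
    }
}
```

Whether the connector can use this depends on how it builds its statements, but it is a cheap experiment to distinguish "driver buffers the row" from "MCF buffers the document".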
>>
>> Karl
>>
>>
>> On Mon, May 6, 2019 at 5:56 AM Olivier Tavard <
>> olivier.tav...@francelabs.com> wrote:
>>
>>> Hi MCF community,
>>>
>>> We have some issues with the JDBC connector when indexing a database
>>> containing large LONGBLOBs, meaning files of more than 100 MB.
>>> To reproduce the issue, I created a simple database on MySQL 5.7 with
>>> only one table in it, with 2 columns: id (int) and data (longblob).
>>> There are 8 records in the table: 2 records with very small files of
>>> less than 5 KB, and 6 records each containing a large file of 192 MB.
>>>
>>> If I include in my crawl only one big BLOB, the crawl is successful.
>>> seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id = 6
>>>
>>> But if I include more than one of these (with a WHERE IN condition, or if
>>> I select all the records of the table), the job stays in the running state
>>> but no document is processed:
>>> seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7)
>>>
>>> There is no message in the logs (even with debug mode activated).
>>> The tests were done in a Docker cluster on dedicated servers (4 cores, 32
>>> GB RAM each).
>>> Regarding MCF, the version is 2.12. The MCF agent has 1 GB of RAM (Xms
>>> and Xmx), the internal database is PostgreSQL 10.1, and we use
>>> Zookeeper-based synchronization.
>>> We have a dump of the database here:
>>> https://www.datafari.com/files/mcftest.tar.gz
>>>
>>> In other cases, we also encounter this error:
>>> Error: Unexpected jobqueue status - record id 1555190422690, expecting
>>> active status, saw 0
>>>
>>> Any idea what the problem could be? Or at least why there is nothing in
>>> the debug log?
>>>
>>> Thank you,
>>> Best regards,
>>>
>>> Olivier
>>>