The thread dump is not helpful. I see no traces for any threads that identify code from MCF at all. That's new and has nothing to do with MCF; it's probably a new Java "feature".
The OutOfMemoryError should, I've confirmed, be rethrown. The Worker Thread
code does in fact catch the error and terminate the process:

    catch (OutOfMemoryError e)
    {
      System.err.println("agents process ran out of memory - shutting down");
      e.printStackTrace(System.err);
      System.exit(-200);
    }

So that is what you should see happening. Are you providing the complete
stack trace from your out-of-memory error? That would help confirm the
picture. It had to have been printed from somewhere; it goes to standard
error, wherever that is directed in your setup.

Karl

On Tue, May 7, 2019 at 6:42 AM Karl Wright <daddy...@gmail.com> wrote:

> Hi Olivier,
>
> Any out-of-memory error that makes it to the top level should cause the
> agents process to shut itself down.
>
> The reason for this is simple: in a multithreaded environment, with lots
> of third-party jars, an out-of-memory condition in one thread is usually
> indicative of an out-of-memory condition in other threads as well, and we
> cannot count on that third-party software being rigorous about letting an
> OutOfMemoryError be thrown to the top level. So we tend to get corrupted
> results that are hard to recover from.
>
> In this case, the exception is being thrown within the JDBCConnector, so
> it ought to rethrow the OutOfMemoryError. Let me confirm that is what the
> code indicates.
>
> Karl
>
>
> On Tue, May 7, 2019 at 6:14 AM Olivier Tavard <olivier.tav...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> Thank you for your help.
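The catch-and-exit above only protects threads whose run() methods MCF controls. For threads spawned by third-party code, a JVM-wide fallback can be sketched with Thread.setDefaultUncaughtExceptionHandler. This is a hypothetical illustration, not MCF's actual code: the error is thrown directly rather than by exhausting the heap, and the real exit call is left as a comment so the demo can run to completion.

```java
public class OomGuard {
    static volatile boolean sawOom = false;

    public static void main(String[] args) throws InterruptedException {
        // Last-resort handler for throwables escaping any thread's run()
        // method, including threads created by third-party libraries.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            if (e instanceof OutOfMemoryError) {
                System.err.println("thread " + t.getName()
                        + " ran out of memory - shutting down");
                sawOom = true;
                // A real agents process would call System.exit(-200) here.
            }
        });

        // Simulate an OOM escaping a worker thread; we construct the error
        // directly rather than actually exhausting the heap.
        Thread worker = new Thread(
                () -> { throw new OutOfMemoryError("simulated Java heap space"); },
                "simulated-worker");
        worker.start();
        worker.join();
        System.out.println("sawOom=" + sawOom);  // prints sawOom=true
    }
}
```

Because the handler is process-wide, it fires even for errors that no catch block in MCF's own code ever sees, which is exactly the gap Karl describes with third-party jars.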
>> Indeed I found an OOM error in the logs of the MCF agents process:
>>
>>     java.lang.OutOfMemoryError: Java heap space
>>         at com.mysql.jdbc.Buffer.<init>(Buffer.java:58)
>>         at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1441)
>>         at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
>>         at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
>>         at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
>>         at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
>>         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
>>         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
>>         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
>>         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
>>         at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)
>>
>> But the MCF agents process was not killed; it is still active. Why is the
>> process still active in this case? Since we know that we can get an OOM,
>> couldn't we find an elegant way to notify users of this problem from the
>> MCF admin UI, rather than just having the process endlessly active?
>>
>> As for the OOM, the workaround we decided to use is to increase the Xms
>> and Xmx Java memory values in options.env.unix.
>> I also attached the thread dump to this mail.
>>
>> Thanks,
>> Best regards,
>>
>> Olivier
>>
>> On May 6, 2019, at 12:55, Karl Wright <daddy...@gmail.com> wrote:
>>
>> It sounds like there might be an out-of-memory situation, although if
>> that is the case I would expect that the MCF agents process would just
>> shut itself down.
>>
>> If the process is hung, it should be easy to get a thread dump. That
>> would be the first step.
>>
>> Some thoughts as to what might be happening: there might be an
>> out-of-memory condition that is being silently eaten somewhere.
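The stack trace above points at com.mysql.jdbc.MysqlIO.readSingleRowSet: by default, MySQL Connector/J reads the entire result set into memory before returning it to the caller. One common mitigation, assuming driver properties can be passed through the JDBC connection URL configured in MCF (the host and database name below are illustrative), is to enable cursor-based fetching so rows are retrieved in batches:

```
jdbc:mysql://dbhost:3306/mcftest?useCursorFetch=true&defaultFetchSize=10
```

Alternatively, Connector/J streams rows one at a time when a statement is created with TYPE_FORWARD_ONLY/CONCUR_READ_ONLY and setFetchSize(Integer.MIN_VALUE). Note that neither setting helps when a single row holds a 192 MB LONGBLOB that the driver materializes as one buffer; that still requires heap headroom for the full value, which is consistent with the Buffer.&lt;init&gt; frame at the top of the trace.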
>> MCF relies on its connectors to use streams rather than load documents
>> into memory, BUT we're at the mercy of the JDBC drivers and Tika. So
>> another experiment would be to try to crawl your documents without the
>> Tika extractor (just send them to the null output connector) and see if
>> that succeeds. If it does, then maybe try the external Tika Service
>> connector instead and see what happens then.
>>
>> Karl
>>
>>
>> On Mon, May 6, 2019 at 5:56 AM Olivier Tavard <olivier.tav...@francelabs.com> wrote:
>>
>>> Hi MCF community,
>>>
>>> We have some issues with the JDBC connector when indexing a database
>>> with large LONGBLOBs in it, meaning files of more than 100 MB.
>>> To reproduce the issue, I created a simple database on MySQL 5.7 with
>>> only one table in it, with 2 columns: id (int) and data (longblob).
>>> There are 8 records in the table: two records with very small files
>>> (less than 5 KB) and 6 records with a large file of 192 MB.
>>>
>>> If I include only one big BLOB in my crawl, the crawl is successful.
>>> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id = 6
>>>
>>> But if I include two or more of them (with a WHERE ... IN condition, or
>>> if I select all the records of the table), the job stays in the running
>>> state but no document is processed.
>>> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7)
>>>
>>> There is no message in the logs (even with debug mode activated).
>>> The tests were done in a Docker cluster on dedicated servers (4 cores,
>>> 32 GB RAM each).
>>> Regarding MCF, the version is 2.12. The MCF agent has 1 GB of RAM (Xms
>>> and Xmx), the internal database is PostgreSQL 10.1, and we use
>>> Zookeeper-based synchronization.
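Karl's point about streams is the crux here: memory use should be bounded by a fixed buffer, not by document size. With JDBC that means reading BLOB columns via ResultSet.getBinaryStream(...) rather than getBytes(...), which would pull the whole 192 MB value onto the heap at once. A minimal, self-contained sketch of bounded-memory copying (the names are illustrative, not MCF code; a ByteArrayInputStream stands in for the BLOB stream):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // Copy a document in fixed-size chunks: peak memory is bounded by the
    // 64 KB buffer regardless of how large the document is.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[65536];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = new byte[1_000_000];  // stand-in for a large BLOB value
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(doc), sink);
        System.out.println("copied " + copied + " bytes");  // prints "copied 1000000 bytes"
    }
}
```

When a driver or a downstream stage (such as an extractor) buffers the whole value internally, this discipline in the connector cannot help, which is why the experiment of bypassing the in-process extractor is a useful way to isolate where the memory is going.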
>>> We have a dump of the database here:
>>> https://www.datafari.com/files/mcftest.tar.gz
>>>
>>> In other cases, we also encounter this error:
>>> Error: Unexpected jobqueue status - record id 1555190422690, expecting
>>> active status, saw 0
>>>
>>> Any idea what the problem could be? Or at least why, in your view,
>>> there is nothing in the debug log?
>>>
>>> Thank you,
>>> Best regards,
>>>
>>> Olivier
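Beyond raising Xms/Xmx in options.env.unix as Olivier describes, two standard HotSpot flags can make this failure mode visible instead of silent: -XX:+ExitOnOutOfMemoryError (available since JDK 8u92) forces the JVM to exit on the first OutOfMemoryError no matter where it is swallowed, and -XX:+HeapDumpOnOutOfMemoryError records a heap dump for later analysis. A sketch of the relevant lines (the values and the dump path are illustrative, and the exact layout of options.env.unix may differ):

```
-Xms2048m
-Xmx2048m
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/manifoldcf
```

With the exit flag in place, a supervisor (systemd, Docker restart policy) can restart the agents process cleanly rather than leaving it hung in the state described in this thread.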