Re: issue for indexing large blobs into databases
Hi Karl,

Thanks for your answer and the quick fix.

Best regards,
Olivier

> On 13 May 2019 at 19:46, Karl Wright wrote:
>
> I committed a fix for this hang to trunk.
>
> Karl
Re: issue for indexing large blobs into databases
I committed a fix for this hang to trunk.

Karl

On Mon, May 13, 2019 at 11:40 AM Karl Wright wrote:

> Thanks -- this shows a classic deadlock occurring, where the System.exit()
> call effectively blocks the thread that does it from exiting, and thus
> blocks the system from shutting down. This is a change of behavior in the
> JVM; not sure when it happened, but this didn't happen before. But it is
> easily addressed.
>
> Karl
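[Editor's sketch, not MCF code.] The deadlock Karl describes works like this: System.exit() runs the registered shutdown hooks and joins them, so if a hook in turn waits for the thread that called exit(), neither side can ever finish. The demo below models that cycle with plain threads and a timeout (no actual System.exit(), so it terminates); all names are illustrative.

```java
import java.util.concurrent.CountDownLatch;

public class ShutdownDeadlockDemo {

    // Returns true if the worker is still blocked after the timeout,
    // i.e. the join cycle never resolves.
    public static boolean demonstrateDeadlock() {
        CountDownLatch workerDone = new CountDownLatch(1);

        // Stands in for the agents shutdown hook: it waits for the worker
        // to finish (as AgentsDaemon.stopAgents() waits for worker threads).
        Thread hook = new Thread(() -> {
            try {
                workerDone.await();        // never satisfied: worker is blocked on us
            } catch (InterruptedException ignored) { }
        });

        // Stands in for the worker thread that calls System.exit():
        // exit() runs the shutdown hooks and joins them before the
        // calling thread can terminate.
        Thread worker = new Thread(() -> {
            try {
                hook.start();              // "exit()" starts the hooks...
                hook.join();               // ...and joins them -- blocks here forever
                workerDone.countDown();    // never reached
            } catch (InterruptedException ignored) { }
        });

        worker.start();
        try {
            worker.join(2000);             // give the cycle two seconds to resolve
        } catch (InterruptedException ignored) { }
        boolean deadlocked = worker.isAlive();
        worker.interrupt();                // clean up the demo threads
        hook.interrupt();
        return deadlocked;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + demonstrateDeadlock());
    }
}
```

Running it prints `deadlocked: true`: the worker never gets past the join, which is exactly the shape of the "Worker thread '0'" and "Shutdown thread" frames in the thread dump below.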
Re: issue for indexing large blobs into databases
Hi Karl,

Thank you for your mail.
I installed an ootb MCF 2.13 to have a clean installation. I launched the
File-based multiprocess example and ran the tests I mentioned previously.
I managed to see the error that you told me about:

    agents process ran out of memory - shutting down
    java.lang.OutOfMemoryError: Java heap space
        at com.mysql.jdbc.Buffer.ensureCapacity(Buffer.java:155)
        at com.mysql.jdbc.Buffer.writeBytesNoNull(Buffer.java:513)
        at com.mysql.jdbc.MysqlIO.readRemainingMultiPackets(MysqlIO.java:3116)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2971)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3414)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910)
        at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1405)
        at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
        at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
        at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
        at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
        at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)

But the mcf-agent process is still active; it was not shut down.
I took a new thread dump and obtained this:

    Full thread dump OpenJDK 64-Bit Server VM (25.212-b01 mixed mode):

    "Shutdown thread" #8 prio=5 os_prio=0 tid=0x7f8ccc014000 nid=0x436a in Object.wait() [0x7f8c2a9e8000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1248)
        - locked <0xe0930708> (a java.lang.Integer)
        at org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
        at org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
        at org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
        - locked <0xeab32808> (a java.util.HashMap)
        at org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
        at org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1540)
        - locked <0xeab24af8> (a java.util.ArrayList)
        - locked <0xeab24ca0> (a java.lang.Integer)
        at org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1718)

    "MySQL Statement Cancellation Timer" #2567 daemon prio=5 os_prio=0 tid=0x7f8c84061800 nid=0x4090 in Object.wait() [0x7f8c2aeed000]
       java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.util.TimerThread.mainLoop(Timer.java:526)
        - locked <0xebe79d20> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:505)

    "Connection pool reaper" #2565 daemon prio=5 os_prio=0 tid=0x7f8cf809d000 nid=0x408e waiting on condition [0x7f8c2aceb000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152)

    "Worker thread '0'" #66 daemon prio=5 os_prio=0 tid=0x7f8cfc209800 nid=0x35d8 in Object.wait() [0x7f8d0c6f4000]
       java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0xeab24b28> (a org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread)
        at java.lang.Thread.join(Thread.java:1326)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0xeaae4150> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:851)

    "Connection pool reaper" #13 daemon prio=5 os_prio=0 tid=0x7f8cfc04b000 nid=0x35a0 waiting on condition [0x7f8d2c47b000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152)

    "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x7f8d3c0b6000 nid=0x359c runnable [0x]
Re: issue for indexing large blobs into databases
The thread dump is not helpful; I see no stack traces for any threads that
identify code from MCF at all. That's new and has nothing to do with MCF;
it's probably a new Java "feature".

The OutOfMemoryError should, I've confirmed, be rethrown. The worker thread
code does in fact catch the error and terminate the process:

    catch (OutOfMemoryError e)
    {
      System.err.println("agents process ran out of memory - shutting down");
      e.printStackTrace(System.err);
      System.exit(-200);
    }

So that is what you should see happening. Are you providing the complete
stack trace from your out-of-memory error? That would help confirm the
picture. It had to have been printed from somewhere; it goes to standard
error, wherever that is.

Karl
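[Editor's sketch, not MCF code.] The guard Karl quotes is a top-level catch in the worker thread's run loop. The runnable version below keeps that shape but replaces System.exit(-200) with a flag so the example can complete; the simulated OutOfMemoryError stands in for the one thrown by the MySQL driver.

```java
public class OomGuardDemo {

    static volatile boolean shutdownRequested = false;

    // Mirrors the shape of the WorkerThread guard quoted above; the real
    // code calls System.exit(-200) where this sketch merely sets a flag.
    static void runWorker(Runnable body) {
        try {
            body.run();
        } catch (OutOfMemoryError e) {
            System.err.println("agents process ran out of memory - shutting down");
            e.printStackTrace(System.err);
            shutdownRequested = true;      // stands in for System.exit(-200)
        }
    }

    public static void main(String[] args) {
        // Simulate the driver allocation failing, as in the mysql Buffer trace.
        runWorker(() -> { throw new OutOfMemoryError("Java heap space"); });
        System.out.println("shutdownRequested = " + shutdownRequested);
    }
}
```

This is why the message and stack trace land on standard error: the guard prints them itself before asking the process to exit.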
Re: issue for indexing large blobs into databases
Hi Olivier,

Any out-of-memory error that makes it to the top level should cause the
agents process to shut itself down.

The reason for this is simple: in a multithreaded environment, with lots of
third-party jars, an out-of-memory condition in one thread is usually
indicative of an out-of-memory condition in other threads as well, and we
cannot count on that third-party software being rigorous about letting an
OutOfMemoryError be thrown to the top level. So we tend to get corrupted
results that are hard to recover from.

In this case, the exception is being thrown within the JDBCConnector, so it
ought to rethrow the OutOfMemoryError. Let me confirm that that is what the
code indicates.

Karl

On Tue, May 7, 2019 at 6:14 AM Olivier Tavard wrote:

> Hi Karl,
>
> Thank you for your help.
> Indeed I found an OoM error in the logs of the mcf-agent process:
>
>     java.lang.OutOfMemoryError: Java heap space
>         at com.mysql.jdbc.Buffer.<init>(Buffer.java:58)
>         at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1441)
>         at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
>         at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
>         at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
>         at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
>         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
>         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
>         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
>         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
>         at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)
>
> But the MCF agent process was not killed; it is still active. Why is the
> process still active in this case? Since we know that we can get an OOM,
> couldn't we find an elegant way to notify users of this problem from the
> MCF admin UI, rather than just having the process endlessly active?
>
> As for the OOM, the workaround we decided to use is to increase the Xms
> and Xmx Java memory values in options.env.unix.
> I also attached the thread dump to this mail.
>
> Thanks,
> Best regards,
>
> Olivier
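[Editor's sketch.] A back-of-the-envelope check of why raising Xms/Xmx only postpones this failure, under the assumption (suggested by the MysqlIO.readRemainingMultiPackets frames in the traces) that the driver buffers whole rows in memory: the six 192 MB LONGBLOB rows from the test table already exceed a 1 GB agent heap on their own.

```java
public class HeapBudget {

    // Bytes the driver would hold if it buffers every large row at once.
    public static long bufferedBytes(long rowBytes, long rows) {
        return rowBytes * rows;
    }

    public static void main(String[] args) {
        long blob = 192L * 1024 * 1024;    // one 192 MB LONGBLOB row
        long heap = 1024L * 1024 * 1024;   // the agent's 1 GB heap (-Xmx1g)

        // One large row fits, but the six large rows together do not,
        // which matches the observation that a single-blob crawl succeeds.
        System.out.println(bufferedBytes(blob, 1) < heap);   // true
        System.out.println(bufferedBytes(blob, 6) > heap);   // true: ~1.15 GB
    }
}
```

Any fixed heap chosen this way just moves the ceiling; a table with a few more large rows hits the same wall.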
Re: issue for indexing large blobs into databases
It sounds like there might be an out-of-memory situation, although if that
is the case I would expect that the MCF agents process would just shut
itself down.

If the process is hung, it should be easy to get a thread dump. That would
be the first step.

Some thoughts as to what might be happening: there might be an out-of-memory
condition that is being silently eaten somewhere. MCF relies on its
connectors to use streams rather than load documents into memory, BUT we're
at the mercy of the JDBC drivers and Tika. So another experiment would be to
try to crawl your documents without the Tika extractor (just send them to
the null output connector) and see if that succeeds. If it does, then maybe
try the external Tika Service connector instead and see what happens then.

Karl

On Mon, May 6, 2019 at 5:56 AM Olivier Tavard wrote:

> Hi MCF community,
>
> We have some issues using the JDBC connector to index a database
> containing large LONGBLOBs, by which I mean files of more than 100 MB.
> To reproduce the issue, I created a simple database on MySQL 5.7 with a
> single table of 2 columns: id (int) and data (longblob). There are 8
> records in the table: two records with very small files (less than 5 KB)
> and 6 records with a large file of 192 MB.
>
> If I include only one big BLOB in my crawl, the crawl is successful.
> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id = 6
>
> But if I include 2 or more of these (with a WHERE ... IN condition, or if
> I select all the records of the table), the job stays in the running state
> but no document is processed.
> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7)
>
> There is no message in the logs (even with debug mode activated).
> The tests were done in a Docker cluster on dedicated servers (4 cores,
> 32 GB RAM each). Regarding MCF, the version is 2.12. The MCF agent has
> 1 GB of RAM (Xms and Xmx), the internal database is PostgreSQL 10.1, and
> we use Zookeeper-based synchronization.
> We have a dump of the database here:
> https://www.datafari.com/files/mcftest.tar.gz
>
> In other cases, we also encounter this error:
> Error: Unexpected jobqueue status - record id 1555190422690, expecting
> active status, saw 0
>
> Any idea what the problem could be, or at least why you think there is
> nothing in the debug log?
>
> Thank you,
> Best regards,
>
> Olivier
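[Editor's sketch, not MCF's JDBC connector code.] One driver-level way to keep large rows out of the heap: MySQL Connector/J streams results row by row when given a forward-only, read-only statement whose fetch size is Integer.MIN_VALUE, and the BLOB column can then be consumed through an InputStream instead of being materialized as a byte[]. The table and column names follow Olivier's test schema; whether this fits a given connector's transaction and pooling model would need checking.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingBlobRead {

    // Ask Connector/J to stream rows instead of buffering the whole
    // result set: forward-only + read-only + fetch size Integer.MIN_VALUE.
    public static PreparedStatement streamingStatement(Connection conn, String sql)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        ps.setFetchSize(Integer.MIN_VALUE);   // MySQL's row-by-row "streaming" mode
        return ps;
    }

    // Consume the "data" LONGBLOB column through a stream: constant memory
    // regardless of blob size (here we just count the bytes).
    public static long countBytes(ResultSet rs) throws SQLException, IOException {
        long total = 0;
        try (InputStream in = rs.getBinaryStream("data")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) total += n;
        }
        return total;
    }
}
```

Usage would look like `try (ResultSet rs = streamingStatement(conn, "SELECT id, data FROM data").executeQuery()) { while (rs.next()) countBytes(rs); }` against a live connection; only one streaming result set can be open per connection in this mode.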