Re: issue for indexing large blobs into databases
Hi Karl,

Thanks for your answer and the quick fix.

Best regards,
Olivier

> On 13 May 2019 at 19:46, Karl Wright wrote:
>
> I committed a fix for this hang to trunk.
>
> Karl
Re: issue for indexing large blobs into databases
I committed a fix for this hang to trunk.

Karl

On Mon, May 13, 2019 at 11:40 AM Karl Wright wrote:

> Thanks -- this shows a classic deadlock occurring, where the System.exit()
> call effectively blocks the thread that does it from exiting, and thus
> blocks the system from shutting down. This is a change of behavior in the
> JVM; not sure when it happened, but this didn't happen before. But it is
> easily addressed.
>
> Karl
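[Editor's sketch, not MCF code.] The deadlock Karl describes works like this: System.exit() runs the registered shutdown hooks and joins them, so if a hook in turn waits for the thread that called exit(), neither side can ever finish. The demo below models that cycle with plain threads and a timeout (no actual System.exit(), so it terminates); all names are illustrative.

```java
import java.util.concurrent.CountDownLatch;

public class ShutdownDeadlockDemo {

    // Returns true if the worker is still blocked after the timeout,
    // i.e. the join cycle never resolves.
    public static boolean demonstrateDeadlock() {
        CountDownLatch workerDone = new CountDownLatch(1);

        // Stands in for the agents shutdown hook: it waits for the worker
        // to finish (as AgentsDaemon.stopAgents() waits for worker threads).
        Thread hook = new Thread(() -> {
            try {
                workerDone.await();        // never satisfied: worker is blocked on us
            } catch (InterruptedException ignored) { }
        });

        // Stands in for the worker thread that calls System.exit():
        // exit() runs the shutdown hooks and joins them before the
        // calling thread can terminate.
        Thread worker = new Thread(() -> {
            try {
                hook.start();              // "exit()" starts the hooks...
                hook.join();               // ...and joins them -- blocks here forever
                workerDone.countDown();    // never reached
            } catch (InterruptedException ignored) { }
        });

        worker.start();
        try {
            worker.join(2000);             // give the cycle two seconds to resolve
        } catch (InterruptedException ignored) { }
        boolean deadlocked = worker.isAlive();
        worker.interrupt();                // clean up the demo threads
        hook.interrupt();
        return deadlocked;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + demonstrateDeadlock());
    }
}
```

Running it prints `deadlocked: true`: the worker never gets past the join, which is exactly the shape of the "Worker thread '0'" and "Shutdown thread" frames in the thread dump below.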
Re: issue for indexing large blobs into databases
Hi Karl,

Thank you for your mail.
I installed an ootb MCF 2.13 to have a clean installation. I launched the
File-based multiprocess example and ran the tests I mentioned previously.
I managed to see the error that you told me about:

    agents process ran out of memory - shutting down
    java.lang.OutOfMemoryError: Java heap space
        at com.mysql.jdbc.Buffer.ensureCapacity(Buffer.java:155)
        at com.mysql.jdbc.Buffer.writeBytesNoNull(Buffer.java:513)
        at com.mysql.jdbc.MysqlIO.readRemainingMultiPackets(MysqlIO.java:3116)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2971)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3414)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910)
        at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1405)
        at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
        at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
        at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
        at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
        at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
        at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)

But the mcf-agent process is still active; it was not shut down.
I took a new thread dump and obtained this:

    Full thread dump OpenJDK 64-Bit Server VM (25.212-b01 mixed mode):

    "Shutdown thread" #8 prio=5 os_prio=0 tid=0x7f8ccc014000 nid=0x436a in Object.wait() [0x7f8c2a9e8000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1248)
        - locked <0xe0930708> (a java.lang.Integer)
        at org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
        at org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
        at org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
        - locked <0xeab32808> (a java.util.HashMap)
        at org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
        at org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1540)
        - locked <0xeab24af8> (a java.util.ArrayList)
        - locked <0xeab24ca0> (a java.lang.Integer)
        at org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1718)

    "MySQL Statement Cancellation Timer" #2567 daemon prio=5 os_prio=0 tid=0x7f8c84061800 nid=0x4090 in Object.wait() [0x7f8c2aeed000]
       java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.util.TimerThread.mainLoop(Timer.java:526)
        - locked <0xebe79d20> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:505)

    "Connection pool reaper" #2565 daemon prio=5 os_prio=0 tid=0x7f8cf809d000 nid=0x408e waiting on condition [0x7f8c2aceb000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152)

    "Worker thread '0'" #66 daemon prio=5 os_prio=0 tid=0x7f8cfc209800 nid=0x35d8 in Object.wait() [0x7f8d0c6f4000]
       java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0xeab24b28> (a org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread)
        at java.lang.Thread.join(Thread.java:1326)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0xeaae4150> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:851)

    "Connection pool reaper" #13 daemon prio=5 os_prio=0 tid=0x7f8cfc04b000 nid=0x35a0 waiting on condition [0x7f8d2c47b000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152)

    "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x7f8d3c0b6000 nid=0x359c runnable [0x]
Re: issue for indexing large blobs into databases
The thread dump is not helpful; I see no stack traces for any threads that
identify code from MCF at all. That's new and has nothing to do with MCF;
it's probably a new Java "feature".

The OutOfMemoryError should, I've confirmed, be rethrown. The worker thread
code does in fact catch the error and terminate the process:

    catch (OutOfMemoryError e)
    {
      System.err.println("agents process ran out of memory - shutting down");
      e.printStackTrace(System.err);
      System.exit(-200);
    }

So that is what you should see happening. Are you providing the complete
stack trace from your out-of-memory error? That would help confirm the
picture. It had to have been printed from somewhere; it goes to standard
error, wherever that is.

Karl
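[Editor's sketch, not MCF code.] The guard Karl quotes is a top-level catch in the worker thread's run loop. The runnable version below keeps that shape but replaces System.exit(-200) with a flag so the example can complete; the simulated OutOfMemoryError stands in for the one thrown by the MySQL driver.

```java
public class OomGuardDemo {

    static volatile boolean shutdownRequested = false;

    // Mirrors the shape of the WorkerThread guard quoted above; the real
    // code calls System.exit(-200) where this sketch merely sets a flag.
    static void runWorker(Runnable body) {
        try {
            body.run();
        } catch (OutOfMemoryError e) {
            System.err.println("agents process ran out of memory - shutting down");
            e.printStackTrace(System.err);
            shutdownRequested = true;      // stands in for System.exit(-200)
        }
    }

    public static void main(String[] args) {
        // Simulate the driver allocation failing, as in the mysql Buffer trace.
        runWorker(() -> { throw new OutOfMemoryError("Java heap space"); });
        System.out.println("shutdownRequested = " + shutdownRequested);
    }
}
```

This is why the message and stack trace land on standard error: the guard prints them itself before asking the process to exit.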
Re: issue for indexing large blobs into databases
Hi Olivier,

Any out-of-memory error that makes it to the top level should cause the
agents process to shut itself down.

The reason for this is simple: in a multithreaded environment, with lots of
third-party jars, an out-of-memory condition in one thread is usually
indicative of an out-of-memory condition in other threads as well, and we
cannot count on that third-party software being rigorous about letting an
OutOfMemoryError be thrown to the top level. So we tend to get corrupted
results that are hard to recover from.

In this case, the exception is being thrown within the JDBCConnector, so it
ought to rethrow the OutOfMemoryError. Let me confirm that that is what the
code indicates.

Karl

On Tue, May 7, 2019 at 6:14 AM Olivier Tavard wrote:

> Hi Karl,
>
> Thank you for your help.
> Indeed I found an OoM error in the logs of the mcf-agent process:
>
>     java.lang.OutOfMemoryError: Java heap space
>         at com.mysql.jdbc.Buffer.<init>(Buffer.java:58)
>         at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1441)
>         at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816)
>         at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467)
>         at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510)
>         at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746)
>         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135)
>         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542)
>         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734)
>         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885)
>         at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392)
>
> But the MCF agent process was not killed; it is still active. Why is the
> process still active in this case? Since we know that we can get an OOM,
> couldn't we find an elegant way to notify users of this problem from the
> MCF admin UI, rather than just having the process endlessly active?
>
> As for the OOM, the workaround we decided to use is to increase the Xms
> and Xmx Java memory values in options.env.unix.
> I also attached the thread dump to this mail.
>
> Thanks,
> Best regards,
>
> Olivier
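[Editor's sketch.] A back-of-the-envelope check of why raising Xms/Xmx only postpones this failure, under the assumption (suggested by the MysqlIO.readRemainingMultiPackets frames in the traces) that the driver buffers whole rows in memory: the six 192 MB LONGBLOB rows from the test table already exceed a 1 GB agent heap on their own.

```java
public class HeapBudget {

    // Bytes the driver would hold if it buffers every large row at once.
    public static long bufferedBytes(long rowBytes, long rows) {
        return rowBytes * rows;
    }

    public static void main(String[] args) {
        long blob = 192L * 1024 * 1024;    // one 192 MB LONGBLOB row
        long heap = 1024L * 1024 * 1024;   // the agent's 1 GB heap (-Xmx1g)

        // One large row fits, but the six large rows together do not,
        // which matches the observation that a single-blob crawl succeeds.
        System.out.println(bufferedBytes(blob, 1) < heap);   // true
        System.out.println(bufferedBytes(blob, 6) > heap);   // true: ~1.15 GB
    }
}
```

Any fixed heap chosen this way just moves the ceiling; a table with a few more large rows hits the same wall.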
Re: issue for indexing large blobs into databases
It sounds like there might be an out-of-memory situation, although if that
is the case I would expect that the MCF agents process would just shut
itself down.

If the process is hung, it should be easy to get a thread dump. That would
be the first step.

Some thoughts as to what might be happening: there might be an out-of-memory
condition that is being silently eaten somewhere. MCF relies on its
connectors to use streams rather than load documents into memory, BUT we're
at the mercy of the JDBC drivers and Tika. So another experiment would be to
try to crawl your documents without the Tika extractor (just send them to
the null output connector) and see if that succeeds. If it does, then maybe
try the external Tika Service connector instead and see what happens then.

Karl

On Mon, May 6, 2019 at 5:56 AM Olivier Tavard wrote:

> Hi MCF community,
>
> We have some issues using the JDBC connector to index a database
> containing large LONGBLOBs, by which I mean files of more than 100 MB.
> To reproduce the issue, I created a simple database on MySQL 5.7 with a
> single table of 2 columns: id (int) and data (longblob). There are 8
> records in the table: two records with very small files (less than 5 KB)
> and 6 records with a large file of 192 MB.
>
> If I include only one big BLOB in my crawl, the crawl is successful.
> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id = 6
>
> But if I include 2 or more of these (with a WHERE ... IN condition, or if
> I select all the records of the table), the job stays in the running state
> but no document is processed.
> Seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7)
>
> There is no message in the logs (even with debug mode activated).
> The tests were done in a Docker cluster on dedicated servers (4 cores,
> 32 GB RAM each). Regarding MCF, the version is 2.12. The MCF agent has
> 1 GB of RAM (Xms and Xmx), the internal database is PostgreSQL 10.1, and
> we use Zookeeper-based synchronization.
> We have a dump of the database here:
> https://www.datafari.com/files/mcftest.tar.gz
>
> In other cases, we also encounter this error:
> Error: Unexpected jobqueue status - record id 1555190422690, expecting
> active status, saw 0
>
> Any idea what the problem could be, or at least why you think there is
> nothing in the debug log?
>
> Thank you,
> Best regards,
>
> Olivier
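[Editor's sketch, not MCF's JDBC connector code.] One driver-level way to keep large rows out of the heap: MySQL Connector/J streams results row by row when given a forward-only, read-only statement whose fetch size is Integer.MIN_VALUE, and the BLOB column can then be consumed through an InputStream instead of being materialized as a byte[]. The table and column names follow Olivier's test schema; whether this fits a given connector's transaction and pooling model would need checking.

```java
import java.io.IOException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingBlobRead {

    // Ask Connector/J to stream rows instead of buffering the whole
    // result set: forward-only + read-only + fetch size Integer.MIN_VALUE.
    public static PreparedStatement streamingStatement(Connection conn, String sql)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        ps.setFetchSize(Integer.MIN_VALUE);   // MySQL's row-by-row "streaming" mode
        return ps;
    }

    // Consume the "data" LONGBLOB column through a stream: constant memory
    // regardless of blob size (here we just count the bytes).
    public static long countBytes(ResultSet rs) throws SQLException, IOException {
        long total = 0;
        try (InputStream in = rs.getBinaryStream("data")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) total += n;
        }
        return total;
    }
}
```

Usage would look like `try (ResultSet rs = streamingStatement(conn, "SELECT id, data FROM data").executeQuery()) { while (rs.next()) countBytes(rs); }` against a live connection; only one streaming result set can be open per connection in this mode.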