Hi Karl, Thanks for your answer and the quick fix.
Best regards, Olivier > On 13 May 2019 at 19:46, Karl Wright <daddy...@gmail.com> wrote: > > I committed a fix for this hang to trunk. > Karl > > > On Mon, May 13, 2019 at 11:40 AM Karl Wright <daddy...@gmail.com> wrote: > >> Thanks -- this shows a classic deadlock occurring, where the System.exit() >> call effectively blocks the thread that does it from exiting, and thus >> blocks the system from shutting down. This is a change of behavior in the >> JVM; not sure when it happened, but this didn't happen before. But it is >> easily addressed. >> >> Karl >> >> >> On Mon, May 13, 2019 at 10:58 AM Olivier Tavard < >> olivier.tav...@francelabs.com> wrote: >> >>> Hi Karl, >>> >>> Thank you for your mail. >>> I installed an ootb MCF 2.13 to have a clean installation. I launched the >>> File-based multiprocess example and I did the previous tests I mentioned. >>> I managed to see the error that you told me about: >>> >>> agents process ran out of memory - shutting down >>> java.lang.OutOfMemoryError: Java heap space >>> at com.mysql.jdbc.Buffer.ensureCapacity(Buffer.java:155) >>> at com.mysql.jdbc.Buffer.writeBytesNoNull(Buffer.java:513) >>> at >>> com.mysql.jdbc.MysqlIO.readRemainingMultiPackets(MysqlIO.java:3116) >>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2971) >>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2871) >>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3414) >>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910) >>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1405) >>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816) >>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467) >>> at >>> com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510) >>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746) >>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135) >>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542) >>> at >>>
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734) >>> at >>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885) >>> at >>> org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392) >>> >>> But the mcf-agent process is still active, it was not shut down. >>> I did a new thread dump, I obtained this : >>> Full thread dump OpenJDK 64-Bit Server VM (25.212-b01 mixed mode): >>> >>> "Shutdown thread" #8 prio=5 os_prio=0 tid=0x00007f8ccc014000 nid=0x436a >>> in Object.wait() [0x00007f8c2a9e8000] >>> java.lang.Thread.State: TIMED_WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at >>> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1248) >>> - locked <0x00000000e0930708> (a java.lang.Integer) >>> at >>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617) >>> at >>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249) >>> at >>> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168) >>> - locked <0x00000000eab32808> (a java.util.HashMap) >>> at >>> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395) >>> at >>> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1540) >>> - locked <0x00000000eab24af8> (a java.util.ArrayList) >>> - locked <0x00000000eab24ca0> (a java.lang.Integer) >>> at >>> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1718) >>> >>> "MySQL Statement Cancellation Timer" #2567 daemon prio=5 os_prio=0 >>> tid=0x00007f8c84061800 nid=0x4090 in Object.wait() [0x00007f8c2aeed000] >>> java.lang.Thread.State: WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at java.lang.Object.wait(Object.java:502) >>> at java.util.TimerThread.mainLoop(Timer.java:526) >>> - locked <0x00000000ebe79d20> (a 
java.util.TaskQueue) >>> at java.util.TimerThread.run(Timer.java:505) >>> >>> "Connection pool reaper" #2565 daemon prio=5 os_prio=0 >>> tid=0x00007f8cf809d000 nid=0x408e waiting on condition [0x00007f8c2aceb000] >>> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> at java.lang.Thread.sleep(Native Method) >>> at >>> org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152) >>> >>> "Worker thread '0'" #66 daemon prio=5 os_prio=0 tid=0x00007f8cfc209800 >>> nid=0x35d8 in Object.wait() [0x00007f8d0c6f4000] >>> java.lang.Thread.State: WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at java.lang.Thread.join(Thread.java:1252) >>> - locked <0x00000000eab24b28> (a >>> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread) >>> at java.lang.Thread.join(Thread.java:1326) >>> at >>> java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107) >>> at >>> java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46) >>> at java.lang.Shutdown.runHooks(Shutdown.java:123) >>> at java.lang.Shutdown.sequence(Shutdown.java:167) >>> at java.lang.Shutdown.exit(Shutdown.java:212) >>> - locked <0x00000000eaae4150> (a java.lang.Class for java.lang.Shutdown) >>> at java.lang.Runtime.exit(Runtime.java:109) >>> at java.lang.System.exit(System.java:971) >>> at >>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:851) >>> >>> "Connection pool reaper" #13 daemon prio=5 os_prio=0 >>> tid=0x00007f8cfc04b000 nid=0x35a0 waiting on condition [0x00007f8d2c47b000] >>> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> at java.lang.Thread.sleep(Native Method) >>> at >>> org.apache.manifoldcf.core.jdbcpool.ConnectionPoolManager$ConnectionCloserThread.run(ConnectionPoolManager.java:152) >>> >>> "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007f8d3c0b6000 >>> nid=0x359c runnable [0x0000000000000000] >>> java.lang.Thread.State: RUNNABLE >>> >>> "C1 
CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f8d3c0b3000 >>> nid=0x359b waiting on condition [0x0000000000000000] >>> java.lang.Thread.State: RUNNABLE >>> >>> "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f8d3c0b0800 >>> nid=0x359a waiting on condition [0x0000000000000000] >>> java.lang.Thread.State: RUNNABLE >>> >>> "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f8d3c0af000 >>> nid=0x3599 waiting on condition [0x0000000000000000] >>> java.lang.Thread.State: RUNNABLE >>> >>> "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f8d3c07b800 nid=0x3598 >>> in Object.wait() [0x00007f8d410d3000] >>> java.lang.Thread.State: WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144) >>> - locked <0x00000000eaaf2038> (a java.lang.ref.ReferenceQueue$Lock) >>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165) >>> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216) >>> >>> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f8d3c079000 >>> nid=0x3597 in Object.wait() [0x00007f8d411d4000] >>> java.lang.Thread.State: WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at java.lang.Object.wait(Object.java:502) >>> at java.lang.ref.Reference.tryHandlePending(Reference.java:191) >>> - locked <0x00000000eaaf21f0> (a java.lang.ref.Reference$Lock) >>> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153) >>> >>> "main" #1 prio=5 os_prio=0 tid=0x00007f8d3c00b800 nid=0x3595 in >>> Object.wait() [0x00007f8d45902000] >>> java.lang.Thread.State: TIMED_WAITING (on object monitor) >>> at java.lang.Object.wait(Native Method) >>> at >>> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1248) >>> - locked <0x00000000e26ec910> (a java.lang.Integer) >>> at >>> org.apache.manifoldcf.agents.system.AgentsDaemon.runAgents(AgentsDaemon.java:115) >>> at 
org.apache.manifoldcf.agents.AgentRun.doExecute(AgentRun.java:64) >>> at >>> org.apache.manifoldcf.agents.BaseAgentsInitializationCommand.execute(BaseAgentsInitializationCommand.java:37) >>> at org.apache.manifoldcf.agents.AgentRun.main(AgentRun.java:93) >>> >>> "VM Thread" os_prio=0 tid=0x00007f8d3c06f800 nid=0x3596 runnable >>> >>> "VM Periodic Task Thread" os_prio=0 tid=0x00007f8d3c0b9000 nid=0x359d >>> waiting on condition >>> >>> JNI global references: 13 >>> >>> Heap >>> def new generation total 157248K, used 39902K [0x00000000e0000000, >>> 0x00000000eaaa0000, 0x00000000eaaa0000) >>> eden space 139776K, 28% used [0x00000000e0000000, 0x00000000e26f7a80, >>> 0x00000000e8880000) >>> from space 17472K, 0% used [0x00000000e9990000, 0x00000000e9990000, >>> 0x00000000eaaa0000) >>> to space 17472K, 0% used [0x00000000e8880000, 0x00000000e8880000, >>> 0x00000000e9990000) >>> tenured generation total 349568K, used 173322K [0x00000000eaaa0000, >>> 0x0000000100000000, 0x0000000100000000) >>> the space 349568K, 49% used [0x00000000eaaa0000, 0x00000000f53e2ac8, >>> 0x00000000f53e2c00, 0x0000000100000000) >>> Metaspace used 16548K, capacity 16750K, committed 17024K, reserved >>> 1064960K >>> class space used 1806K, capacity 1891K, committed 1920K, reserved >>> 1048576K >>> >>> Best regards, >>> >>> Olivier >>> >>>> On 7 May 2019 at 13:02, Karl Wright <daddy...@gmail.com> wrote: >>>> >>>> The thread dump is not helpful. I see no traces for any threads that >>>> identify code from MCF at all. That's new and has nothing to do with >>> MCF; >>>> it's probably a new Java "feature". >>>> >>>> The OutOfMemoryException should, I've confirmed, be rethrown. 
The >>> Worker >>>> Thread code does in fact catch the exception and terminate the process: >>>> >>>>>>>>>> >>>> catch (OutOfMemoryError e) >>>> { >>>> System.err.println("agents process ran out of memory - shutting >>>> down"); >>>> e.printStackTrace(System.err); >>>> System.exit(-200); >>>> } >>>> <<<<<< >>>> >>>> So that is what you should see happening. >>>> >>>> Are you providing the complete stack trace from your out-of-memory >>>> exception? That would help confirm the picture. It had to have been >>>> printed from somewhere; it goes to standard error, whatever that is. >>>> >>>> Karl >>>> >>>> >>>> On Tue, May 7, 2019 at 6:42 AM Karl Wright <daddy...@gmail.com> wrote: >>>> >>>>> Hi Olivier, >>>>> >>>>> Any out-of-memory exception that makes it to the top level should cause >>>>> the agents process to shut itself down. >>>>> >>>>> The reason for this is simple: in a multithread environment, with lots >>> of >>>>> third-party jars, an out-of-memory in one thread is usually indicative >>> of >>>>> an out-of-memory in other threads as well, and we cannot count on that >>>>> third-party software being rigorous about letting an >>> OutOfMemoryException >>>>> be thrown to the top level. So we tend to get corrupted results that >>> are >>>>> hard to recover from. >>>>> >>>>> In this case, the exception is being thrown within the JDBCConnector, >>> so >>>>> it ought to rethrow the OutOfMemoryException. Let me confirm that is >>> what >>>>> the code indicates. >>>>> >>>>> Karl >>>>> >>>>> >>>>> On Tue, May 7, 2019 at 6:14 AM Olivier Tavard < >>>>> olivier.tav...@francelabs.com> wrote: >>>>> >>>>>> Hi Karl, >>>>>> >>>>>> Thank you for your help. 
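The catch block Karl quotes above ends with System.exit(-200) invoked from a worker thread. As a minimal standalone sketch (illustrative class and thread names, not MCF's actual code), this shows the deadlock mechanism he describes: System.exit() runs the registered shutdown hooks and does not return until they finish, so a hook that waits for the very thread that called exit() can never complete. The sketch uses a bounded join so it terminates instead of hanging:

```java
// Illustrative sketch, not MCF code: why System.exit() from a worker
// thread wedges shutdown if a shutdown hook waits on that same thread.
public class ExitHangSketch {
    public static void main(String[] args) throws InterruptedException {
        final Thread worker = new Thread(() -> {
            System.err.println("worker simulating OOM handler - shutting down");
            System.exit(0); // blocks here until all shutdown hooks complete
        }, "worker");

        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                // An unbounded worker.join() here would never return: the
                // worker is itself blocked inside System.exit() waiting for
                // this very hook. A bounded wait demonstrates the stalemate.
                worker.join(2000);
                System.out.println("worker still alive during shutdown hook: "
                        + worker.isAlive());
            } catch (InterruptedException ignored) {
            }
        }, "shutdown-hook"));

        worker.start();
        worker.join(10_000); // main just waits; the JVM halts once hooks finish
    }
}
```

This matches the thread dump shown earlier: "Worker thread '0'" is parked in ApplicationShutdownHooks.runHooks inside System.exit, while the "Shutdown thread" hook is waiting for the crawler threads to stop.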
>>>>>> Indeed I found an OoM error in the logs of the MCF-agent process: >>>>>> java.lang.OutOfMemoryError: Java heap space >>>>>> at com.mysql.jdbc.Buffer.<init>(Buffer.java:58) >>>>>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1441) >>>>>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2816) >>>>>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:467) >>>>>> at >>>>>> com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2510) >>>>>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1746) >>>>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2135) >>>>>> at >>> com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2542) >>>>>> at >>>>>> >>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1734) >>>>>> at >>>>>> >>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1885) >>>>>> at >>>>>> >>> org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1392) >>>>>> >>>>>> >>>>>> But the MCF agent process was not killed, it is still active. Why is >>> the >>>>>> process still active in this case? Since we know that we can get an >>> OOM, >>>>>> couldn't we find an elegant way to notify users of this problem from >>> the >>>>>> MCF admin UI, rather than just having the process endlessly active? >>>>>> >>>>>> As for the OOM, the workaround we decided to use is to increase the >>>>>> Xms and Xmx Java memory values in options.env.unix. >>>>>> I also attached the thread dump to this mail. >>>>>> >>>>>> Thanks, >>>>>> Best regards, >>>>>> >>>>>> Olivier >>>>>> >>>>>> On 6 May 2019 at 12:55, Karl Wright <daddy...@gmail.com> wrote: >>>>>> >>>>>> It sounds like there might be an out-of-memory situation, although if >>>>>> that is the case I would expect that the MCF agents process would >>> just shut >>>>>> itself down. >>>>>> >>>>>> If the process is hung, it should be easy to get a thread dump. That >>>>>> would be the first step. 
>>>>>> >>>>>> Some thoughts as to what might be happening: there might be an >>>>>> out-of-memory condition that is being silently eaten somewhere. MCF >>> relies >>>>>> on its connectors to use streams rather than load documents into >>> memory, >>>>>> BUT we're at the mercy of the JDBC drivers and Tika. So another >>> experiment >>>>>> would be to try to crawl your documents without the Tika extractor >>> (just >>>>>> send them to the null output connector) and see if that succeeds. If >>> it >>>>>> does, then maybe try the external Tika Service connector instead and >>> see >>>>>> what happens then. >>>>>> >>>>>> Karl >>>>>> >>>>>> >>>>>> On Mon, May 6, 2019 at 5:56 AM Olivier Tavard < >>>>>> olivier.tav...@francelabs.com> wrote: >>>>>> >>>>>>> Hi MCF community, >>>>>>> >>>>>>> We have some issues with the JDBC connector when indexing a database >>> with large >>>>>>> LONGBLOBs in it: I mean files of more than 100 MB. >>>>>>> To reproduce the issue, I created a simple database on MySQL 5.7 with >>>>>>> only one table in it with 2 columns: id (int) and data >>> (longblob). >>>>>>> There are 8 records in the table: two records with very small files >>> >>>>>>> (less than 5 KB) and 6 records with a large file of 192 MB. >>>>>>> >>>>>>> If I include in my crawl only one big BLOB, the crawl is successful. >>>>>>> seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id =6 >>>>>>> >>>>>>> But if I include more than 2 of these (with a WHERE IN condition, or >>> if >>>>>>> I select all the records of the table), the job stays in the running >>> state but >>>>>>> no document is processed: >>>>>>> seeding query: SELECT id AS $(IDCOLUMN) FROM data WHERE id IN (6,7) >>>>>>> >>>>>>> >>>>>>> >>>>>>> There is no message in the logs (even with debug mode activated). >>>>>>> The tests were done in a Docker cluster on dedicated servers (4 >>> cores, >>>>>>> 32 GB RAM each). >>>>>>> Regarding MCF, the version is 2.12. 
The MCF agent has 1 GB of RAM (Xms >>>>>>> and Xmx), the internal database is PostgreSQL 10.1, and we use >>> Zookeeper- >>>>>>> based synchronization. >>>>>>> We have a dump of the database here: >>>>>>> https://www.datafari.com/files/mcftest.tar.gz >>>>>>> >>>>>>> In other cases, we also encounter this error: >>>>>>> Error: Unexpected jobqueue status - record id 1555190422690, >>> expecting >>>>>>> active status, saw 0 >>>>>>> >>>>>>> Any idea what the problem could be? Or at least why you think there is >>>>>>> nothing in the debug log? >>>>>>> >>>>>>> Thank you, >>>>>>> Best regards, >>>>>>> >>>>>>> Olivier >>>>>>> >>>>>> >>>>>> >>> >>>
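One likely contributor to the OOM pattern discussed in this thread: by default MySQL Connector/J buffers an entire result set in memory, so a handful of 192 MB LONGBLOB rows can exceed a 1 GB heap on their own. The driver's documented row-streaming mode avoids that buffering. This is a hedged sketch, not MCF's actual JDBCConnection code: the helper name is illustrative, and the `main` method only runs the rough heap arithmetic for the scenario above (no database required).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingQuerySketch {
    // Illustrative helper: MySQL Connector/J streams rows one at a time only
    // for a forward-only, read-only statement with fetch size
    // Integer.MIN_VALUE; otherwise the whole result set is buffered in heap.
    static ResultSet streamRows(Connection conn, String sql) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(sql,
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        ps.setFetchSize(Integer.MIN_VALUE); // MySQL-specific streaming hint
        return ps.executeQuery(); // caller is responsible for closing
    }

    public static void main(String[] args) {
        // Rough arithmetic for the thread's scenario: six 192 MB LONGBLOB
        // rows buffered at once comfortably exceed a 1 GB heap.
        long rowBytes = 192L * 1024 * 1024;
        long buffered = 6 * rowBytes;
        long heap = 1024L * 1024 * 1024;
        System.out.println("buffered MB: " + buffered / (1024 * 1024));
        System.out.println("exceeds 1 GB heap: " + (buffered > heap));
    }
}
```

Whether MCF's JDBC connector can be configured to request streaming for a given driver is a separate question; the sketch only illustrates why the default buffering behavior and a 1 GB heap collide for this data set.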