[ https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063080#comment-16063080 ]
Roman commented on DRILL-4595: ------------------------------ I tried to reproduce issues with CTAS queries and caught one. Steps:

1) Set small DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY values in drill-env.sh, for example:
{code}
export DRILL_HEAP=${DRILL_HEAP:-"1G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"1G"}
{code}
2) Run a long CTAS query, for example:
{code:sql}
CREATE TABLE dfs.tmp.table3 AS SELECT * FROM dfs.tpcds_sf1_parquet_views.web_sales;
{code}

After that the drillbit fails (the process is killed) with the following error:
{code}
Error: CONNECTION ERROR: Connection /192.168.121.7:47697 <--> node1/192.168.121.7:31010 (user client) closed unexpectedly. Drillbit down?

[Error Id: 3de27393-8f21-4869-acd3-c4a14d01ed44 ] (state=,code=0)
{code}

Information from drillbit.log:
{code}
2017-06-26 13:02:53,062 [26aefa29-490b-e807-d093-548607458d28:frag:1:0] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
java.lang.OutOfMemoryError: Java heap space
    at java.util.AbstractList.iterator(AbstractList.java:288) ~[na:1.8.0_131]
    at org.apache.parquet.bytes.BytesInput$SequenceBytesIn.writeAllTo(BytesInput.java:263) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:174) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:118) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:154) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:115) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:187) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addDouble(MessageColumnIO.java:483) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.drill.exec.store.ParquetOutputRecordWriter$NullableFloat8ParquetConverter.writeField(ParquetOutputRecordWriter.java:970) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.EventBasedRecordWriter.write(EventBasedRecordWriter.java:65) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:106) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_131]
    at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_131]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) ~[hadoop-common-2.7.0-mapr-1607.jar:na]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-06-26 13:02:53,924 [26aefa29-490b-e807-d093-548607458d28:frag:1:1] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
java.lang.OutOfMemoryError: Java heap space
    at java.util.AbstractList.iterator(AbstractList.java:288) ~[na:1.8.0_131]
    at org.apache.parquet.bytes.BytesInput$SequenceBytesIn.writeAllTo(BytesInput.java:263) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:174) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:185) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.hadoop.DirectCodecFactory$SnappyCompressor.compress(DirectCodecFactory.java:291) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:94) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:154) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:115) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:187) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addDouble(MessageColumnIO.java:483) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
    at org.apache.drill.exec.store.ParquetOutputRecordWriter$NullableFloat8ParquetConverter.writeField(ParquetOutputRecordWriter.java:970) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.store.EventBasedRecordWriter.write(EventBasedRecordWriter.java:65) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:106) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_131]
    at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_131]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) ~[hadoop-common-2.7.0-mapr-1607.jar:na]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-06-26 13:02:54,065 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit - Received shutdown request.
2017-06-26 13:03:01,593 [pool-7-thread-2] INFO  o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@4f4a0f30 in 1058 ms
2017-06-26 13:03:01,594 [pool-7-thread-2] INFO  o.a.drill.exec.service.ServiceEngine - closed dataPool in 1058 ms
2017-06-26 13:03:03,657 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2 running fragments.
2017-06-26 13:03:03,657 [Drillbit-ShutdownHook#0] INFO  o.a.drill.exec.compile.CodeCompiler - Stats: code gen count: 4, cache miss count: 1, hit rate: 75%
{code}

I also see that the table directory was not cleaned up, so the partially written (corrupted) table is left behind:
{code}
hadoop fs -ls /tmp/
Found 1 items
drwxrwxr-x   - mapr users          2 2017-06-26 13:02 /tmp/table3
{code}

After restarting the drillbit there was no information about this query in the Web UI. This was a single-node drillbit cluster running Drill built from commit a7e298760f9c9efa.
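For context on why the whole drillbit process dies here rather than just the query: the "Catastrophic Failure Occurred, exiting" message is logged when FragmentExecutor catches a heap OutOfMemoryError. A direct-memory OOM can be converted into an ordinary query failure, but once the Java heap itself is exhausted the JVM is no longer trustworthy, so Drill deliberately exits the process. A paraphrased sketch of that handling pattern (not the exact Drill source; runFragmentLoop() is a placeholder for the real execution loop):
{code:java}
// Paraphrased sketch of the OOM handling inside FragmentExecutor.run(); not the exact source.
try {
  runFragmentLoop();  // placeholder: pulls record batches until the fragment finishes
} catch (OutOfMemoryError e) {
  if ("Direct buffer memory".equals(e.getMessage())) {
    // Direct-memory exhaustion can be reported as a normal query failure.
    fail(UserException.memoryError(e).build(logger));
  } else {
    // Heap exhaustion leaves the JVM in an undefined state, so exit the process.
    // This is what writes "Catastrophic Failure Occurred, exiting." to drillbit.log.
    CatastrophicFailure.exit(e, "Unable to handle out of memory condition in FragmentExecutor.", -2);
  }
}
{code}
That also explains the observed symptoms: the process exit kills the user connection (the CONNECTION ERROR above) and skips the cleanup that would normally remove the partially written table.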
> FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4595
>                 URL: https://issues.apache.org/jira/browse/DRILL-4595
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: Future
>
>
> When a fragment fails, it is assumed that it will be able to close itself and send its FAILED state to the foreman, which will then cancel any other running fragments. FragmentExecutor.cancel() interrupts the fragment thread, making sure those fragments don't stay blocked.
> However, if a fragment is already blocked when its fail() method is called, the foreman may never be notified and the query will hang forever. One such scenario is the following:
> - generally it is a CTAS running on a large cluster (lots of writers running in parallel)
> - the logs show that the user channel was closed and that UserServer caused the root fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in its receiver, waiting for data
> - jstack also shows that ALL other fragments are no longer running, and the logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish
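To make the direction the title suggests concrete, here is an illustrative sketch: fail() records the failure and interrupts the fragment's thread, just as cancel() already does, so a fragment blocked in its receiver wakes up, closes itself, and reports a terminal state to the foreman. All class, field, and method names below are assumptions for illustration, not the actual Drill patch:
{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; names are assumptions, not the actual patch.
public class FragmentRunner implements Runnable {
  private volatile Thread myThread;  // set while run() is executing
  private final AtomicReference<Throwable> failureCause = new AtomicReference<>();

  /** Called from another thread, e.g. when the user channel closes unexpectedly. */
  public void fail(final Throwable cause) {
    failureCause.compareAndSet(null, cause);  // remember only the first failure
    final Thread t = myThread;
    if (t != null) {
      t.interrupt();  // unblock a fragment stuck waiting on a receiver, as cancel() already does
    }
  }

  @Override
  public void run() {
    myThread = Thread.currentThread();
    try {
      blockingReceiveLoop();  // stand-in for the real fragment execution loop
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();  // woken by fail()/cancel(); fall through to cleanup
    } finally {
      myThread = null;
      // Close the fragment and send the final state (FAILED if failureCause is set) to the
      // foreman, instead of staying blocked forever and leaving the foreman waiting.
    }
  }

  private void blockingReceiveLoop() throws InterruptedException {
    Thread.sleep(Long.MAX_VALUE);  // models a root fragment blocked in its receiver
  }
}
{code}
The key point is that fail() must not rely on the fragment thread noticing the failure on its own; interrupting it mirrors cancel() and guarantees the foreman eventually receives a terminal state for the root fragment.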