I used kill -3, following the thread dump: Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):
"IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait() [0x00000000403c1000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403) - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445) "SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting on condition [0x00000000404c2000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000f9af0c38> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169) "main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95 waiting on condition [0x0000000041207000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000f9af5f58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502) "main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10 tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2) - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet) - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107) "communication thread" daemon prio=10 tid=0x000000004d020000 nid=0x4c93 waiting on condition [0x0000000042497000] java.lang.Thread.State: RUNNABLE at java.util.Hashtable.put(Hashtable.java:420) - locked <0x00000000f9dbaa58> (a java.util.Hashtable) at org.apache.hadoop.ipc.Client$Connection.addCall(Client.java:225) - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.access$1600(Client.java:176) at org.apache.hadoop.ipc.Client.getConnection(Client.java:854) at org.apache.hadoop.ipc.Client.call(Client.java:720) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:548) at java.lang.Thread.run(Thread.java:662) "Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90 runnable [0x0000000040714000] java.lang.Thread.State: RUNNABLE at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) at java.lang.StringBuilder.append(StringBuilder.java:119) at java.io.UnixFileSystem.resolve(UnixFileSystem.java:93) at java.io.File.<init>(File.java:312) at org.apache.hadoop.mapred.TaskLog.getTaskLogFile(TaskLog.java:72) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:180) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230) - locked <0x00000000eea92fc0> (a java.lang.Class for org.apache.hadoop.mapred.TaskLog) at org.apache.hadoop.mapred.Child$2.run(Child.java:89) "Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in Object.wait() [0x0000000041d7a000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81 in Object.wait() [0x0000000041005000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock) "main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000] java.lang.Thread.State: RUNNABLE at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:124) at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) "VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000 nid=0x4c78 runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000 nid=0x4c79 runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000 nid=0x4c7a runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800 nid=0x4c7b runnable "GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800 nid=0x4c7c runnable "GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800 nid=0x4c7d runnable "GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000 nid=0x4c7e runnable "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000 nid=0x4c7f runnable "VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87 waiting on condition JNI global references: 1099 Heap PSYoungGen total 59712K, used 51200K [0x00000000fbd60000, 0x0000000100000000, 0x0000000100000000) eden space 51200K, 100% used [0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000) from space 8512K, 0% used [0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000) to space 8512K, 0% used [0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000) PSOldGen total 136576K, used 104203K [0x00000000f3800000, 0x00000000fbd60000, 0x00000000fbd60000) object space 136576K, 76% used [0x00000000f3800000,0x00000000f9dc2d68,0x00000000fbd60000) PSPermGen total 21504K, used 11998K [0x00000000ee600000, 0x00000000efb00000, 0x00000000f3800000) object space 21504K, 55% used [0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000) 2011-07-04 15:35:02 Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode): "IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait() [0x00000000403c1000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403) - locked <0x00000000f9dba860> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445) "SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting on condition [0x00000000404c2000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000f9af0c38> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169) "main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95 waiting on condition [0x0000000041207000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000f9af5f58> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502) "main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10 tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2) - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet) - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107) "communication thread" daemon prio=10 tid=0x000000004d020000 nid=0x4c93 waiting on condition [0x0000000042497000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:529) at java.lang.Thread.run(Thread.java:662) "Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90 waiting on condition [0x0000000040714000] java.lang.Thread.State: RUNNABLE at java.util.Arrays.copyOfRange(Arrays.java:3209) at java.lang.String.<init>(String.java:215) at java.lang.StringBuilder.toString(StringBuilder.java:430) at org.apache.hadoop.fs.Path.<init>(Path.java:65) at org.apache.hadoop.fs.Path.<init>(Path.java:50) at org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:73) at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:410) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:193) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230) - locked <0x00000000eea92fc0> (a java.lang.Class for org.apache.hadoop.mapred.TaskLog) at org.apache.hadoop.mapred.Child$2.run(Child.java:89) "Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in Object.wait() [0x0000000041d7a000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81 in Object.wait() [0x0000000041005000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock) "main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000] java.lang.Thread.State: RUNNABLE at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:131) at com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) "VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000 nid=0x4c78 runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000 nid=0x4c79 runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000 nid=0x4c7a runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800 nid=0x4c7b runnable "GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800 nid=0x4c7c runnable "GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800 nid=0x4c7d runnable "GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000 nid=0x4c7e runnable "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000 nid=0x4c7f runnable "VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87 waiting on condition JNI global references: 1099 Heap PSYoungGen total 59712K, used 51200K [0x00000000fbd60000, 0x0000000100000000, 0x0000000100000000) eden space 51200K, 100% used [0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000) from space 8512K, 0% used [0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000) to space 8512K, 0% used [0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000) PSOldGen total 136576K, used 104212K [0x00000000f3800000, 0x00000000fbd60000, 0x00000000fbd60000) object space 136576K, 76% used [0x00000000f3800000,0x00000000f9dc50b8,0x00000000fbd60000) PSPermGen total 21504K, used 11998K [0x00000000ee600000, 0x00000000efb00000, 0x00000000f3800000) object space 21504K, 55% used [0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000) On Mon, Jul 4, 2011 at 6:22 PM, Ted Yu <yuzhih...@gmail.com> wrote: > I wasn't clear in my previous email. > It was not answer to why map tasks got stuck. > TableInputFormatBase.getSplits() is being called already. > > Can you try getting jstack of one of the map tasks before task tracker > kills > it ? > > Thanks > > On Mon, Jul 4, 2011 at 8:15 AM, Lior Schachter <li...@infolinks.com> > wrote: > > > 1. Currently every map gets one region. So I don't understand what > > difference will it make using the splits. > > 2. How should I use the TableInputFormatBase.getSplits() ? Could not find > > examples for that. > > > > Thanks, > > Lior > > > > > > On Mon, Jul 4, 2011 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > For #2, see TableInputFormatBase.getSplits(): > > > * Calculates the splits that will serve as input for the map tasks. > The > > > * number of splits matches the number of regions in a table. > > > > > > > > > On Mon, Jul 4, 2011 at 7:37 AM, Lior Schachter <li...@infolinks.com> > > > wrote: > > > > > > > 1. yes - I configure my job using this line: > > > > TableMapReduceUtil.initTableMapperJob(HBaseConsts.URLS_TABLE_NAME, > > scan, > > > > ScanMapper.class, Text.class, MapWritable.class, job) > > > > > > > > which internally uses TableInputFormat.class > > > > > > > > 2. One split per region ? What do you mean ? How do I do that ? > > > > > > > > 3. hbase version 0.90.2 > > > > > > > > 4. no exceptions. the logs are very clean. > > > > > > > > > > > > > > > > On Mon, Jul 4, 2011 at 5:22 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > > > > > Do you use TableInputFormat ? > > > > > To scan large number of rows, it would be better to produce one > Split > > > per > > > > > region. > > > > > > > > > > What HBase version do you use ? > > > > > Do you find any exception in master / region server logs around the > > > > moment > > > > > of timeout ? > > > > > > > > > > Cheers > > > > > > > > > > On Mon, Jul 4, 2011 at 4:48 AM, Lior Schachter < > li...@infolinks.com> > > > > > wrote: > > > > > > > > > > > Hi all, > > > > > > I'm running a scan using the M/R framework. > > > > > > My table contains hundreds of millions of rows and I'm scanning > > using > > > > > > start/stop key about 50 million rows. > > > > > > > > > > > > The problem is that some map tasks get stuck and the task manager > > > kills > > > > > > these maps after 600 seconds. When retrying the task everything > > works > > > > > fine > > > > > > (sometimes). > > > > > > > > > > > > To verify that the problem is in hbase (and not in the map code) > I > > > > > removed > > > > > > all the code from my map function, so it looks like this: > > > > > > public void map(ImmutableBytesWritable key, Result value, Context > > > > > context) > > > > > > throws IOException, InterruptedException { > > > > > > } > > > > > > > > > > > > Also, when the map got stuck on a region, I tried to scan this > > region > > > > > > (using > > > > > > simple scan from a Java main) and it worked fine. > > > > > > > > > > > > Any ideas ? > > > > > > > > > > > > Thanks, > > > > > > Lior > > > > > > > > > > > > > > > > > > > > >