I used kill -3; the thread dumps follow:

Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):

"IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon
prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait()
[0x00000000403c1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9dba860> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
        - locked <0x00000000f9dba860> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

"SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting
on condition [0x00000000404c2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f9af0c38> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

"main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95
waiting on condition [0x0000000041207000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f9af5f58> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

"main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10
tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2)
        - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107)

"communication thread" daemon prio=10 tid=0x000000004d020000
nid=0x4c93 waiting on condition [0x0000000042497000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Hashtable.put(Hashtable.java:420)
        - locked <0x00000000f9dbaa58> (a java.util.Hashtable)
        at org.apache.hadoop.ipc.Client$Connection.addCall(Client.java:225)
        - locked <0x00000000f9dba860> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.access$1600(Client.java:176)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:854)
        at org.apache.hadoop.ipc.Client.call(Client.java:720)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:548)
        at java.lang.Thread.run(Thread.java:662)

"Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90
runnable [0x0000000040714000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at 
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at java.io.UnixFileSystem.resolve(UnixFileSystem.java:93)
        at java.io.File.<init>(File.java:312)
        at org.apache.hadoop.mapred.TaskLog.getTaskLogFile(TaskLog.java:72)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:180)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230)
        - locked <0x00000000eea92fc0> (a java.lang.Class for
org.apache.hadoop.mapred.TaskLog)
        at org.apache.hadoop.mapred.Child$2.run(Child.java:89)

"Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in
Object.wait() [0x0000000041d7a000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81
in Object.wait() [0x0000000041005000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000]
   java.lang.Thread.State: RUNNABLE
        at 
com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:124)
        at 
com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

"VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000
nid=0x4c78 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000
nid=0x4c79 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000
nid=0x4c7a runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800
nid=0x4c7b runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800
nid=0x4c7c runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800
nid=0x4c7d runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000
nid=0x4c7e runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000
nid=0x4c7f runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87
waiting on condition

JNI global references: 1099

Heap
 PSYoungGen      total 59712K, used 51200K [0x00000000fbd60000,
0x0000000100000000, 0x0000000100000000)
  eden space 51200K, 100% used
[0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000)
  from space 8512K, 0% used
[0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000)
  to   space 8512K, 0% used
[0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000)
 PSOldGen        total 136576K, used 104203K [0x00000000f3800000,
0x00000000fbd60000, 0x00000000fbd60000)
  object space 136576K, 76% used
[0x00000000f3800000,0x00000000f9dc2d68,0x00000000fbd60000)
 PSPermGen       total 21504K, used 11998K [0x00000000ee600000,
0x00000000efb00000, 0x00000000f3800000)
  object space 21504K, 55% used
[0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000)

2011-07-04 15:35:02

Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode):

"IPC Client (47) connection to /127.0.0.1:59759 from hadoop" daemon
prio=10 tid=0x00002aaab05ca800 nid=0x4eaf in Object.wait()
[0x00000000403c1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9dba860> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
        - locked <0x00000000f9dba860> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

"SpillThread" daemon prio=10 tid=0x00002aaab0585000 nid=0x4c99 waiting
on condition [0x00000000404c2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f9af0c38> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1169)

"main-EventThread" daemon prio=10 tid=0x00002aaab035d000 nid=0x4c95
waiting on condition [0x0000000041207000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f9af5f58> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

"main-SendThread(hadoop09.infolinks.local:2181)" daemon prio=10
tid=0x00002aaab035c000 nid=0x4c94 runnable [0x0000000040815000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x00000000f9af61a8> (a sun.nio.ch.Util$2)
        - locked <0x00000000f9af61b8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000f9af6160> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1107)

"communication thread" daemon prio=10 tid=0x000000004d020000
nid=0x4c93 waiting on condition [0x0000000042497000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:529)
        at java.lang.Thread.run(Thread.java:662)

"Thread for syncLogs" daemon prio=10 tid=0x00002aaab02e9800 nid=0x4c90
waiting on condition [0x0000000040714000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.lang.StringBuilder.toString(StringBuilder.java:430)
        at org.apache.hadoop.fs.Path.<init>(Path.java:65)
        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:73)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:410)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:193)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:230)
        - locked <0x00000000eea92fc0> (a java.lang.Class for
org.apache.hadoop.mapred.TaskLog)
        at org.apache.hadoop.mapred.Child$2.run(Child.java:89)

"Low Memory Detector" daemon prio=10 tid=0x00002aaab0001800 nid=0x4c86
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x000000004cb4e800 nid=0x4c85
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x000000004cb4b000 nid=0x4c84
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x000000004cb49000 nid=0x4c83
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x000000004cb2c800 nid=0x4c82 in
Object.wait() [0x0000000041d7a000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x00000000f9c52630> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x000000004cb25000 nid=0x4c81
in Object.wait() [0x0000000041005000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00000000f9af5ea0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x000000004cab9000 nid=0x4c77 runnable [0x0000000040f04000]
   java.lang.Thread.State: RUNNABLE
        at 
com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:131)
        at 
com.infolinks.hadoop.hbase.HBaseURLsDaysAggregator$ScanMapper.map(HBaseURLsDaysAggregator.java:108)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

"VM Thread" prio=10 tid=0x000000004cb1e800 nid=0x4c80 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004cacc000
nid=0x4c78 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004cace000
nid=0x4c79 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x000000004cad0000
nid=0x4c7a runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x000000004cad1800
nid=0x4c7b runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x000000004cad3800
nid=0x4c7c runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x000000004cad5800
nid=0x4c7d runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x000000004cad7000
nid=0x4c7e runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004cad9000
nid=0x4c7f runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aaab000c000 nid=0x4c87
waiting on condition

JNI global references: 1099

Heap
 PSYoungGen      total 59712K, used 51200K [0x00000000fbd60000,
0x0000000100000000, 0x0000000100000000)
  eden space 51200K, 100% used
[0x00000000fbd60000,0x00000000fef60000,0x00000000fef60000)
  from space 8512K, 0% used
[0x00000000ff7b0000,0x00000000ff7b0000,0x0000000100000000)
  to   space 8512K, 0% used
[0x00000000fef60000,0x00000000fef60000,0x00000000ff7b0000)
 PSOldGen        total 136576K, used 104212K [0x00000000f3800000,
0x00000000fbd60000, 0x00000000fbd60000)
  object space 136576K, 76% used
[0x00000000f3800000,0x00000000f9dc50b8,0x00000000fbd60000)
 PSPermGen       total 21504K, used 11998K [0x00000000ee600000,
0x00000000efb00000, 0x00000000f3800000)
  object space 21504K, 55% used
[0x00000000ee600000,0x00000000ef1b7aa0,0x00000000efb00000)



On Mon, Jul 4, 2011 at 6:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I wasn't clear in my previous email.
> It was not an answer to why the map tasks got stuck.
> TableInputFormatBase.getSplits() is being called already.
>
> Can you try getting a jstack of one of the map tasks before the task
> tracker kills it?
>
> Thanks
>
> On Mon, Jul 4, 2011 at 8:15 AM, Lior Schachter <li...@infolinks.com>
> wrote:
>
> > 1. Currently every map gets one region, so I don't understand what
> > difference using the splits will make.
> > 2. How should I use TableInputFormatBase.getSplits()? I could not find
> > examples for it.
> >
> > Thanks,
> > Lior
> >
> >
> > On Mon, Jul 4, 2011 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > For #2, see TableInputFormatBase.getSplits():
> > >   * Calculates the splits that will serve as input for the map tasks. The
> > >   * number of splits matches the number of regions in a table.
> > >
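If I understand Ted's point correctly, getSplits() is not something we call
directly: the MapReduce framework invokes it on whatever InputFormat the job
is configured with, and TableInputFormatBase returns one split per region
that overlaps the scan's start/stop range, so there is already one map per
region. A minimal sketch of a subclass that only logs the split count -- the
class name LoggingTableInputFormat is made up, and I have not run this:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical subclass, only here to show where the splits come from.
public class LoggingTableInputFormat extends TableInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // TableInputFormatBase builds one split per region overlapping the
    // scan's start/stop row range; the framework runs one map task per split.
    List<InputSplit> splits = super.getSplits(context);
    System.out.println("splits (regions) for this scan: " + splits.size());
    return splits;
  }
}

It would be plugged in with job.setInputFormatClass(LoggingTableInputFormat.class)
after initTableMapperJob, which has already placed the table name and the
serialized Scan into the job configuration.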
> > >
> > > On Mon, Jul 4, 2011 at 7:37 AM, Lior Schachter <li...@infolinks.com>
> > > wrote:
> > >
> > > > 1. yes - I configure my job using this line:
> > > > TableMapReduceUtil.initTableMapperJob(HBaseConsts.URLS_TABLE_NAME, scan,
> > > > ScanMapper.class, Text.class, MapWritable.class, job)
> > > >
> > > > which internally uses TableInputFormat.class
> > > >
> > > > 2. One split per region? What do you mean? How do I do that?
> > > >
> > > > 3. hbase version 0.90.2
> > > >
> > > > 4. no exceptions. the logs are very clean.
> > > >
> > > >
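For completeness, the driver around that call looks roughly like the sketch
below. Only the initTableMapperJob line is the real one from above; the
driver class name, the start/stop keys, the caching values and the omitted
reducer setup are simplified or made up here (HBaseURLsDaysAggregator,
ScanMapper and HBaseConsts are our own classes, assumed to be in the same
package as this sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobDriver {                      // hypothetical driver class
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "HBaseURLsDaysAggregator");
    job.setJarByClass(ScanJobDriver.class);

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(args[0]));     // start key -- placeholder
    scan.setStopRow(Bytes.toBytes(args[1]));      // stop key -- placeholder
    scan.setCaching(500);                         // rows fetched per RPC; value assumed
    scan.setCacheBlocks(false);                   // usual advice for full MR scans

    // the line quoted above; it configures TableInputFormat, whose
    // getSplits() yields one split per region
    TableMapReduceUtil.initTableMapperJob(HBaseConsts.URLS_TABLE_NAME, scan,
        HBaseURLsDaysAggregator.ScanMapper.class, Text.class,
        MapWritable.class, job);

    // reducer / output configuration of the real job omitted here
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}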
> > > >
> > > > On Mon, Jul 4, 2011 at 5:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > Do you use TableInputFormat?
> > > > > To scan a large number of rows, it would be better to produce one
> > > > > Split per region.
> > > > >
> > > > > What HBase version do you use ?
> > > > > Do you find any exception in the master / region server logs around
> > > > > the moment of the timeout?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Mon, Jul 4, 2011 at 4:48 AM, Lior Schachter <li...@infolinks.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > > I'm running a scan using the M/R framework.
> > > > > > My table contains hundreds of millions of rows, and I'm scanning
> > > > > > about 50 million of them using a start/stop key.
> > > > > >
> > > > > > The problem is that some map tasks get stuck, and the task tracker
> > > > > > kills these maps after 600 seconds. When the task is retried,
> > > > > > everything works fine (sometimes).
> > > > > >
> > > > > > To verify that the problem is in HBase (and not in the map code),
> > > > > > I removed all the code from my map function, so it looks like this:
> > > > > > public void map(ImmutableBytesWritable key, Result value, Context
> > > > > > context) throws IOException, InterruptedException {
> > > > > > }
> > > > > >
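Written out as a complete class, that stripped-down mapper is essentially the
following. The package and class names are the ones in the stack traces
above, and the Text/MapWritable type parameters follow the initTableMapperJob
arguments; the rest is filled in by me:

package com.infolinks.hadoop.hbase;

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class HBaseURLsDaysAggregator {

  // Output key/value types match the initTableMapperJob call above.
  public static class ScanMapper extends TableMapper<Text, MapWritable> {
    @Override
    public void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      // intentionally empty -- rules out the map body as the cause of the hang
    }
  }
}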
> > > > > > Also, when the map got stuck on a region, I tried to scan this
> > > > > > region (using a simple scan from a Java main) and it worked fine.
> > > > > >
> > > > > > Any ideas ?
> > > > > >
> > > > > > Thanks,
> > > > > > Lior
> > > > > >
> > > > >
> > > >
> > >
> >
>
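For reference, the "simple scan from a Java main" mentioned above was along
these lines; the class name, the table name and the key-range arguments here
are placeholders, not the real values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionScanCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, args[0]);     // table name -- placeholder
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(args[1]));     // region start key -- placeholder
    scan.setStopRow(Bytes.toBytes(args[2]));      // region end key -- placeholder
    ResultScanner scanner = table.getScanner(scan);
    long rows = 0;
    try {
      for (Result r : scanner) {
        rows++;                                   // just walk the rows, as in the test
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("scanned " + rows + " rows");
  }
}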
