Although connection count may not be the root cause, please read http://zhihongyu.blogspot.com/2011/04/managing-connections-in-hbase-090-and.html if you have time. 0.92.0 would do a much better job of managing connections.
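For reference, a minimal sketch of raising hbase.zookeeper.property.maxClientCnxns in hbase-site.xml. It assumes HBase itself starts and manages the ZooKeeper quorum; with a separately managed ZooKeeper ensemble the equivalent setting would be maxClientCnxns in zoo.cfg instead. The value 1000 is simply the figure mentioned in this thread, not a tuned recommendation.

```xml
<!-- hbase-site.xml: sketch only; assumes HBase manages the ZooKeeper quorum -->
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <!-- 1000 is the value mentioned in this thread, not a recommendation -->
  <value>1000</value>
</property>
```

The limit applies per client IP address, so several map tasks on the same node count against it together. The ZooKeeper servers would most likely need a restart before the new value takes effect.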
On Mon, Jul 4, 2011 at 10:14 AM, Lior Schachter <li...@infolinks.com> wrote:

> I will increase the number of connections to 1000.
>
> Thanks !
>
> Lior
>
> On Mon, Jul 4, 2011 at 8:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > The reason I asked about HBaseURLsDaysAggregator.java was that I see no
> > HBase (client) code in call stack.
> > I have little clue for the problem you experienced.
> >
> > There may be more than one connection to zookeeper from one map task.
> > So it doesn't hurt if you increase hbase.zookeeper.property.maxClientCnxns
> >
> > Cheers
> >
> > On Mon, Jul 4, 2011 at 9:47 AM, Lior Schachter <li...@infolinks.com> wrote:
> >
> > > 1. HBaseURLsDaysAggregator.java:124, HBaseURLsDaysAggregator.java:131 : are
> > > not important since even when I removed all my map code the tasks got stuck
> > > (but the thread dumps were generated after I revived the code). If you think
> > > its important I'll remove the map code again and re-generate the thread
> > > dumps...
> > >
> > > 2. 82 maps were launched but only 36 ran simultaneously.
> > >
> > > 3. hbase.zookeeper.property.maxClientCnxns = 300. Should I increase it ?
> > >
> > > Thanks,
> > > Lior
> > >
> > > On Mon, Jul 4, 2011 at 7:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > In the future, provide full dump using pastebin.com
> > > > Write snippet of log in email.
> > > >
> > > > Can you tell us what the following lines are about ?
> > > > HBaseURLsDaysAggregator.java:124
> > > > HBaseURLsDaysAggregator.java:131
> > > >
> > > > How many mappers were launched ?
> > > >
> > > > What value is used for hbase.zookeeper.property.maxClientCnxns ?
> > > > You may need to increase the value for above setting.
> > > >
> > > > On Mon, Jul 4, 2011 at 9:26 AM, Lior Schachter <li...@infolinks.com> wrote:
> > > >
> > > > > I used kill -3, following the thread dump:
> > > > >
> > > > > ...
> > > > >
> > > > > On Mon, Jul 4, 2011 at 6:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > >
> > > > > > I wasn't clear in my previous email.
> > > > > > It was not answer to why map tasks got stuck.
> > > > > > TableInputFormatBase.getSplits() is being called already.
> > > > > >
> > > > > > Can you try getting jstack of one of the map tasks before task tracker
> > > > > > kills it ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Mon, Jul 4, 2011 at 8:15 AM, Lior Schachter <li...@infolinks.com> wrote:
> > > > > >
> > > > > > > 1. Currently every map gets one region. So I don't understand what
> > > > > > > difference will it make using the splits.
> > > > > > > 2. How should I use the TableInputFormatBase.getSplits() ? Could not
> > > > > > > find examples for that.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lior
> > > > > > >
> > > > > > > On Mon, Jul 4, 2011 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > >
> > > > > > > > For #2, see TableInputFormatBase.getSplits():
> > > > > > > >  * Calculates the splits that will serve as input for the map tasks. The
> > > > > > > >  * number of splits matches the number of regions in a table.
> > > > > > > >
> > > > > > > > On Mon, Jul 4, 2011 at 7:37 AM, Lior Schachter <li...@infolinks.com> wrote:
> > > > > > > >
> > > > > > > > > 1. yes - I configure my job using this line:
> > > > > > > > > TableMapReduceUtil.initTableMapperJob(HBaseConsts.URLS_TABLE_NAME, scan,
> > > > > > > > > ScanMapper.class, Text.class, MapWritable.class, job)
> > > > > > > > >
> > > > > > > > > which internally uses TableInputFormat.class
> > > > > > > > >
> > > > > > > > > 2. One split per region ? What do you mean ? How do I do that ?
> > > > > > > > >
> > > > > > > > > 3. hbase version 0.90.2
> > > > > > > > >
> > > > > > > > > 4. no exceptions. the logs are very clean.
> > > > > > > > >
> > > > > > > > > On Mon, Jul 4, 2011 at 5:22 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Do you use TableInputFormat ?
> > > > > > > > > > To scan large number of rows, it would be better to produce one Split
> > > > > > > > > > per region.
> > > > > > > > > >
> > > > > > > > > > What HBase version do you use ?
> > > > > > > > > > Do you find any exception in master / region server logs around the
> > > > > > > > > > moment of timeout ?
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 4, 2011 at 4:48 AM, Lior Schachter <li...@infolinks.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > > I'm running a scan using the M/R framework.
> > > > > > > > > > > My table contains hundreds of millions of rows and I'm scanning using
> > > > > > > > > > > start/stop key about 50 million rows.
> > > > > > > > > > >
> > > > > > > > > > > The problem is that some map tasks get stuck and the task manager kills
> > > > > > > > > > > these maps after 600 seconds. When retrying the task everything works
> > > > > > > > > > > fine (sometimes).
> > > > > > > > > > >
> > > > > > > > > > > To verify that the problem is in hbase (and not in the map code) I
> > > > > > > > > > > removed all the code from my map function, so it looks like this:
> > > > > > > > > > > public void map(ImmutableBytesWritable key, Result value, Context context)
> > > > > > > > > > >     throws IOException, InterruptedException {
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > Also, when the map got stuck on a region, I tried to scan this region
> > > > > > > > > > > (using simple scan from a Java main) and it worked fine.
> > > > > > > > > > >
> > > > > > > > > > > Any ideas ?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Lior