Recovering ranges from crashed RangeServers is one of the
high-priority items Doug is working on.

-Sanjit

On Jul 21, 2009, at 7:59 PM, kuer wrote:

>
> Hi, all,
>
> Another question: since one of the range servers core dumps when
> replaying the commit log, I stopped restarting it. But now the
> whole HT system seems to have stopped working, too.
>
> The client program complains of socket.timeout.
>
> The hyperspace shell hangs:
> hypertable> show tables;
> METADATA
> kvcache
> storage_se
>
>  Elapsed time:  0.00 s
> hypertable> show create table storage_se;
> ^^^^^ waiting for .... ????
>
> Log messages from Hypertable.Master:
>
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] (AsyncComm/Comm.cc:212) No connection for 221.194.134.173:31060
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [WARN] (Lib/RangeServerClient.cc:312) Comm::send_request to 221.194.134.173:31060 failed - COMM not connected
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] find_range_and_start_scan (Lib/IntervalScanner.cc:408): Hypertable::Exception: Comm::send_request to 221.194.134.173:31060 failed - COMM not connected
>       at void Hypertable::RangeServerClient::send_message(const sockaddr_in&, Hypertable::CommBufPtr&, Hypertable::DispatchHandler*) (Lib/RangeServerClient.cc:314)
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] (Master/MasterGc.cc:239) Error: caught exception while gc'ing: Problem creating scanner on METADATA[..0:\xff\xff]
>
> NOTE: 221.194.134.173 is the IP of the box where the RangeServer failed.
>
> My question is:
> since all information is shared by all RangeServers, why doesn't
> Hypertable.Master reassign the ranges to other RangeServers when some
> of the RangeServers go down?
>
> thanks
>
>   -- kuer
>
>
>
>
> On Jul 22, 10:43 AM, kuer <[email protected]> wrote:
>> Hi, Sanjit,
>>
>> I just uploaded the second part of the RangeServer log,
>> range.20090722.log.2.gz.
>>
>> The first part, range.20090722.log.1.gz, is about 18MB, which exceeds
>> the upload size limit.
>>
>> http://hypertable-dev.googlegroups.com/web/range.20090722.log.2.gz? 
>> gd...
>>
>> If necessary, I will split the first log file and upload the pieces.
>>
>> Thanks
>>
>>   -- kuer
>>
>> On Jul 22, 10:15 AM, Sanjit Jhala <[email protected]> wrote:
>>
>>> Hi Kuer,
>>
>>> You can gzip the RangeServer log and post it to the File Upload
>>> Page. Thanks for reporting this issue.
>>
>>> -Sanjit
>>
>>> On Jul 21, 2009, at 6:44 PM, kuer wrote:
>>
>>>> Hi, Sanjit,
>>
>>>> With the --debug option I get some log messages, but the file is big.
>>>> How can I share it with you?
>>
>>>> Here is the gdb backtrace from the core file:
>>
>>>> (gdb) bt
>>>> #0  0x0000000000538272 in Hypertable::BasicBloomFilter<Hypertable::MurmurHash2>::BasicBloomFilter ()
>>>> #1  0x000000000053d3be in Hypertable::CellStoreV1::create_bloom_filter ()
>>>> #2  0x000000000053e10e in Hypertable::CellStoreV1::finalize ()
>>>> #3  0x000000000051f112 in Hypertable::AccessGroup::run_compaction ()
>>>> #4  0x0000000000504e45 in Hypertable::Range::split_compact_and_shrink ()
>>>> #5  0x0000000000509310 in Hypertable::Range::split ()
>>>> #6  0x00000000004ec693 in Hypertable::MaintenanceQueue::Worker::operator() ()
>>>> #7  0x00000000006a5c40 in thread_proxy ()
>>>> #8  0x00000038ae406367 in start_thread () from /lib64/libpthread.so.0
>>>> #9  0x00000038ad8d2f7d in clone () from /lib64/libc.so.6
>>
>>>> -- kuer
>>
>>>> On Jul 22, 9:07 AM, Sanjit Jhala <[email protected]> wrote:
>>>>> Hi Kuer,
>>
>>>>> This looks like a bug in the RangeServer code. The RangeServer is
>>>>> trying to create a CellStore file, and while creating the CellStore's
>>>>> BloomFilter it's hitting an error condition.
>>
>>>>> Can you try a couple of things to help debug this issue?
>>
>>>>> First, turn on RangeServer debug logging and report the RangeServer
>>>>> logs. You can do this by adding the global option --debug to your
>>>>> start-all-servers.sh command line. Example:
>>>>> <$HYPERTABLE_INSTALL_DIR>/bin/start-all-servers.sh kfs --debug
>>
>>>>> Second, if you could compile a debug build and send the stack trace,
>>>>> that would be helpful. To do this, from your hypertable build
>>>>> directory run ccmake <$HYPERTABLE_SRC_DIR>, make sure CMAKE_BUILD_TYPE
>>>>> is set to Debug, and install the new build. After you try to bring up
>>>>> the RangeServer and it dumps core, you can load the core file in gdb
>>>>> (e.g. gdb <$HYPERTABLE_INSTALL_DIR>/bin/Hypertable.RangeServer
>>>>> <$CORE_FILE>). You can then run bt (backtrace) in gdb to get the stack
>>>>> trace.
>>
>>>>> -Sanjit
>>
>>>>> On Jul 21, 2009, at 5:36 PM, kuer wrote:
>>
>>>>>> Hi, all,
>>
>>>>>> One of the RangeServers hangs after a core dump and restart. Here are
>>>>>> the messages in the RangeServer's log:
>>
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1246607682171649001 >= 1246607682128108001 (file='/hypertable/servers/221.194.134.173_31060/log/root/0')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1248187695757932563 >= 1247819802453791364 (file='/hypertable/servers/221.194.134.173_31060/log/metadata/2')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1248193806824860161 >= 1248189458336849002 (file='/hypertable/servers/221.194.134.173_31060/log/user/401')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [INFO] (RangeServer/MaintenancePrioritizerLogCleanup.cc:103) Adding maintenance for range METADATA[0: .. ] because mid-split(1)
>>>>>> 2009-07-22 08:23:41,449 1295067456 Hypertable.RangeServer [INFO] (RangeServer/RangeServer.cc:2032) Memory Usage: 312320288 bytes
>>>>>> 2009-07-22 08:23:41,449 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:379) Starting Major Compaction of METADATA[0: .. ](default)
>>>>>> 2009-07-22 08:23:41,529 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:533) Finished Compaction of METADATA[0: .. ](default)
>>>>>> 2009-07-22 08:23:41,530 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:372) Starting InMemory Compaction of METADATA[0: .. ](location)
>>>>>> 2009-07-22 08:23:41,549 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:533) Finished Compaction of METADATA[0: .. ](location)
>>>>>> 2009-07-22 08:23:41,549 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:379) Starting Major Compaction of METADATA[0: .. ](logging)
>>>>>> 2009-07-22 08:23:41,552 1378986304 Hypertable.RangeServer [FATAL] (Common/BloomFilter.h:47) failed expectation: m_num_bits != 0
>>
>>>>>> It seems that the RangeServer cannot recover by replaying the log.
>>
>>>>>> What's the problem? How can I fix it?
>>
>>>>>> Thanks
>>
>>>>>>   -- kuer
> >


