Recovering ranges from crashed RangeServers is one of the high-priority items Doug is working on.
-Sanjit

On Jul 21, 2009, at 7:59 PM, kuer wrote:
>
> Hi, all,
>
> Another question: since one of the range servers core dumps when
> replaying the commit log, I just stopped restarting it. But this time
> the whole HT system seems to have stopped working, too.
>
> The client program complains about socket.timeout,
>
> and the hyperspace shell hangs:
> hypertable> show tables;
> METADATA
> kvcache
> storage_se
>
> Elapsed time: 0.00 s
> hypertable> show create table storage_se;
> ^^^^^ waiting for .... ????
>
> Logging messages from Hypertable.Master:
>
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] (AsyncComm/Comm.cc:212) No connection for 221.194.134.173:31060
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [WARN] (Lib/RangeServerClient.cc:312) Comm::send_request to 221.194.134.173:31060 failed - COMM not connected
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] find_range_and_start_scan (Lib/IntervalScanner.cc:408): Hypertable::Exception: Comm::send_request to 221.194.134.173:31060 failed - COMM not connected
>     at void Hypertable::RangeServerClient::send_message(const sockaddr_in&, Hypertable::CommBufPtr&, Hypertable::DispatchHandler*) (Lib/RangeServerClient.cc:314)
> 2009-07-22 10:45:45,276 1350199616 Hypertable.Master [ERROR] (Master/MasterGc.cc:239) Error: caught exception while gc'ing: Problem creating scanner on METADATA[..0:��]
>
> NOTE: 221.194.134.173 is the IP of the box where the RangeServer went wrong.
>
> My question is: since all information is shared by all RangeServers,
> why doesn't Hypertable.Master reassign the ranges to other
> RangeServers when some of the RangeServers go out of service?
>
> Thanks
>
> -- kuer
>
> On Jul 22, 10:43 AM, kuer <[email protected]> wrote:
>> Hi, Sanjit,
>>
>> I just uploaded the second part of range.log, range.20090722.log.2.gz.
>>
>> The first part, range.20090722.log.1.gz, is about 18MB; it exceeds
>> the file upload limit.
>>
>> http://hypertable-dev.googlegroups.com/web/range.20090722.log.2.gz?gd...
>>
>> If it is necessary, I will split the first log file and upload the pieces.
>>
>> Thanks
>>
>> -- kuer
>>
>> On Jul 22, 10:15 AM, Sanjit Jhala <[email protected]> wrote:
>>
>>> Hi Kuer,
>>>
>>> You can gzip the RangeServer logs and post them to the File Upload
>>> Page. Thanks for reporting this issue.
>>>
>>> -Sanjit
>>>
>>> On Jul 21, 2009, at 6:44 PM, kuer wrote:
>>>
>>>> Hi, Sanjit,
>>>>
>>>> With the --debug option I get some logging messages, but the file
>>>> is big. How can I share it with you?
>>>>
>>>> gdb backtrace of the core file:
>>>>
>>>> (gdb) bt
>>>> #0  0x0000000000538272 in Hypertable::BasicBloomFilter<Hypertable::MurmurHash2>::BasicBloomFilter ()
>>>> #1  0x000000000053d3be in Hypertable::CellStoreV1::create_bloom_filter ()
>>>> #2  0x000000000053e10e in Hypertable::CellStoreV1::finalize ()
>>>> #3  0x000000000051f112 in Hypertable::AccessGroup::run_compaction ()
>>>> #4  0x0000000000504e45 in Hypertable::Range::split_compact_and_shrink ()
>>>> #5  0x0000000000509310 in Hypertable::Range::split ()
>>>> #6  0x00000000004ec693 in Hypertable::MaintenanceQueue::Worker::operator() ()
>>>> #7  0x00000000006a5c40 in thread_proxy ()
>>>> #8  0x00000038ae406367 in start_thread () from /lib64/libpthread.so.0
>>>> #9  0x00000038ad8d2f7d in clone () from /lib64/libc.so.6
>>>>
>>>> -- kuer
>>>>
>>>> On Jul 22, 9:07 AM, Sanjit Jhala <[email protected]> wrote:
>>>>> Hi Kuer,
>>>>>
>>>>> This looks like a bug in the RangeServer code. The RangeServer is
>>>>> trying to create a CellStore file, and while creating the
>>>>> CellStore's BloomFilter it's hitting an error condition.
>>>>>
>>>>> Can you try a couple of things to help debug this issue?
>>>>>
>>>>> Firstly, turn on the RangeServer debug logging and report the
>>>>> RangeServer logs. You can do this by adding the global option
>>>>> --debug to your start-all-servers.sh command line.
>>>>> Example: <$HYPERTABLE_INSTALL_DIR>/bin/start-all-servers.sh kfs --debug
>>>>>
>>>>> Secondly, if you could compile a debug build and send the stack
>>>>> trace, that would be helpful. To do this, from your hypertable
>>>>> build directory run ccmake <$HYPERTABLE_SRC_DIR>, make sure
>>>>> CMAKE_BUILD_TYPE is set to Debug, and install the new build. After
>>>>> you try to bring up the RangeServer and it dumps core, you can
>>>>> load the core file in gdb (e.g.
>>>>> gdb <$HYPERTABLE_INSTALL_DIR>/bin/Hypertable.RangeServer <$CORE_FILE>).
>>>>> You can run bt (backtrace) in gdb to get the stack trace.
>>>>>
>>>>> -Sanjit
>>>>>
>>>>> On Jul 21, 2009, at 5:36 PM, kuer wrote:
>>>>>
>>>>>> Hi, all,
>>>>>>
>>>>>> One of the RangeServers hangs after a core dump and restart. Here
>>>>>> are the messages in the RangeServer's log:
>>>>>>
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1246607682171649001 >= 1246607682128108001 (file='/hypertable/servers/221.194.134.173_31060/log/root/0')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1248187695757932563 >= 1247819802453791364 (file='/hypertable/servers/221.194.134.173_31060/log/metadata/2')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [WARN] (Lib/CommitLog.cc:250) clgc LOG FRAGMENT PURGE breaking because 1248193806824860161 >= 1248189458336849002 (file='/hypertable/servers/221.194.134.173_31060/log/user/401')
>>>>>> 2009-07-22 08:23:41,448 1295067456 Hypertable.RangeServer [INFO] (RangeServer/MaintenancePrioritizerLogCleanup.cc:103) Adding maintenance for range METADATA[0: .. ] because mid-split(1)
>>>>>> 2009-07-22 08:23:41,449 1295067456 Hypertable.RangeServer [INFO] (RangeServer/RangeServer.cc:2032) Memory Usage: 312320288 bytes
>>>>>> 2009-07-22 08:23:41,449 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:379) Starting Major Compaction of METADATA[0: .. ](default)
>>>>>> 2009-07-22 08:23:41,529 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:533) Finished Compaction of METADATA[0: .. ](default)
>>>>>> 2009-07-22 08:23:41,530 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:372) Starting InMemory Compaction of METADATA[0: .. ](location)
>>>>>> 2009-07-22 08:23:41,549 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:533) Finished Compaction of METADATA[0: .. ](location)
>>>>>> 2009-07-22 08:23:41,549 1378986304 Hypertable.RangeServer [INFO] (RangeServer/AccessGroup.cc:379) Starting Major Compaction of METADATA[0: .. ](logging)
>>>>>> 2009-07-22 08:23:41,552 1378986304 Hypertable.RangeServer [FATAL] (Common/BloomFilter.h:47) failed expectation: m_num_bits != 0
>>>>>>
>>>>>> It seems that the RangeServer cannot recover by replaying the log.
>>>>>>
>>>>>> What's the problem? How can I fix it?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -- kuer

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Hypertable Development" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/hypertable-dev?hl=en
-~----------~----~----~----~------~----~------~--~---
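
For readers following the FATAL line quoted in this thread (failed expectation: m_num_bits != 0): the thread does not show how Hypertable computes the filter size, so the following is only a sketch under the assumption that a Bloom filter is sized with the textbook formula m = -n * ln(p) / (ln 2)^2. Under that assumption, an expected item count of zero yields a zero bit count, which is one way such an assertion could trip. The function names and the min_bits floor are hypothetical and are not taken from the Hypertable source.

```python
import math


def bloom_num_bits(expected_items: int, false_positive_rate: float) -> int:
    """Textbook Bloom filter sizing: m = -n * ln(p) / (ln 2)^2, truncated.

    Assumption: this is the generic formula, not Hypertable's actual code.
    """
    if expected_items < 0:
        raise ValueError("expected_items must be non-negative")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")
    return int(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def bloom_num_bits_guarded(expected_items: int,
                           false_positive_rate: float,
                           min_bits: int = 64) -> int:
    """Same sizing, but never below a floor (min_bits is a hypothetical choice)."""
    return max(min_bits, bloom_num_bits(expected_items, false_positive_rate))


if __name__ == "__main__":
    # With no expected items the unguarded size collapses to zero bits.
    print(bloom_num_bits(0, 0.01))
    # The guarded variant falls back to the floor instead.
    print(bloom_num_bits_guarded(0, 0.01))
```

Whether an empty access group is what actually happened here cannot be told from the logs in this thread; the debug build and stack trace Sanjit asked for would be needed to confirm it.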
