Hi Doug, thanks for the reply.  I'm glad to hear of your upcoming update.
I'll put my investigation on hold for a while then (well, I have to anyway
since I'm going on vacation).

Our current strategy is:
  1. Store things in Hypertable until something crashes
  2. Wipe out all the commit logs (losing some data)


On Mon, Sep 15, 2008 at 9:32 PM, Doug Judd <[EMAIL PROTECTED]> wrote:

> Hi Josh,
>
> Thanks for investigating!  Comments inline ...
>
> On Mon, Sep 15, 2008 at 4:41 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>
>> I've managed to load about 2-3 TB into a Hypertable cluster with 15
>> RangeServers.  I'm still having problems with the RangeServers failing to
>> clean up their old logs.
>>
>> Server              User commit log size (MB)
>> 10.10.1.194_38060   51512.8
>> 10.10.1.200_38060   48050.6
>> 10.10.1.199_38060   12464.7
>> 10.10.1.190_38060   90.8244
>> 10.10.1.193_38060   86.9587
>> 10.10.1.188_38060   70.4194
>> 10.10.1.191_38060   58.9111
>> 10.10.1.198_38060   48.7823
>> 10.10.1.192_38060   41.6324
>> 10.10.1.202_38060   34.0847
>> 10.10.1.189_38060   28.5852
>> 10.10.1.197_38060   22.3202
>> 10.10.1.196_38060   17.3054
>> 10.10.1.201_38060   14.2905
>> 10.10.1.195_38060   0
>>
>> No new updates have been added to this cluster in about 12 hours.  Most of
>> the log directories look fine, but a few have gigabytes of logs that should
>> be cleaned up.
>>
>> I've added some debug logging to try to isolate what's happening.  For
>> each call to CommitLog::purge(), I'm logging the cutoff timestamp used.
>> I've also added messages to print out each of the remaining log fragments
>> after the purge loop breaks out:
>>
>> diff --git a/src/cc/Hypertable/Lib/CommitLog.cc b/src/cc/Hypertable/Lib/CommitLog.cc
>> index a6b8920..ea0f894 100644
>> --- a/src/cc/Hypertable/Lib/CommitLog.cc
>> +++ b/src/cc/Hypertable/Lib/CommitLog.cc
>> @@ -232,6 +232,8 @@ int CommitLog::purge(uint64_t timestamp) {
>>    CommitLogFileInfo file_info;
>>    String fname;
>>
>> +  HT_DEBUGF("purge: %llu", timestamp);
>> +
>>    try {
>>
>>      while (!m_fragment_queue.empty()) {
>> @@ -243,7 +245,16 @@ int CommitLog::purge(uint64_t timestamp) {
>>          HT_INFOF("Removed log fragment file='%s' timestamp=%lld", fname.c_str(), file_info.timestamp);
>>        }
>>        else {
>> -        //HT_INFOF("LOG FRAGMENT PURGE breaking because %lld >= %lld", file_info.timestamp, timestamp);
>> +        HT_DEBUGF("LOG FRAGMENT PURGE breaking because %lld >= %lld, queue size = %llu",
>> +                 file_info.timestamp, timestamp, m_fragment_queue.size());
>> +        size_t i = 0;
>> +        foreach(CommitLogFileInfo const & info, m_fragment_queue)
>> +        {
>> +            ++i;
>> +            HT_DEBUGF("  Fragment %llu: time=%lld size=%llu name=%s%u",
>> +                      i, info.timestamp, info.size, info.log_dir.c_str(),
>> +                      info.num);
>> +        }
>>          break;
>>        }
>>      }
>>
>> What I find is that the purge() function is still being called regularly,
>> but that the timestamp passed in becomes stuck before the oldest fragment's
>> timestamp and never changes in later calls.
>>
>> Sometimes the timestamp threshold is just a few minutes earlier than the
>> oldest log's timestamp.  Sometimes the passed-in threshold value is 1, which
>> seems like a suspicious number.  Here are the first few lines of a recent
>> purge from the 52 GB server:
>>
>> 1221520163 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge: 1
>> 1221520163 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
>> PURGE breaking because 1221271063903383000 >= 1, queue size = 831
>> 1221520163 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
>> time=1221271063903383000 size=67436344
>> name=/hypertable/servers/10.10.1.194_38060/log/user/3
>> 1221520163 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
>> time=1221271046265865000 size=896929
>> name=/hypertable/servers/10.10.1.193_38060/log/9F4D8D5249B24A2C52F06D2B/0
>> 1221520163 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
>> time=1221271374609078000 size=100133731
>> name=/hypertable/servers/10.10.1.194_38060/log/user/4
>>
>> Yes, the log is currently holding 831 fragments.  Also, I see that
>> compactions are still happening on this server, but they start and finish
>> within the same second.
>>
>> Here's a purge from the 48 GB server:
>>
>> 1221520459 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge:
>> 1221327270724907001
>> 1221520459 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
>> PURGE breaking because 1221327332648604000 >= 1221327270724907001, queue
>> size = 622
>> 1221520459 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
>> time=1221327332648604000 size=100006176
>> name=/hypertable/servers/10.10.1.200_38060/log/user/916
>> 1221520459 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
>> time=1221327395726285000 size=100147244
>> name=/hypertable/servers/10.10.1.200_38060/log/user/917
>> 1221520459 DEBUG Hypertable.RangeServer :
>> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
>> time=1221327503701781000 size=100179860
>> name=/hypertable/servers/10.10.1.200_38060/log/user/918
>>
>> This one has 622 fragments.  The last logged compaction on this server was
>> 25 hours ago, though that compaction apparently did some non-trivial work;
>> at least, there is some time between its begin and end log messages.
>>
>
> This definitely sheds light on the problem.  I think I've figured out how
> the purge timestamp can get set to 1.  It's set by the following line in
> AccessGroup::run_compaction():
>
>     m_oldest_cached_timestamp = (m_cell_cache_ptr->size() > 0) ? timestamp.real + 1 : 0;
>
> If, after a split, the split-off portion has a few updates in the split log
> but never receives any updates thereafter, then the range could get
> compacted with timestamp.real == 0, causing the above line to set
> "m_oldest_cached_timestamp" to 1.  I'll need to investigate a little further
> to figure out the other case you point out.
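>
> To see the failure mode in isolation, here's a minimal standalone sketch
> (hypothetical helper names; it just restates the quoted line outside the
> real code):
>
>     #include <cstddef>
>     #include <cstdint>
>     #include <iostream>
>
>     struct Timestamp { int64_t real; int64_t logical; };
>
>     // Mirrors the assignment in AccessGroup::run_compaction() above.
>     int64_t oldest_cached(std::size_t cache_size, Timestamp ts) {
>       return (cache_size > 0) ? ts.real + 1 : 0;
>     }
>
>     int main() {
>       // A split-off range with a few cached cells but timestamp.real == 0
>       // (no updates ever arrived after the split):
>       std::cout << oldest_cached(5, Timestamp{0, 0}) << std::endl;  // 1
>       // CommitLog::purge(1) then removes nothing, since every fragment
>       // timestamp is >= 1, and the fragment queue grows without bound.
>     }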
>
>
>> The log fragments appear to be in increasing time order, though a lot of
>> the fragments seem to be mostly empty.  For example, on the 70 MB server,
>> the first fragment contains 70 MB of data followed by 353 fragments with
>> size=101.
>>
>
> The issue here is that when a range splits, the first thing that happens is
> that a split log gets installed.  The split log is for capturing any updates
> to the portion of the range that is getting split off.  When the split
> finally completes, the RangeServer that receives the newly split-off portion
> will be notified with the name of the split log and will replay it after it
> loads the range.  After it finishes replaying the split log, it will link
> the split log into the range's primary commit log.  This is just a commit
> log entry that contains the name of the log to be linked in.  When it does
> this, it rolls the log.  So, in this "degenerate" primary-key-order insert,
> you have a bunch of split logs that basically contain nothing being linked
> into commit logs that are not receiving any commits.  That's why you have a
> bunch of tiny log fragments that just contain a single link entry.  There's
> a program called 'dumplog' that you can use to dump out a commit log.  Point
> it at the log directory and it should print out a summary of the entries.
>
>
>> I've looked around a bit at where these purge thresholds come from.  When
>> the purge starts, the code tries to get the oldest cached timestamp from
>> each access group, then initiates a purge of logs older than that.  This
>> timestamp seems to be the thing getting stuck (or set to 1), so I tried to
>> find out where it gets set.  I think it's set in each access group, after a
>> compaction, to the timestamp of the compaction request (plus 1).  Where the
>> compaction timestamp comes from is a bit of a mystery to me.  The range
>> seems to schedule the compactions either from m_state.timestamp or
>> m_timestamp, depending on whether the compaction is initiated in
>> Range::run_compaction or Range::split_compact_and_shrink.  Also, I don't
>> understand what's going on with Timestamp.real vs. Timestamp.logical.  The
>> flow of this code is just a bit hard for me to track.  ;)
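>>
>> To check my understanding, here's roughly how I picture the cutoff being
>> computed (hypothetical names, not the actual RangeServer code):
>>
>>     #include <algorithm>
>>     #include <cstdint>
>>     #include <vector>
>>
>>     struct AccessGroup { uint64_t oldest_cached_timestamp; };
>>
>>     // The purge cutoff is the minimum oldest-cached timestamp over all
>>     // access groups; no fragment at or newer than it may be removed.  One
>>     // group stuck at 1 therefore pins the cutoff and stalls purging.
>>     uint64_t purge_cutoff(const std::vector<AccessGroup> &groups) {
>>       uint64_t cutoff = UINT64_MAX;
>>       for (const AccessGroup &ag : groups)
>>         cutoff = std::min(cutoff, ag.oldest_cached_timestamp);
>>       return cutoff;
>>     }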
>>
>
> This code is definitely questionable and obviously broken.  I've spent the
> past week and a half overhauling all of this logic based on two of your
> suggestions:
>
> 1. Freeze and copy the CellCache during a compaction
> 2. Separate the timestamp from the revision number
>
> I'm nearly finished with this overhaul, but have a few more system tests
> that I need to get passing.  I hope to have this checked in within two or
> three days.  These changes have cleaned up the code tremendously.  Now, the
> m_oldest_cached_timestamp variable gets set on each call to
> AccessGroup::add() and is no longer set by the line above.  Plus, I've
> combined add() and replay_add() into just add().
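>
> In sketch form, the new bookkeeping looks something like this (a rough
> sketch of the idea only; the real AccessGroup does much more):
>
>     #include <cstddef>
>     #include <cstdint>
>
>     struct Key { uint64_t timestamp; };
>
>     class AccessGroup {
>       uint64_t m_oldest_cached_timestamp = 0;
>       std::size_t m_cached_cells = 0;
>     public:
>       // The oldest cached timestamp now tracks the first cell added since
>       // the last compaction, instead of being derived from the compaction
>       // timestamp, so it can no longer get pinned at 1.
>       void add(const Key &key) {
>         if (m_cached_cells++ == 0)
>           m_oldest_cached_timestamp = key.timestamp;
>       }
>     };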
>
>> Anyway, that's as far as I've managed to go on this problem so far.  I was
>> wondering if anyone on the list who understands this code better might be
>> able to look at the problem description and say, "A-ha!  I know what's going
>> wrong here!"  (And hopefully issue a patch.)  Otherwise, I'll continue
>> unravelling a bit at a time.
>>
>
> I should have the new code checked in within a few days, which should take
> care of these problems, so I wouldn't bother spending too much time on
> this.  Thanks for all of your help.
>
> - Doug
