[hypertable-dev] Re: Questions about the code

Liu Kejia (Donald) Mon, 27 Oct 2008 18:51:41 -0700

Thanks very much, Luke!

I'm going to read your reply later. Please be prepared to answer more
questions then :)


Donald

On Tue, Oct 28, 2008 at 1:32 AM, Luke <[EMAIL PROTECTED]> wrote:
>
> I'm glad someone is reading the code :) I have some similar questions
> about Doug's recent changes. I'll try my best to answer the questions.
>
> On Oct 27, 1:55 am, donald <[EMAIL PROTECTED]> wrote:
>> Hi Luke,
>>
>> While reading the Hypertable source code these days, I've run into
>> some questions. Would you please explain?
>>
>> About log clean up:
>> 1. In RangeServer::log_cleanup() :
>>    // skip root
>>    if (!range_vec.empty() && range_vec[0]->end_row() ==
>> Key::END_ROOT_ROW)
>>      range_vec.erase(range_vec.begin());
>>
>>    This indicates that root commit logs are never cleaned, but why?
>
> The root range is an IN MEMORY range, which is never written to cell
> stores. It never splits either. So you only have one contiguous log
> file that doesn't need cleanup.
>
>> 2. In the last if-statement of
>> RangeServer::schedule_log_cleanup_compactions() :
>>    // Purge the commit log
>>    if (earliest_cached_revision != TIMESTAMP_NULL)
>>      log->purge(earliest_cached_revision);
>>
>>    If earliest_cached_revision == TIMESTAMP_NULL, all cell caches of
>> this range server should be empty, i.e. all cells are saved safely in
>> cell store files. In this case, should we purge all commit logs on
>> this range server instead of doing nothing?
>
> I think this could be a bug. Have you tried to fix it and see if it
> solves the cleanup problem?
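
One way the suggested fix could look, as a minimal sketch rather than the actual RangeServer code. ToyCommitLog, its fragment map, and purge_after_compactions are hypothetical stand-ins, and the value chosen for TIMESTAMP_NULL here is an assumption: when no cell cache holds any revision, the purge cutoff becomes "everything" instead of the purge being skipped.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Assumed sentinel meaning "no cached revision"; the real constant may differ.
const int64_t TIMESTAMP_NULL = std::numeric_limits<int64_t>::min();

// Hypothetical toy commit log: fragment name -> newest revision in that fragment.
struct ToyCommitLog {
  std::map<std::string, int64_t> fragments;

  // Remove every fragment whose newest revision is older than `cutoff`.
  // Returns the number of fragments removed.
  int purge(int64_t cutoff) {
    int removed = 0;
    for (auto it = fragments.begin(); it != fragments.end();) {
      if (it->second < cutoff) {
        it = fragments.erase(it);
        ++removed;
      } else {
        ++it;
      }
    }
    return removed;
  }
};

// If no cell cache holds anything, every logged cell is already in a cell
// store, so all fragments are eligible instead of none.
int purge_after_compactions(ToyCommitLog &log, int64_t earliest_cached_revision) {
  if (earliest_cached_revision == TIMESTAMP_NULL)
    return log.purge(std::numeric_limits<int64_t>::max());
  return log.purge(earliest_cached_revision);
}
```

A real fix would also have to make sure no update is logged between inspecting the caches and purging, and that the fragment currently being written is left alone. The sketch ignores both.
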
>
>> About fast recovery:
>> 1. When a range server is replaying the commit log for fast recovery,
>> how does it know which log entry to start from? Does it skip those log
>> entries that are already saved in cell stores anyway?
>
> Cell stores have a saved timestamp, which can be used to skip old log
> entries.
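
The timestamp check described above can be sketched roughly as follows. LogEntry and entries_to_replay are hypothetical names, and the assumption is that each log entry carries a revision that can be compared against the newest revision persisted in the range's cell stores.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified commit log entry.
struct LogEntry {
  int64_t revision;
  std::string key;
  std::string value;
};

// Keep only entries newer than what the cell stores already contain.
// Entries at or below `cell_store_revision` were persisted by a compaction
// and do not need to be replayed.
std::vector<LogEntry> entries_to_replay(const std::vector<LogEntry> &log,
                                        int64_t cell_store_revision) {
  std::vector<LogEntry> out;
  for (const auto &entry : log) {
    if (entry.revision > cell_store_revision)
      out.push_back(entry);
  }
  return out;
}
```
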
>
>> 2. The range meta log doesn't contain cell store filenames. Instead,
>> before reloading cell stores, range servers must read METADATA table
>> to get this information. This restricts the order of recovery: root
>> table must be recovered first, then METADATA, then user tables. When
>> there are many range servers waiting to recover, how is this order
>> guaranteed? Do they just retry again and again blindly or make use of
>> a coordinator?
>
> They just keep retrying (the logic is in range server client) until
> the range is available.
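
A minimal sketch of the "keep retrying" approach, not the actual range server client logic. retry_until_available, the attempt cap, and the fixed delay are all illustrative assumptions; try_once stands for whatever call fails while the needed range is still offline.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Call `try_once` until it succeeds or `max_attempts` is exhausted,
// sleeping `delay` between attempts. Returns true on success.
bool retry_until_available(const std::function<bool()> &try_once,
                           int max_attempts,
                           std::chrono::milliseconds delay) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (try_once())
      return true;
    // No point in sleeping after the final failed attempt.
    if (attempt + 1 < max_attempts)
      std::this_thread::sleep_for(delay);
  }
  return false;
}
```

With this scheme the recovery order (root, then METADATA, then user tables) is not enforced by anyone. It emerges because a server loading a user range cannot read METADATA until the servers holding it have come back.
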
>
>> Would it be better if we also log cell store filenames
>> in the meta log?
>
> We thought it would be harder to handle the potential race conditions
> from various compactions happening in the background. Plus you'd need
> to update the filenames on every compaction, which bloats the metalog
> even more. The current scheme only needs to handle the range
> transaction itself, which means the metalog is much smaller and loads
> faster. The current scheme is conceptually cleaner, IMHO.
>
>> About range split:
>> 1. When a range has a split log installed and is doing a major
>> compaction, updates going to the upper half of the range are added to
>> the cell cache as usual, while the others get written into the split
>> log. My question is: should these split-off updates be added to the
>> cell cache as well? If not, these new cells won't be available for
>> scans until the split is done and the lower half of the range is
>> loaded on another server.
>
> I think it's a good idea, especially with the new cell cache scheme.
> It was harder to do correctly in the old schemes. I'm sure Doug can
> provide more details.
>
>> 2. When the major compaction is done, the original range first shrinks
>> and then notifies the master to choose a new server and load the
>> split-off range. I wonder if the notification could be sent before the
>> shrink? The shrink and load-range processes should be able to work
>> concurrently, so this change should shorten the offline time of the
>> split-off range a little bit.
>
> Yes, that could be an optimization, though the difference in latency
> at this stage is less than a second.
>
>> 3. How does RangeUpdateBarrier actually work? It looks like a
>> semaphore, but why does a simple mutex not work?
>
> It's a way to do finer-grained locking. Think about multiple
> long-running tasks (like compactions) that need to access a range (but
> not all the time). Using a mutex would serialize these long-running
> tasks.
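
To make the difference from a plain mutex concrete, here is a minimal sketch of an update barrier built on a mutex and a condition variable. It is an illustration of the general technique, not the actual RangeUpdateBarrier. The class name and all method names (enter, try_enter, exit, put_up, take_down, in_flight) are assumptions. Any number of updates may be inside at once. A maintenance task raises the barrier only for the short window in which it needs the range quiescent, then lowers it and carries on with its long-running work.

```cpp
#include <condition_variable>
#include <mutex>

class UpdateBarrier {
public:
  // Called by an update before touching the range; blocks while the barrier is up.
  void enter() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_up; });
    ++m_in_flight;
  }

  // Non-blocking variant: returns false if the barrier is currently up.
  bool try_enter() {
    std::lock_guard<std::mutex> lock(m_mutex);
    if (m_up)
      return false;
    ++m_in_flight;
    return true;
  }

  // Called by an update when it is done with the range.
  void exit() {
    std::lock_guard<std::mutex> lock(m_mutex);
    if (--m_in_flight == 0)
      m_cond.notify_all();
  }

  // Called by a maintenance task: stop new updates, then wait for the
  // updates already in flight to drain.
  void put_up() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_up; });  // one raiser at a time
    m_up = true;
    m_cond.wait(lock, [this] { return m_in_flight == 0; });
  }

  // Called by the maintenance task when updates may proceed again.
  void take_down() {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_up = false;
    m_cond.notify_all();
  }

  int in_flight() {
    std::lock_guard<std::mutex> lock(m_mutex);
    return m_in_flight;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  bool m_up = false;     // true while a maintenance task holds the barrier
  int m_in_flight = 0;   // number of updates currently inside
};
```

The internal mutex is held only for a few instructions at a time, so neither updates nor compactions hold a lock for the duration of their work.
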
>
>> Others:
>> 1. What does RangeState::soft_limit mean? It's calculated and passed
>> around, but never actually used in the code.
>
> It's used to decide when to split: disk_usage >
> range_vector[rangei].range_ptr->get_size_limit() in
> RangeServer::update. soft_limit dynamically approaches the max range
> bytes after each split. The goal is to have lower split limits early
> on, so you can leverage parallel updates earlier.
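
A rough sketch of how such a limit could grow. The doubling policy is an assumption made for illustration, since the message only says the limit approaches the maximum after each split. next_soft_limit and should_split are hypothetical helper names.

```cpp
#include <cstdint>

// Assumed policy: double the soft limit after each split, capped at the
// configured maximum range size. The comparison form avoids overflow.
uint64_t next_soft_limit(uint64_t soft_limit, uint64_t max_range_bytes) {
  if (soft_limit > max_range_bytes / 2)
    return max_range_bytes;
  return soft_limit * 2;
}

// Mirrors the check quoted above: split once disk usage exceeds the limit.
bool should_split(uint64_t disk_usage, uint64_t soft_limit) {
  return disk_usage > soft_limit;
}
```

Starting with a small limit means a freshly created table splits after little data, so its ranges spread across servers sooner.
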
>
>> 2. MergeScanner::m_cell_cutoff is also calculated but not used. Based
>> on my understanding, it is meant to implement TTL for cells, and it
>> seems only a few more lines of code are missing to make this feature
>> effective. What are the concerns about implementing it right now?
>
> Seems like it.  Doug would know more about this one.
>
>>
>> 3. In AccessGroup::run_compaction(), m_compression_ratio is simply the
>> average of all cell stores' compression ratios. Would a weighted
>> average be better?
>
> Yes, a weighted average would be better in this case.
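
The difference between the two averages, as a small sketch. StoreStat and both functions are hypothetical. Weighting by store size in bytes is an assumption, and the right weight depends on how the ratio itself is defined.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-cell-store statistics.
struct StoreStat {
  double compression_ratio;
  uint64_t bytes;  // size used as the weight
};

// Plain mean: every store counts equally, regardless of size.
double simple_average(const std::vector<StoreStat> &stores) {
  if (stores.empty())
    return 0.0;  // arbitrary choice for the empty case
  double sum = 0.0;
  for (const auto &s : stores)
    sum += s.compression_ratio;
  return sum / static_cast<double>(stores.size());
}

// Size-weighted mean: large stores dominate, small ones barely matter.
double weighted_average(const std::vector<StoreStat> &stores) {
  double weighted_sum = 0.0;
  double total_bytes = 0.0;
  for (const auto &s : stores) {
    weighted_sum += s.compression_ratio * static_cast<double>(s.bytes);
    total_bytes += static_cast<double>(s.bytes);
  }
  if (total_bytes == 0.0)
    return 0.0;  // arbitrary choice when there is nothing to weigh
  return weighted_sum / total_bytes;
}
```

For a 100-byte store with ratio 0.5 and a 900-byte store with ratio 0.1, the plain mean is 0.3 while the weighted mean is 0.14, which is much closer to what the access group as a whole achieves.
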
>
> __Luke
