I'm glad someone is reading the code :) I have some similar questions
about Doug's recent changes. I'll try my best to answer the questions.

On Oct 27, 1:55 am, donald <[EMAIL PROTECTED]> wrote:
> Hi Luke,
>
> While reading Hypertable source code these days, I've met some
> questions, would you please explain?
>
> About log clean up:
> 1. In RangeServer::log_cleanup() :
>    // skip root
>    if (!range_vec.empty() && range_vec[0]->end_row() == Key::END_ROOT_ROW)
>      range_vec.erase(range_vec.begin());
>
>    This indicates that root commit logs are never cleaned, but why?

The root range is an in-memory range that is never written to cell
stores, and it never splits either. So there is only one contiguous log
file, which doesn't need cleanup.

> 2. In the last if-statement of
> RangeServer::schedule_log_cleanup_compactions() :
>    // Purge the commit log
>    if (earliest_cached_revision != TIMESTAMP_NULL)
>      log->purge(earliest_cached_revision);
>
>    If earliest_cached_revision == TIMESTAMP_NULL, all cell caches of
> this range server should be empty, i.e. all cells are saved safely in
> cell store files. In this case, should we purge all commit logs on
> this range server instead of doing nothing?

I think this could be a bug. Have you tried fixing it to see whether it
solves the cleanup problem?
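A minimal sketch of the fix being suggested (the types and constants here
are hypothetical stand-ins, not the actual Hypertable API — the real
CommitLog and TIMESTAMP_NULL live in the source tree):

```cpp
#include <cstdint>
#include <limits>
#include <vector>

const int64_t TIMESTAMP_NULL = std::numeric_limits<int64_t>::min();

// Hypothetical stand-in: a log as a list of fragments, each tagged with
// the latest revision it contains.
struct CommitLog {
  std::vector<int64_t> fragment_revisions;
  // Drop every fragment whose contents are older than `revision`.
  void purge(int64_t revision) {
    std::vector<int64_t> kept;
    for (int64_t r : fragment_revisions)
      if (r >= revision)
        kept.push_back(r);
    fragment_revisions = kept;
  }
};

// Suggested cleanup logic: when nothing is cached, every cell has been
// flushed to cell stores, so the whole log can be purged instead of
// doing nothing.
void cleanup(CommitLog &log, int64_t earliest_cached_revision,
             int64_t latest_revision) {
  if (earliest_cached_revision != TIMESTAMP_NULL)
    log.purge(earliest_cached_revision);
  else
    log.purge(latest_revision + 1);  // nothing cached: purge everything
}
```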

> About fast recovery:
> 1. When a range server is replaying the commit log for fast recovery,
> how does it know which log entry to start from? Does it skip log
> entries that are already saved in cell stores?

Cell stores have a saved timestamp, which can be used to skip old log
entries.
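The replay filter amounts to something like the following sketch (the
names are illustrative, not the actual Hypertable code):

```cpp
#include <cstdint>
#include <vector>

struct LogEntry {
  int64_t revision;
  // ... key/value payload would go here ...
};

// Replay only the entries newer than the latest revision already
// persisted in the cell stores; older entries are safely on disk and
// can be skipped.
std::vector<LogEntry> entries_to_replay(const std::vector<LogEntry> &log,
                                        int64_t cellstore_revision) {
  std::vector<LogEntry> pending;
  for (const LogEntry &e : log)
    if (e.revision > cellstore_revision)
      pending.push_back(e);
  return pending;
}
```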

> 2. The range meta log doesn't contain cell store filenames. Instead,
> before reloading cell stores, range servers must read the METADATA
> table to get this information. This restricts the order of recovery:
> the root table must be recovered first, then METADATA, then user
> tables. When there are many range servers waiting to recover, how is
> this order guaranteed? Do they just retry again and again blindly, or
> do they make use of a coordinator?

They just keep retrying (the logic is in the range server client) until
the range is available.
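The retry behavior is roughly this pattern (a hedged sketch with
hypothetical names, not the actual range server client code):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Keep reissuing the request until the range becomes available, backing
// off between attempts. Returns false if the range never came up.
bool retry_until_available(const std::function<bool()> &try_request,
                           int max_attempts,
                           std::chrono::milliseconds backoff) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (try_request())
      return true;                       // range is available, request served
    std::this_thread::sleep_for(backoff);  // range not loaded yet; wait
  }
  return false;
}
```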

> Would it be better if we also log cell store filenames
> in the meta log?

We thought it would be harder to handle the potential race conditions
from the various compactions happening in the background. Plus you'd
need to update the filenames on every compaction, which bloats the
metalog even more. The current scheme only needs to handle the range
transactions themselves, which means the metalog is much smaller and
loads faster. The current scheme is also conceptually cleaner, IMHO.

> About range split:
> 1. When a range has a split log installed and is doing a major
> compaction, updates to the upper half of the range are added to the
> cell cache as usual, while the others get written into the split log.
> My question is: should these split-off updates be added to the cell
> cache as well? If not, these new cells won't be available for scans
> until the split is done and the lower half of the range is loaded on
> another server.

I think it's a good idea, especially with the current new cell cache
scheme; it would be harder to do correctly under the old scheme. I'm
sure Doug can provide more details.

> 2. When the major compaction is done, the original range first
> shrinks, then notifies the master to choose a new server and load the
> split-off range. I wonder if the notification could be sent before
> the shrink? The shrink and load-range processes should be able to
> work concurrently, and this change would shorten the offline time of
> the split-off range a little bit.

Yes, that could be an optimization, but the difference in latency at
this stage is less than a second.

> 3. How does RangeUpdateBarrier actually work? It looks like a
> semaphore, but why wouldn't a simple mutex work?

It's a way to do finer-grained locking. Think of multiple long-running
tasks (like compactions) that need to access a range, but not all the
time. Using a single mutex would serialize these long-running tasks.
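Conceptually it behaves like a counting barrier: any number of updates
may proceed concurrently, while a maintenance task (e.g. installing a
split log) can raise the barrier, block new updates, and wait for the
in-flight ones to drain. A sketch of that idea (not the actual
Hypertable implementation):

```cpp
#include <condition_variable>
#include <mutex>

class RangeUpdateBarrier {
  std::mutex m_mutex;
  std::condition_variable m_cond;
  int m_active_updates = 0;
  bool m_barrier_up = false;

public:
  void enter() {  // called by each update; blocks while the barrier is up
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_barrier_up; });
    ++m_active_updates;
  }
  void exit() {  // update finished
    std::lock_guard<std::mutex> lock(m_mutex);
    if (--m_active_updates == 0)
      m_cond.notify_all();
  }
  void put_up() {  // block new updates and wait for in-flight ones to drain
    std::unique_lock<std::mutex> lock(m_mutex);
    m_barrier_up = true;
    m_cond.wait(lock, [this] { return m_active_updates == 0; });
  }
  void take_down() {  // let updates resume
    std::lock_guard<std::mutex> lock(m_mutex);
    m_barrier_up = false;
    m_cond.notify_all();
  }
};
```

A plain mutex would make each update hold the lock for the whole
operation; here updates only synchronize on entry and exit, so they run
concurrently between barrier raises.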

> Others:
> 1. What does RangeState::soft_limit mean? It's calculated and passed
> around, but never actually used in the code.

It's used to determine when to split: disk_usage >
range_vector[rangei].range_ptr->get_size_limit() in
RangeServer::update(). soft_limit dynamically approaches the maximum
range size after each split. The goal is to have lower split limits
early on, so you can leverage parallel updates sooner.
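As an illustration of the growth behavior (a sketch under the assumption
that the soft limit doubles per split and is capped at the configured
maximum range size; the function name is hypothetical):

```cpp
#include <cstdint>

// After each split, double the soft limit until it reaches the
// configured maximum range size. Young tables therefore split (and
// gain parallelism) early, while mature ranges settle at the full
// limit.
int64_t next_soft_limit(int64_t soft_limit, int64_t max_range_bytes) {
  soft_limit *= 2;
  return soft_limit > max_range_bytes ? max_range_bytes : soft_limit;
}
```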

> 2. MergeScanner::m_cell_cutoff is also calculated but not used. Based
> on my understanding, it is meant to implement cell TTL, and it seems
> only a few more lines of code are missing to make this feature
> effective. What are the concerns with not implementing it right now?

Seems like it. Doug would know more about this one.

>
> 3. In AccessGroup::run_compaction(), m_compression_ratio is simply the
> average of all the cell stores' compression ratios. Would a weighted
> average be better?

Yes, a weighted average would be better in this case.
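Something like the following, weighting each store's ratio by its disk
usage (an illustrative sketch; the struct and field names are
hypothetical, not Hypertable's actual types):

```cpp
#include <vector>

struct CellStoreInfo {
  double disk_usage;         // bytes on disk
  double compression_ratio;  // compressed size / uncompressed size
};

// Weighted average: a large cell store influences the aggregate ratio
// in proportion to its size, instead of counting the same as a tiny one.
double weighted_compression_ratio(const std::vector<CellStoreInfo> &stores) {
  double total = 0.0, weighted = 0.0;
  for (const CellStoreInfo &s : stores) {
    total += s.disk_usage;
    weighted += s.disk_usage * s.compression_ratio;
  }
  return total > 0.0 ? weighted / total : 1.0;  // empty: no compression
}
```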

__Luke
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Hypertable Development" group.
To post to this group, send email to hypertable-dev@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/hypertable-dev?hl=en
-~----------~----~----~----~------~----~------~--~---
