Thanks very much, Luke! I'm going to read your reply later. Please be prepared to answer more questions then :)
Donald

On Tue, Oct 28, 2008 at 1:32 AM, Luke <[EMAIL PROTECTED]> wrote:
>
> I'm glad someone is reading the code :) I have some similar questions
> about Doug's recent changes. I'll try my best to answer the questions.
>
> On Oct 27, 1:55 am, donald <[EMAIL PROTECTED]> wrote:
>> Hi Luke,
>>
>> While reading the Hypertable source code these days, I've run into
>> some questions. Would you please explain?
>>
>> About log clean up:
>> 1. In RangeServer::log_cleanup() :
>>     // skip root
>>     if (!range_vec.empty() && range_vec[0]->end_row() ==
>>         Key::END_ROOT_ROW)
>>       range_vec.erase(range_vec.begin());
>>
>> This indicates that root commit logs are never cleaned, but why?
>
> The root range is an IN MEMORY range, which is never written to cell
> stores. It never splits either. So you only have one contiguous log
> file that doesn't need cleanup.
>
>> 2. In the last if-statement of
>> RangeServer::schedule_log_cleanup_compactions() :
>>     // Purge the commit log
>>     if (earliest_cached_revision != TIMESTAMP_NULL)
>>       log->purge(earliest_cached_revision);
>>
>> If earliest_cached_revision == TIMESTAMP_NULL, all cell caches of
>> this range server should be empty, i.e. all cells are saved safely in
>> cell store files. In this case, should we purge all commit logs on
>> this range server instead of doing nothing?
>
> I think this could be a bug. Have you tried to fix it and see if it
> solves the cleanup problem?
>
>> About fast recovery:
>> 1. When a range server is replaying the commit log for fast recovery,
>> how does it know which log entry to start from? Does it skip those log
>> entries that are already saved in cell stores anyway?
>
> Cell stores have a saved timestamp, which can be used to skip old log
> entries.
>
>> 2. The range meta log doesn't contain cell store filenames. Instead,
>> before reloading cell stores, range servers must read the METADATA
>> table to get this information.
>> This restricts the order of recovery: the root
>> table must be recovered first, then METADATA, then user tables. When
>> there are many range servers waiting to recover, how is this order
>> guaranteed? Do they just retry again and again blindly or make use of
>> a coordinator?
>
> They just keep retrying (the logic is in the range server client) until
> the range is available.
>
>> Would it be better if we also log cell store filenames
>> in the meta log?
>
> We thought it would be harder to handle the potential race conditions
> from various compactions happening in the background. Plus you'd need
> to update the filenames on every compaction, which bloats the metalog
> even more. The current scheme only needs to handle the range
> transaction itself, which means the metalog itself is much smaller and
> loads faster. The current scheme is conceptually cleaner, IMHO.
>
>> About range split:
>> 1. Suppose a range has a split log installed and is doing a major
>> compaction. If there are updates to this range, the updates going to
>> the upper half range are added to the cell cache as usual, while the
>> others get written into the split log. My question is: should these
>> split-off updates be added into the cell cache also? If not, these new
>> cells won't be available for scans until the split is done and the
>> lower half range is loaded on another server.
>
> I think it's a good idea, especially in the current new cell cache
> scheme. It's harder to do it correctly in the old schemes. I'm sure
> Doug can provide more details.
>
>> 2. When the major compaction is done, the original range first shrinks
>> and then notifies the master to choose a new server and load the
>> split-off range. I wonder if the notification could be sent before the
>> shrink? The shrink and load range processes should be able to work
>> concurrently, so this change should shorten the offline time of the
>> split-off range a little bit.
>
> Yes, that could be an optimization.
> But the difference in latency at this stage is less than a second,
> though.
>
>> 3. How does RangeUpdateBarrier actually work? It looks like a
>> semaphore, but why does a simple mutex not work?
>
> It's a way to do finer-grained locking. Think about multiple
> long-running tasks (like compactions) that need to access a range (but
> not all the time). Using a mutex would serialize these long-running
> tasks.
>
>> Others:
>> 1. What does RangeState::soft_limit mean? It's calculated and passed
>> around, but never actually used in the code.
>
> It's used to determine split: disk_usage >
> range_vector[rangei].range_ptr->get_size_limit() in
> RangeServer::update. soft_limit dynamically approaches the max range
> bytes after each split. The goal is to have lower split limits early
> on, so you can leverage parallel updates earlier.
>
>> 2. MergeScanner::m_cell_cutoff is also calculated but not used. Based
>> on my understanding, I think it is meant to implement cell TTL, and it
>> seems there are only a few more lines of code missing to make this
>> feature effective. What are the concerns about implementing it right
>> now?
>
> Seems like it. Doug would know more about this one.
>
>>
>> 3. In AccessGroup::run_compaction(), m_compression_ratio is simply the
>> average of all cell stores' compression ratios. Would a weighted
>> average be better?
>
> Yes, a weighted average would be better in this case.
>
> __Luke
>
> You received this message because you are subscribed to the Google
> Groups "Hypertable Development" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to [EMAIL PROTECTED]
> For more options, visit this group at
> http://groups.google.com/group/hypertable-dev?hl=en
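
On the last point (item 3 under "Others"), here is a rough sketch of how a simple average and a size-weighted average of per-cell-store compression ratios can differ. This is not Hypertable's code: the CellStoreStats struct, its fields, the two function names, and the choice of uncompressed size as the weight are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-cell-store summary; not Hypertable's actual types.
struct CellStoreStats {
  double compression_ratio;          // ratio reported by one cell store
  std::uint64_t uncompressed_bytes;  // assumed weight: uncompressed size
};

// Plain mean: every cell store counts equally, regardless of its size.
double simple_average(const std::vector<CellStoreStats>& stores) {
  if (stores.empty()) return 1.0;  // assumed default when there is no data
  double sum = 0.0;
  for (const auto& s : stores) sum += s.compression_ratio;
  return sum / static_cast<double>(stores.size());
}

// Size-weighted mean: large cell stores influence the result more.
double weighted_average(const std::vector<CellStoreStats>& stores) {
  double weighted_sum = 0.0;
  double total_bytes = 0.0;
  for (const auto& s : stores) {
    const double bytes = static_cast<double>(s.uncompressed_bytes);
    weighted_sum += s.compression_ratio * bytes;
    total_bytes += bytes;
  }
  // Fall back to the plain mean if all weights are zero.
  if (total_bytes == 0.0) return simple_average(stores);
  return weighted_sum / total_bytes;
}
```

With one small store at ratio 0.5 and one three times larger at ratio 0.25, the plain mean sits halfway between the two, while the weighted mean moves toward the larger store's ratio, which is presumably closer to what the access group as a whole achieves.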
