Hi Jonathan,

Thanks for the reply!

It would probably be enough to just ride along on the major compactions (and have those happen daily, as per the default). We are not trying to evict older versions (AFAIK HBase has built-in support for that). Yes, we were combining multiple versions into a single value.

It is not the read performance of fetching multiple versions that I am worried about; it is the cost of combining all the versions into a single value after each read. New versions are added every five minutes typically, but we also have some 10 to 15 years' worth of historical data. So even though the raw reads from HBase will usually be fast enough, we would still need to do the math of combining all the versions (which can be a couple of years' worth of five-minute intervals) into one usable object after the read, and that will slow things down. It is faster for us to add one version to an existing, already combined value (which is what we do on insertion currently) than to combine all versions every time we need the combined value (see the insert-path sketch below).

I have looked into Coprocessors and it is a promising development (see the compaction-hook sketch below). Also, patching the code ourselves crossed my mind, but unfortunately we have some other hacking to do as it is... For now, we are just going to see how far we get with our current solution (the preliminary performance numbers don't look too worrying).
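To make the insert path concrete, here is roughly what it looks like today. The table and column names are made up for the example and the timeline merge is reduced to a stand-in, so treat this as a sketch rather than our actual code:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineUpdater {
      // Hypothetical names; ours are different.
      private static final byte[] FAMILY = Bytes.toBytes("d");
      private final HTable table;

      public TimelineUpdater() throws IOException {
        this.table = new HTable(new HBaseConfiguration(), "routes");
      }

      // The record is the qualifier; seenAt is the new timestamp for its timeline.
      public void addTimestamp(byte[] row, byte[] record, long seenAt)
          throws IOException {
        // 1. Get the current combined timeline for this record.
        Get get = new Get(row);
        get.addColumn(FAMILY, record);
        Result result = table.get(get);
        byte[] timeline = result.getValue(FAMILY, record);

        // 2. Merge the new timestamp into the timeline in memory.
        byte[] merged = merge(timeline, seenAt);

        // 3. Put the combined value back.
        Put put = new Put(row);
        put.add(FAMILY, record, merged);
        table.put(put);
      }

      // Stand-in for the real merge; our actual timeline format is more involved.
      private byte[] merge(byte[] timeline, long seenAt) {
        byte[] ts = Bytes.toBytes(seenAt);
        if (timeline == null) {
          return ts;
        }
        byte[] merged = new byte[timeline.length + ts.length];
        System.arraycopy(timeline, 0, merged, 0, timeline.length);
        System.arraycopy(ts, 0, merged, timeline.length, ts.length);
        return merged;
      }
    }

The cost is the get plus the in-memory merge on every incoming update; with compaction-time combining, both would disappear from the write path and only the put would remain.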
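As for the compaction hook itself: if the Coprocessors work ends up exposing a pre-compaction point where you can wrap the scanner the compaction reads from, I would imagine it looking something like the sketch below. The RegionObserver interface shown is my guess at the shape of that API, not something that exists today, and the version merging is again reduced to a stand-in:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.Store;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineCompactionObserver extends BaseRegionObserver {

      // Assumed hook: wrap the scanner that compaction reads from, so every
      // KeyValue passes through us before being written to the new HFile.
      @Override
      public InternalScanner preCompact(
          ObserverContext<RegionCoprocessorEnvironment> ctx,
          Store store, final InternalScanner scanner) {
        return new InternalScanner() {
          public boolean next(List<KeyValue> results) throws IOException {
            List<KeyValue> raw = new ArrayList<KeyValue>();
            boolean more = scanner.next(raw);
            // Versions of the same row/column arrive adjacent and newest
            // first, so collapse each run into a single KeyValue. Simplified:
            // assumes a run never spans two next() batches.
            int i = 0;
            while (i < raw.size()) {
              int j = i + 1;
              while (j < raw.size() && sameColumn(raw.get(i), raw.get(j))) {
                j++;
              }
              results.add(mergeVersions(raw.subList(i, j)));
              i = j;
            }
            return more;
          }

          public boolean next(List<KeyValue> results, int limit)
              throws IOException {
            return next(results); // limit handling elided in this sketch
          }

          public void close() throws IOException {
            scanner.close();
          }
        };
      }

      private static boolean sameColumn(KeyValue a, KeyValue b) {
        return Bytes.equals(a.getRow(), b.getRow())
            && Bytes.equals(a.getFamily(), b.getFamily())
            && Bytes.equals(a.getQualifier(), b.getQualifier());
      }

      // Stand-in: the real code would combine all versioned timelines into
      // one value; here we just keep the newest version.
      private static KeyValue mergeVersions(List<KeyValue> versions) {
        return versions.get(0);
      }
    }

With something like that in place, inserts become plain puts of new versions, the combining cost moves to compaction time, and obsolete versions simply never make it into the new HFile, so the deletes that were the bottleneck in our MR approach go away entirely.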
Friso

On May 27, 2010, at 3:18 PM, Jonathan Gray wrote:

> This is not currently on any road map as far as I know. But I do think
> it's interesting nonetheless.
>
> Piggybacking on compactions can be a good time to get some additional
> work done on your data, since we're already doing the work of reading
> and writing several HFiles.
>
> One concern is compaction performance. In HBase's architecture, overall
> performance can be significantly impacted by slow-running compactions.
>
> Another concern is that minor compactions do not always include all
> files of a region. That may limit what you can effectively do during a
> compaction, since you may not be seeing all of the data. This is not the
> case for major compactions, which always compact every file in a region.
>
> Friso, for your specific use case, what you are trying to do is evict
> older versions of data? I had a little bit of trouble understanding your
> schema. Or are you periodically taking a bunch of versions of a column
> and combining them into a single version/value? How many of these
> versions are you adding for each column? Is it really the case that read
> performance is unacceptable if the data is spread across multiple
> versions? One of the benefits of HBase is that these versions will be
> stored sequentially on disk, so reading multiple versions (within
> reason) should not be significantly slower than reading one.
>
> In any case, this is an interesting direction and I think it's worth
> exploring. As for how this would work, I'm not so sure yet. Perhaps
> building on Andrew's work with Coprocessors, RegionObservers, etc...
>
> JG
>
>> -----Original Message-----
>> From: Friso van Vollenhoven [mailto:[email protected]]
>> Sent: Thursday, May 27, 2010 1:34 AM
>> To: [email protected]
>> Subject: Re: Custom compaction
>>
>> Hi,
>>
>> Actually, for us it would be nice to be able to hook into the
>> compaction, too.
>>
>> We store records that are basically events that occur at certain
>> times. We store the record itself as the qualifier and a timeline as
>> the column value (so multiple records+timelines per row key are
>> possible). So when a new record comes in, we do a get for the
>> timeline, merge the new timestamp with the existing timeline in
>> memory, and do a put to update the column value with the new timeline.
>>
>> In our first version, we just wrote the individual timestamps as
>> values and used versioning to keep all timestamps in the value. Then
>> we combined all the timelines and individual timestamps into a single
>> timeline in memory on each read. We ran an MR job periodically to do
>> the timeline combining in the table and delete the obsolete timestamps
>> in order to keep read performance OK (because otherwise the read
>> operation would involve a lot of additional work to create a timeline,
>> and lots of versions would accumulate). In the end, the deletes in the
>> MR job were a bottleneck (as I understand it, but I was not on the
>> project at that moment).
>>
>> Now, if we could hook into the compactions, then we could just always
>> insert individual timestamps as new versions and do the combining of
>> versions into a single timeline during compaction (as compaction needs
>> to go through the complete table anyway). This would also improve our
>> insertion performance (no more gets in there, just puts, like in the
>> first version), which is nice. We collect internet routing
>> information, which comes in at 80 million records per day, with
>> updates arriving in batches every 5 minutes (http://ris.ripe.net).
>> We'd like to try to be efficient before just throwing more machines at
>> the problem.
>>
>> Will there be anything like this on the roadmap?
>>
>>
>> Cheers,
>> Friso
>>
>>
>>
>> On May 27, 2010, at 1:01 AM, Jean-Daniel Cryans wrote:
>>
>>> Invisible. What's your need?
>>>
>>> J-D
>>>
>>> On Wed, May 26, 2010 at 3:56 PM, Vidhyashankar Venkataraman
>>> <[email protected]> wrote:
>>>> Is there a way to customize the compaction function (like a hook
>>>> provided by the API) or is it invisible to the user?
>>>>
>>>> Thank you
>>>> Vidhya
