James,

It's evenly distributed, however... because it's a timestamp, it's a 'tail end charlie' addition: within each salt value, new rows always land at the end of the key range. So when you split a region, the first half (the older keys) is never added to again, and you end up with all regions half filled except for the last region in each 'modded' value.
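A toy model of that effect, for anyone who wants to see the numbers. This is only a sketch under simplifying assumptions, not HBase's actual split policy: it assumes keys within one salt value arrive in strictly increasing order, that a region splits at its midpoint once it holds `max_region_rows` rows, and it counts rows rather than bytes. The names `simulate_bucket` and `max_region_rows` are made up for the illustration.

```python
def simulate_bucket(num_rows, max_region_rows):
    """Append increasing keys to one salt bucket, splitting full regions in half.

    Returns the number of rows held by each region, in key order.
    """
    regions = [[]]  # regions in key order; each region is a sorted list of keys
    for key in range(num_rows):
        last = regions[-1]  # an increasing key always belongs to the last region
        last.append(key)
        if len(last) >= max_region_rows:
            mid = len(last) // 2
            regions[-1] = last[:mid]    # older half: never written to again
            regions.append(last[mid:])  # newer half: receives all future writes
    return [len(region) for region in regions]


sizes = simulate_bucket(num_rows=1020, max_region_rows=100)
print(sizes)
```

With these inputs the model ends up with 19 regions of 50 rows each, followed by a single tail region of 70 rows. With N salt values you get N such buckets, so N tail regions are still growing and everything else sits at half the maximum size.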
I wouldn't say it's a bad thing if you plan for it.

On Oct 21, 2013, at 5:07 PM, James Taylor <[email protected]> wrote:

> We don't truncate the hash, we mod it. Why would you expect that data
> wouldn't be evenly distributed? We've not seen this to be the case.
>
> On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <[email protected]> wrote:
>
>> What do you call hashing the row key?
>> Or hashing the row key and then appending the row key to the hash?
>> Or hashing the row key, truncating the hash value to some subset and then
>> appending the row key to the value?
>>
>> The problem is that there is specific meaning to the term salt. Re-using
>> it here will cause confusion because you're implying something you don't
>> mean to imply.
>>
>> you could say prepend a truncated hash of the key, however… is prepend a
>> real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major. )
>>
>> So even outside of Phoenix, the concept is the same.
>> Even with a truncated hash, you will find that over time, all but the tail
>> N regions will only be half full.
>> This could be both good and bad.
>>
>> (Where N is your number 8 or 16 allowable hash values.)
>>
>> You've solved potentially one problem… but still have other issues that
>> you need to address.
>> I guess the simple answer is to double the region sizes and not care that
>> most of your regions will be 1/2 the max size… but the size you really
>> want and 8-16 regions will be up to twice as big.
>>
>> On Oct 21, 2013, at 3:26 PM, James Taylor <[email protected]> wrote:
>>
>>> What do you think it should be called, because
>>> "prepending-row-key-with-single-hashed-byte" doesn't have a very good ring
>>> to it. :-)
>>>
>>> Agree that getting the row key design right is crucial.
>>>
>>> The range of "prepending-row-key-with-single-hashed-byte" is declarative
>>> when you create your table in Phoenix, so you typically declare an upper
>>> bound based on your cluster size (not 255, but maybe 8 or 16). We've run
>>> the numbers and it's typically faster, but as with most things, not always.
>>>
>>> HTH,
>>> James
>>>
>>> On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> Then it's not a SALT. And please don't use the term 'salt' because it has
>>>> specific meaning outside to what you want it to mean. Just like saying
>>>> HBase has ACID because you write the entire row as an atomic element. But
>>>> I digress….
>>>>
>>>> Ok so to your point…
>>>>
>>>> 1 byte == 255 possible values.
>>>>
>>>> So which will be faster.
>>>>
>>>> creating a list of the 1 byte truncated hash of each possible timestamp in
>>>> your range, or doing 255 separate range scans with the start and stop range
>>>> key set?
>>>>
>>>> That will give you the results you want, however… I'd go back and have
>>>> them possibly rethink the row key if they can … assuming this is the base
>>>> access pattern.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> On Oct 21, 2013, at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>
>>>>> Phoenix restricts salting to a single byte.
>>>>> Salting perhaps is misnamed, as the salt byte is a stable hash based on the
>>>>> row key.
>>>>> Phoenix's skip scan supports sub-key ranges.
>>>>> We've found salting in general to be faster (though there are cases where
>>>>> it's not), as it ensures better parallelization.
>>>>>
>>>>> Regards,
>>>>> James
>>>>>
>>>>> On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov <[email protected]> wrote:
>>>>>
>>>>>> FuzzyRowFilter does not work on sub-key ranges.
>>>>>> Salting is bad for any scan operation, unfortunately. When salt prefix
>>>>>> cardinality is small (1-2 bytes),
>>>>>> one can try something similar to FuzzyRowFilter but with additional
>>>>>> sub-key range support.
>>>>>> If salt prefix cardinality is high (> 2 bytes) - do a full scan with your
>>>>>> own Filter (for timestamp ranges).
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Rodionov
>>>>>> Principal Platform Engineer
>>>>>> Carrier IQ, www.carrieriq.com
>>>>>> e-mail: [email protected]
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Premal Shah [[email protected]]
>>>>>> Sent: Sunday, October 20, 2013 10:42 PM
>>>>>> To: user
>>>>>> Subject: Re: row filter - binary comparator at certain range
>>>>>>
>>>>>> Have you looked at FuzzyRowFilter? Seems to me that it might satisfy your
>>>>>> use-case.
>>>>>>
>>>>>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
>>>>>>
>>>>>> On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan <[email protected]> wrote:
>>>>>>
>>>>>>> Alex Vasilenko <aa.vasilenko@...> writes:
>>>>>>>
>>>>>>>> Lars,
>>>>>>>>
>>>>>>>> But how it will behave, when I have salt at the beginning of the key to
>>>>>>>> properly shard table across regions? Imagine row key of format
>>>>>>>> salt:timestamp and rows goes like this:
>>>>>>>> ...
>>>>>>>> 1:15
>>>>>>>> 1:16
>>>>>>>> 1:17
>>>>>>>> 1:23
>>>>>>>> 2:3
>>>>>>>> 2:5
>>>>>>>> 2:12
>>>>>>>> 2:15
>>>>>>>> 2:19
>>>>>>>> 2:25
>>>>>>>> ...
>>>>>>>>
>>>>>>>> And I want to find all rows, that has second part (timestamp) in range
>>>>>>>> 15-25. What startKey and endKey should be used?
>>>>>>>>
>>>>>>>> Alexandr Vasilenko
>>>>>>>> Web Developer
>>>>>>>> Skype:menterr
>>>>>>>> mob: +38097-611-45-99
>>>>>>>>
>>>>>>>> 2012/2/9 lars hofhansl <lhofhansl@...>
>>>>>>> Hi,
>>>>>>> Alexandr Vasilenko
>>>>>>> Have you ever resolved this issue? I am also facing this issue.
>>>>>>> I also want to implement this functionality.
>>>>>>> Imagine row key of format
>>>>>>> salt:timestamp and rows goes like this:
>>>>>>> ...
>>>>>>> 1:15
>>>>>>> 1:16
>>>>>>> 1:17
>>>>>>> 1:23
>>>>>>> 2:3
>>>>>>> 2:5
>>>>>>> 2:12
>>>>>>> 2:15
>>>>>>> 2:19
>>>>>>> 2:25
>>>>>>> ...
>>>>>>>
>>>>>>> And I want to find all rows, that has second part (timestamp) in range
>>>>>>> 15-25.
>>>>>>>
>>>>>>> Could you please tell me how you resolve this ?
>>>>>>> thanks in advance.
>>>>>>>
>>>>>>> Tony duan
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Premal Shah.
>>>>>>
>>>>>> Confidentiality Notice: The information contained in this message,
>>>>>> including any attachments hereto, may be confidential and is intended to be
>>>>>> read only by the individual or entity to whom this message is addressed. If
>>>>>> the reader of this message is not the intended recipient or an agent or
>>>>>> designee of the intended recipient, please note that any review, use,
>>>>>> disclosure or distribution of this message or its attachments, in any form,
>>>>>> is strictly prohibited. If you have received this message in error, please
>>>>>> immediately notify the sender and/or [email protected] and
>>>>>> delete or destroy any copy of this message and its attachments.
>>>>>>

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com

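For the original question at the bottom of the thread (row keys of the form salt:timestamp, find every row with a timestamp in 15-25), the approach discussed above is one range scan per salt value, each with its own start and stop key. Below is a minimal sketch of that idea in plain Python, with no HBase or Phoenix client involved. Everything in it is an assumption made for illustration: the bucket count `NUM_BUCKETS`, the MD5-based `salt_for` (this is not the hash Phoenix uses), the 8-byte big-endian timestamp encoding, and the sorted in-memory list standing in for a table.

```python
import bisect
import hashlib

NUM_BUCKETS = 4  # hypothetical upper bound, declared up front (e.g. 8 or 16 in the thread)


def salt_for(timestamp, num_buckets=NUM_BUCKETS):
    """Stable one-byte prefix: hash the key, then mod by the bucket count."""
    digest = hashlib.md5(str(timestamp).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def encode_ts(timestamp):
    """Fixed-width big-endian encoding so byte order matches numeric order."""
    return timestamp.to_bytes(8, "big")


def row_key(timestamp):
    return bytes([salt_for(timestamp)]) + encode_ts(timestamp)


def scan_ranges(start_ts, stop_ts, num_buckets=NUM_BUCKETS):
    """One (start_key, stop_key) pair per salt value; stop_key is exclusive."""
    return [
        (bytes([bucket]) + encode_ts(start_ts), bytes([bucket]) + encode_ts(stop_ts))
        for bucket in range(num_buckets)
    ]


# Stand-in for the table: row keys kept in sorted byte order.
table = sorted(row_key(ts) for ts in range(40))

found = []
for start_key, stop_key in scan_ranges(15, 26):  # 26 is exclusive, so 15-25 inclusive
    lo = bisect.bisect_left(table, start_key)
    hi = bisect.bisect_left(table, stop_key)
    found.extend(int.from_bytes(key[1:], "big") for key in table[lo:hi])

print(sorted(found))
```

Each scan only touches one bucket's contiguous slice of the key space, so the scans are independent and can run in parallel; the results come back grouped by bucket and need a merge if timestamp order matters. The cost is that the number of scans grows with the number of salt values, which is the trade-off argued about above.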