Yes, Qifan is correct. I forgot that the hash function must preserve order, and there is no way to detect collisions while doing the calculation, so my suggestion makes no sense. Maybe one should NOT use a long VARCHAR as a primary key in the first place :-)
Thanks,
Ming

-----Original Message-----
From: Qifan Chen [mailto:[email protected]]
Sent: Saturday, February 13, 2016 0:10
To: dev <[email protected]>
Subject: Re: how the SALT is calculated?

Hi Eric,

I am sure you know that finding a collision-free hash function for a data set is very hard :-(.

This is the ACM paper that I described briefly in a separate email:
http://dl.acm.org/citation.cfm?id=129623. Basically, we constructed minimal perfect hash functions (MPHF) for large data sets. Each key is mapped to an integer with no collisions, and the number of integers is exactly the size of the data set. The function itself is just an array of computed bits, almost minimal in size. It is trivial to produce an order-preserving hash function with an MPHF. The problem with using the research result commercially is going through the university.

For the long-key problem at hand, I wonder if you have thought about using compression.

Thanks --Qifan

On Fri, Feb 12, 2016 at 9:28 AM, Eric Owhadi <[email protected]> wrote:
> Hi Ming,
> Not sure what you are trying to implement, but I am going to guess a use case:
>
> Sometimes the primary key construct in Trafodion is long and contains strings with a large maximum character count. Given that these keys end up exploded and zero-padded in the HBase key, an optimization could consist of storing a hash of these long strings instead of the strings themselves, especially if we cannot benefit from keyed access.
>
> For this use, making the hash unique is key. I experimented with this idea using a 64-bit hash (applying hash2partfunc twice to make 64 bits) while loading a 170,000,000-row table, and got duplicates (hash collisions). So if your use case is along the same lines, please consider a hashing function wider than 64 bits. The hash code that is used for partitioning does not care about collisions, since it is just used for partitioning...
>
> Not sure if this helps,
> Regards,
> Eric
>
> -----Original Message-----
> From: Liu, Ming (Ming) [mailto:[email protected]]
> Sent: Friday, February 12, 2016 9:07 AM
> To: [email protected]
> Subject: Re: how the SALT is calculated?
>
> Thanks, Qifan,
>
> Following your hint, I found the ExHDPHash::eval() and the corresponding hash() functions, and I am trying to understand them.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Qifan Chen [mailto:[email protected]]
> Sent: Friday, February 12, 2016 21:32
> To: dev <[email protected]>
> Subject: Re: how the SALT is calculated?
>
> Hi Ming,
>
> In Trafodion,
>
> "salt using 8 partitions on A" is equivalent to "hash2partfunc(a for 8)".
>
> "salt using 16 partitions on (a,b)" is equivalent to "hash2partfunc(a,b for 16)".
>
> Thanks --Qifan
>
> On Fri, Feb 12, 2016 at 6:15 AM, Liu, Ming (Ming) <[email protected]> wrote:
>
> > Hi, all,
> >
> > I want to check the code that calculates the hash value for the _SALT_ column in Trafodion. Could anyone point me to the exact source code, that is, which file and which function does that? I tried for a while and could not find it.
> >
> > The goal is to write a function F such that F(all cluster key columns) => rowkey of the Trafodion table row.
> >
> > Thanks,
> > Ming
>
> --
> Regards, --Qifan

--
Regards, --Qifan
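[Editor's note] Eric's collision report above can be sanity-checked with the standard birthday approximation: among n keys hashed uniformly into b bits, the expected number of colliding pairs is roughly n^2 / 2^(b+1). A minimal Python sketch (the 170,000,000 figure is from the thread; everything else is illustrative):

```python
def expected_collisions(n_keys: int, hash_bits: int) -> float:
    """Birthday approximation: expected colliding pairs ~ n^2 / 2^(bits+1)."""
    return n_keys ** 2 / (2.0 * (2 ** hash_bits))

n = 170_000_000  # the table size Eric mentions

# With only 32 effective bits of hash, millions of collisions are expected;
# with 64 truly independent bits, even one collision would be very unlikely.
print(f"32-bit: {expected_collisions(n, 32):.2e} expected colliding pairs")
print(f"64-bit: {expected_collisions(n, 64):.2e} expected colliding pairs")
```

One hedged reading of these numbers: applying a 32-bit partitioning hash twice may not produce 64 independent bits, which would explain seeing duplicates at this scale; either way, a wider hash is the safer choice, as Eric recommends.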
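[Editor's note] Qifan's equivalence above can be pictured as: the _SALT_ value is a hash of the declared salt columns reduced modulo the partition count. The conceptual Python sketch below is NOT Trafodion's algorithm (the real computation is ExHDPHash::eval(), as found in the thread); crc32 is only a stand-in hash for illustration:

```python
import zlib

def salt_value(key_parts, num_partitions):
    # Conceptual model of "salt using N partitions on (a, b, ...)":
    # hash the salt columns together, then take the result modulo N.
    # zlib.crc32's second argument chains the running checksum across parts.
    h = 0
    for part in key_parts:
        h = zlib.crc32(str(part).encode(), h)
    return h % num_partitions

# The salt value always falls in [0, num_partitions) and is deterministic,
# which is what lets the engine recompute the rowkey prefix from the key.
print(salt_value(("customer_42",), 8))
```

The determinism is the point of Ming's question: given the full clustering key, the salt (and hence the HBase rowkey prefix) can always be recomputed.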
