Re: [Mono-dev] String.GetHashCode on Mac
David, We may have hit this before. See http://www.nabble.com/String.GetHashCode-Discussion.-to18496625ef1367.html#a18496625 I lost tack of this issue after that thread but maybe it will help someone else remember. -bill On Mon, Dec 1, 2008 at 9:25 AM, David Suarez [EMAIL PROTECTED] wrote: Hi, I've run into a problem with the String.GetHashCode method in macosx (10.5.3, intel). The point is I get the same hashcode for very different strings, apparently only taking into account the last 7 chars of the string. This happens with mono 2.0 installer. Consider this test case: using System; namespace hashtest { class Class2 { private static string[] testStrings2 = new string[] { something to worry about, ing to worry about, to worry about, orry about, y about, about }; [STAThread] static void Main(string[] args) { foreach (string s in testStrings2) PrintHash(s, s.GetHashCode()); } private static void PrintHash(string str, int hash) { Console.WriteLine( hash [{1}] line: [{0}], str, hash.ToString(X8)); } } } Running on mac, I get this result: hash [9DA57994] line: [something to worry about] hash [9DA57994] line: [ing to worry about] hash [9DA57994] line: [to worry about] hash [9DA57994] line: [orry about] hash [9DA57994] line: [y about] hash [1DA57994] line: [ about] It seems to me that the String.GetHashCode in String.cs is not getting called, because this method works fine (I tested separately). Is this happening to anybody else? Any clue on what code is being called on the mac for getting the hash code of strings? Thanks, David Is this happening to anybody else?? ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
[Mono-dev] String.GetHashCode on Mac
Hi, I've run into a problem with the String.GetHashCode method in macosx (10.5.3, intel). The point is I get the same hashcode for very different strings, apparently only taking into account the last 7 chars of the string. This happens with mono 2.0 installer. Consider this test case: using System; namespace hashtest { class Class2 { private static string[] testStrings2 = new string[] { something to worry about, ing to worry about, to worry about, orry about, y about, about }; [STAThread] static void Main(string[] args) { foreach (string s in testStrings2) PrintHash(s, s.GetHashCode()); } private static void PrintHash(string str, int hash) { Console.WriteLine( hash [{1}] line: [{0}], str, hash.ToString(X8)); } } } Running on mac, I get this result: hash [9DA57994] line: [something to worry about] hash [9DA57994] line: [ing to worry about] hash [9DA57994] line: [to worry about] hash [9DA57994] line: [orry about] hash [9DA57994] line: [y about] hash [1DA57994] line: [ about] It seems to me that the String.GetHashCode in String.cs is not getting called, because this method works fine (I tested separately). Is this happening to anybody else? Any clue on what code is being called on the mac for getting the hash code of strings? Thanks, David Is this happening to anybody else?? ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode Discussion.
Hello Andreas, Imho we should revert the change, especially as MS seems to also have an implementation with low collisions (and in fact this will be what people are expecting from a hashcode without any further explanation) we should do the same. Otherwise this might drive some implementations into huge perf problems. I reverted my changes for now to produce same bad/good results. However, I tried to port another implementation which seems to produce much more better results for both scenarios but it still needs more testing. I'll send a patch for review when I am done. Marek ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode Discussion.
Hey, With the recent talk on GetHashCode(), I was just taking a look at the code. I decided to try my hand at seeing what would happen performancewise if i made the function work with an int* as opposed to char*. Turns out i ended up with something which was 50% slower, but also had 50% less collisions on a rather large english dictionary dataset (17mb of words, probably a few dupes). Just in case it's of any use, i've attached it here, though i'd say there are much better ones out there. This was purely accidental ;) public unsafe static int GetHashCode(string s) { int h = 0; int rem = s.Length 3; fixed (char* c = s) { int* start = (int*)c; int* end = start + s.Length - rem; for (int* i = end; i = start; i--) h = (h 5) - h + *i; char* cc = c + s.Length - rem; for (char* i = cc + rem; i = cc; i--) h = (h 5) - h + *i; return h; } } On Mon, Jul 21, 2008 at 12:33 PM, Marek Safar [EMAIL PROTECTED] wrote: Hello Andreas, Imho we should revert the change, especially as MS seems to also have an implementation with low collisions (and in fact this will be what people are expecting from a hashcode without any further explanation) we should do the same. Otherwise this might drive some implementations into huge perf problems. I reverted my changes for now to produce same bad/good results. However, I tried to port another implementation which seems to produce much more better results for both scenarios but it still needs more testing. I'll send a patch for review when I am done. Marek ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
[Mono-dev] String.GetHashCode Discussion.
I have been asked to move this discussion to the e-mail list from IRC. Basically we (my company and I) have new unit tests errors in 2.0 that did not occur at 1.9. The errors were traced to the String.GetHashCode change. I had one of our interns (Mike) research the change and I wanted to share his findings with you. His final document s attached. I can also send the test case Mike used to generate his data. I believe it is too large to send to all of your inboxes. http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm (1:28:15 PM) holmes: Have there been any discussions about r106887 which was a change to String.GetHashCode? (1:31:54 PM) holmes: This change is causing some of our tests to fail. I will admit that our tests are relying on the sting hash code a bit too much. However after some investigation I am wondering if the new algorithm is generating a unique enough result. (I say that as respectfully as I can.) (1:34:33 PM) holmes: We conducted a short test where 58000 English string were tested and the new code only results in 60% unique results compared to the previous code's performance of 100% (1:38:36 PM) holmes: We have found the current algorithm only considers the last ~6 characters of a string, which may be the problem. That is the last thing I will say here I promise. (1:42:21 PM) kumpera: marek: how have you tested the distribution of your new string hashing algorithm? (1:42:54 PM) marek: kumpera: I used english dictionary (1:43:57 PM) marek: holmes: how big is your dictionary? (1:45:05 PM) holmes: marek: 58110 (1:46:12 PM) holmes: I think the real problem is not that a word hashes unique enough as a sentence producing a unique result. (1:46:56 PM) holmes: the last 6 characters define the whole algorithm. (or at least I think so) (1:56:50 PM) lupus: it's likely better to revert the change, string hashing is also supposed to be synced with the implementation in C in the runtime anyway -bill ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode Discussion.
Hello, Here are some data which hopefully bring some light to this topic. I didn't measure uniqueness of hashcodes as I consider it less important as range distribution. Enclosed you can find the result for you data with old and new string.GetHashCode implementation. Some numbers to look at: ** Old implementation Highest contentions: 2985, 1155, 1137 ** Current implementation Highest contentions: 236, 217, 210 Marek I have been asked to move this discussion to the e-mail list from IRC. Basically we (my company and I) have new unit tests errors in 2.0 that did not occur at 1.9. The errors were traced to the String.GetHashCode change. I had one of our interns (Mike) research the change and I wanted to share his findings with you. His final document s attached. I can also send the test case Mike used to generate his data. I believe it is too large to send to all of your inboxes. http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm (1:28:15 PM) holmes: Have there been any discussions about r106887 which was a change to String.GetHashCode? (1:31:54 PM) holmes: This change is causing some of our tests to fail. I will admit that our tests are relying on the sting hash code a bit too much. However after some investigation I am wondering if the new algorithm is generating a unique enough result. (I say that as respectfully as I can.) (1:34:33 PM) holmes: We conducted a short test where 58000 English string were tested and the new code only results in 60% unique results compared to the previous code's performance of 100% (1:38:36 PM) holmes: We have found the current algorithm only considers the last ~6 characters of a string, which may be the problem. That is the last thing I will say here I promise. (1:42:21 PM) kumpera: marek: how have you tested the distribution of your new string hashing algorithm? (1:42:54 PM) marek: kumpera: I used english dictionary (1:43:57 PM) marek: holmes: how big is your dictionary? (1:45:05 PM) holmes: marek: 58110 (1:46:12 PM) holmes: I think the real problem is not that a word hashes unique enough as a sentence producing a unique result. (1:46:56 PM) holmes: the last 6 characters define the whole algorithm. (or at least I think so) (1:56:50 PM) lupus: it's likely better to revert the change, string hashing is also supposed to be synced with the implementation in C in the runtime anyway -bill ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list Range from,Range To,Old Implementation,New Implementation -2147483647,-2143289343,35,42 -2143289343,-2139095039,44,36 -2139095039,-2134900735,47,66 -2134900735,-2130706431,98,71 -2130706431,-2126512127,32,42 -2126512127,-2122317823,85,62 -2122317823,-2118123519,51,73 -2118123519,-2113929215,44,60 -2113929215,-2109734911,44,45 -2109734911,-2105540607,26,127 -2105540607,-2101346303,43,88 -2101346303,-2097151999,42,83 -2097151999,-2092957695,52,55 -2092957695,-2088763391,31,26 -2088763391,-2084569087,39,38 -2084569087,-2080374783,34,47 -2080374783,-2076180479,31,63 -2076180479,-2071986175,38,21 -2071986175,-2067791871,40,32 -2067791871,-2063597567,28,29 -2063597567,-2059403263,51,18 -2059403263,-2055208959,33,26 -2055208959,-2051014655,29,46 -2051014655,-2046820351,26,128 -2046820351,-2042626047,35,48 -2042626047,-2038431743,43,54 -2038431743,-2034237439,26,52 -2034237439,-2030043135,39,38 -2030043135,-2025848831,44,46 -2025848831,-2021654527,45,29 -2021654527,-2017460223,31,79 -2017460223,-2013265919,39,79 -2013265919,-2009071615,65,96 -2009071615,-2004877311,61,77 -2004877311,-2000683007,69,35 -2000683007,-1996488703,58,57 -1996488703,-1992294399,58,66 -1992294399,-1988100095,25,51 -1988100095,-1983905791,38,47 -1983905791,-1979711487,30,70 -1979711487,-1975517183,38,49 -1975517183,-1971322879,29,30 -1971322879,-1967128575,32,39 -1967128575,-1962934271,83,48 -1962934271,-1958739967,43,59 -1958739967,-1954545663,26,42 -1954545663,-1950351359,26,33 -1950351359,-1946157055,26,68 -1946157055,-1941962751,26,32 -1941962751,-1937768447,24,50 -1937768447,-1933574143,38,44 -1933574143,-1929379839,33,29 -1929379839,-1925185535,49,92 -1925185535,-1920991231,35,43 -1920991231,-1916796927,31,38 -1916796927,-1912602623,30,29 -1912602623,-1908408319,33,40 -1908408319,-1904214015,22,54 -1904214015,-1900019711,37,43 -1900019711,-1895825407,92,33 -1895825407,-1891631103,41,65 -1891631103,-1887436799,72,72 -1887436799,-1883242495,70,61 -1883242495,-1879048191,52,100 -1879048191,-1874853887,47,115 -1874853887,-1870659583,19,119 -1870659583,-1866465279,58,45 -1866465279,-1862270975,39,49 -1862270975,-1858076671,44,56 -1858076671,-1853882367,63,41 -1853882367,-1849688063,53,33 -1849688063,-1845493759,42,87 -1845493759,-1841299455,30,34 -1841299455,-1837105151,32,34 -1837105151,-1832910847,48,217 -1832910847,-1828716543,41,59 -1828716543,-1824522239,26,42 -1824522239,-1820327935,30,99 -1820327935,-1816133631,24,27
Re: [Mono-dev] String.GetHashCode Discussion.
Hi, Where is this hashcode implementation taken from ? I don't think we should invent new ones. Zoltan 2008/7/17 Marek Safar [EMAIL PROTECTED]: Hello, Here are some data which hopefully bring some light to this topic. I didn't measure uniqueness of hashcodes as I consider it less important as range distribution. Enclosed you can find the result for you data with old and new string.GetHashCode implementation. Some numbers to look at: ** Old implementation Highest contentions: 2985, 1155, 1137 ** Current implementation Highest contentions: 236, 217, 210 Marek I have been asked to move this discussion to the e-mail list from IRC. Basically we (my company and I) have new unit tests errors in 2.0 that did not occur at 1.9. The errors were traced to the String.GetHashCode change. I had one of our interns (Mike) research the change and I wanted to share his findings with you. His final document s attached. I can also send the test case Mike used to generate his data. I believe it is too large to send to all of your inboxes. http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm (1:28:15 PM) holmes: Have there been any discussions about r106887 which was a change to String.GetHashCode? (1:31:54 PM) holmes: This change is causing some of our tests to fail. I will admit that our tests are relying on the sting hash code a bit too much. However after some investigation I am wondering if the new algorithm is generating a unique enough result. (I say that as respectfully as I can.) (1:34:33 PM) holmes: We conducted a short test where 58000 English string were tested and the new code only results in 60% unique results compared to the previous code's performance of 100% (1:38:36 PM) holmes: We have found the current algorithm only considers the last ~6 characters of a string, which may be the problem. That is the last thing I will say here I promise. (1:42:21 PM) kumpera: marek: how have you tested the distribution of your new string hashing algorithm? (1:42:54 PM) marek: kumpera: I used english dictionary (1:43:57 PM) marek: holmes: how big is your dictionary? (1:45:05 PM) holmes: marek: 58110 (1:46:12 PM) holmes: I think the real problem is not that a word hashes unique enough as a sentence producing a unique result. (1:46:56 PM) holmes: the last 6 characters define the whole algorithm. (or at least I think so) (1:56:50 PM) lupus: it's likely better to revert the change, string hashing is also supposed to be synced with the implementation in C in the runtime anyway -bill ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
On 12/01/07 Robert Jordan wrote: Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? This idea already came up a couple of years ago (2002 or so). [...] Therefore, the optimization should consider the string length and append the hash code after the string's bytes only if its length is greater than some amount (1K or so). I'll reply to this mail and try to summarize all the responses and the constraints we need to obey. 1) we can't change the C-side MonoString struct, because it is part of the Mono ABI 2) we should try to not increase memory usage 3) we should try to not slowdown the case were the hashcode is calculated for the first time on a string instance 4) we can't use utf8 to hold string chars because this would break the mscorlib/CLI ABI (see how fixed on a string is compiled by the C# compiler, for example) 5) GetHashCode should never be called for a string that is not yet fully built (like in StringBuilder), so there is no worry aout the string changing after the hash code field has been set Point 1 means that if we were to add a new field to the string it can only be added at the end, after the character data (and the terminating null char). This also means that if we use an int, we have to be careful and properly align it (this also means more instructions to access this field). About memory usage: objects in Mono are always aligned on 8 byte boundaries so in some cases the overhead is 50%, 33%, 25%, 20% (for strings up to 15 chars) and sometimes it is 0. I would say that, as a mean, the memory overhead is 15% for up to 20 chars, 5% for up to 100 chars and it doesn't matter above that. After having written the moving GC, where fast computing of an object's size is important, I would also try to avoid allocating the extra field only for big strings (this would also slowdown string allocation somewhat). So, given the above constraints and considerations, I suggest the following: 1) wait for the moving GC to become the default: in that case we can use the same mechanism used to store object hash codes in the object header: this means no extra memory usage for strings and just a small slowdown when inserting the hashcode in the header (it needs a locked op). Implementing this mechanism with the current GC would slowdown a bit some common operations (lock/unlock), with the moving GC we would take the hit anyway. 2) in alternative, use a 16 bit hash code field: the memory overhead is smaller (5-10% for up to 20 chars, 2-3% for up to 100 chars on average), it requires no alignment calculation so it's fast to access. The drawback is that for very big hashtables ( 100 K entries) we'd have few bits. A cheap way to increase bits would be to or the hash with the first char shifted left 16 bits (but this wouldn't solve the issue for strings all beginning with the same char). I'm sure this drawback is not very significant, but I'm also sure someone will write a specific benchmark to exploit this weakness and whine in some blog or on the list about it:) 2 is very simple to implement so we could easily get some numbers (hint, hint). lupus -- - [EMAIL PROTECTED] debian/rules [EMAIL PROTECTED] Monkeys do it better ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Alan McGovern wrote: A thought struck me while i was dozing on the plane on the way home from the summit. Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? This idea already came up a couple of years ago (2002 or so). If memory serves (I was not able find the thread in the archives), the drawback was the additional gint32 that every MonoString* will get, regardless how small it is. Someone (Paolo?) pointed out that adding this field will increase the average string size by some unacceptable amount. Therefore, the optimization should consider the string length and append the hash code after the string's bytes only if its length is greater than some amount (1K or so). Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
(Woops, only replied to kamil) If Jerome is right and the overhead is only 4 bytes, then overhead shouldn't be a problem at all. The worst case size of a string would be 1 character, of 2 bytes + something to end it with, like an int containing its length, 2 bytes, or a terminating character, probably 2 bytes too. Making it at least 4 bytes. A worst case scenario of 100% extra data, that isn't so bad right? It's way less than the overhead of using unicode, since you almost never use the enormous range of character 2-3 bytes offers you. Speaking of unicode/utf-16 and memory, couldn't a lot of memory be spared if strings where stored as utf8 internally, which would be converted back to utf-16 when more than 256 different characters would be used? All you would need is a small translation table, and the view instructions it takes to look a char up in it. Or is this already being done? Just curious. Tinco Date: Fri, 30 Nov 2007 19:10:07 -0800 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: mono-devel-list@lists.ximian.com Subject: Re: [Mono-dev] String.GetHashCode() Extra memory cost, which would hit all allocated strings, also those short ones. For some applications, which use millions of small strings this would be unacceptable hit. 2007/11/30, Alan McGovern [EMAIL PROTECTED]: A thought struck me while i was dozing on the plane on the way home from the summit. Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? Alan. ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list -- Kamil Skalski http://nazgul.omega.pl ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list Express yourself instantly with MSN Messenger! MSN Messenger _ Express yourself instantly with MSN Messenger! Download today it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Also, just looking at the string source a bit more closely, it has a GetCaseInsensitiveHashcode method too, so i'd assume that would need to be cached too which would mean 8 bytes would be needed. This wouldn't scale well. Fair enough. Twas just an idea. Alan. On Dec 1, 2007 4:09 PM, Robert Jordan [EMAIL PROTECTED] wrote: Tinco Andringa wrote: (Woops, only replied to kamil) If Jerome is right and the overhead is only 4 bytes, then overhead shouldn't be a problem at all. The worst case size of a string would be 1 character, of 2 bytes + something to end it with, like an int containing its length, 2 bytes, or a terminating character, probably 2 bytes too. Making it at least 4 bytes. A worst case scenario of Look at a heavy string consumer: [g]mcs. The average string length it has to process is probably only 4-5 chars long. That's roundabout 12 bytes. Adding 4 bytes for the hash code is a huge overhead that only pays out if GetHashCode is called frequently, but this is definitely not a common scenario for most of the strings. Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Hi Alan, Alan McGovern wrote: Also, just looking at the string source a bit more closely, it has a GetCaseInsensitiveHashcode method too, so i'd assume that would need to be cached too which would mean 8 bytes would be needed. This wouldn't scale well. Fair enough. Twas just an idea. nah, don't give up too early! :) If the string is longer than n, you could allocate 4 or even 8 additional bytes at the end of the array for lazily computed hash codes. Open questions: - What's the optimal n for all common application classes mono is used for? :) It looks like it should be dynamically computed adjusted based on profiling data collected at runtime. - What's the real gain of this optimization? If n tends to be really large (say 1-4KB), what's the probability of such large strings being used as a hash key? This question can be reduced to how frequently is s.GetHashCode called, where s.Length n. In mono's class libs sources there is only one place (I know of) where large strings are potentially hashed: Sys.Web's cache management. Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Well, 'N' could be computed by asking what percent bloat is 'OK' or it could be computed by asking 'What % speedup do we want to make this a worthwhile tradeoff'. Currently a string has these two fields, an int and a char. Add in the object overhead of (i think) 12 bytes per object, you're up to 18 bytes straight off for a zero length string. Lets assume that you're only going to precalculate the hashcode for strings greater than length 4. So, take the case of a string of length 5. This puts the size of the string at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15% extra memory usage. As string length goes up, this % goes down. I'm unsure what % speedup precalculating the hashcode would result in, but i'm sure someone could run a few benchmarks if they have the time free. You could say that you want at least a 5x or maybe 10x improvement before it becomes worthwhile, this would then put a lower bound on the string length to make it worthwhile. Alan. On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED] wrote: Hi Alan, Alan McGovern wrote: Also, just looking at the string source a bit more closely, it has a GetCaseInsensitiveHashcode method too, so i'd assume that would need to be cached too which would mean 8 bytes would be needed. This wouldn't scale well. Fair enough. Twas just an idea. nah, don't give up too early! :) If the string is longer than n, you could allocate 4 or even 8 additional bytes at the end of the array for lazily computed hash codes. Open questions: - What's the optimal n for all common application classes mono is used for? :) It looks like it should be dynamically computed adjusted based on profiling data collected at runtime. - What's the real gain of this optimization? If n tends to be really large (say 1-4KB), what's the probability of such large strings being used as a hash key? This question can be reduced to how frequently is s.GetHashCode called, where s.Length n. In mono's class libs sources there is only one place (I know of) where large strings are potentially hashed: Sys.Web's cache management. Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Also, worst case scenario for a zero length string would mean a 22% increase, not a 100% increase as was said before. Alan. On Dec 1, 2007 6:01 PM, Alan McGovern [EMAIL PROTECTED] wrote: Well, 'N' could be computed by asking what percent bloat is 'OK' or it could be computed by asking 'What % speedup do we want to make this a worthwhile tradeoff'. Currently a string has these two fields, an int and a char. Add in the object overhead of (i think) 12 bytes per object, you're up to 18 bytes straight off for a zero length string. Lets assume that you're only going to precalculate the hashcode for strings greater than length 4. So, take the case of a string of length 5. This puts the size of the string at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15% extra memory usage. As string length goes up, this % goes down. I'm unsure what % speedup precalculating the hashcode would result in, but i'm sure someone could run a few benchmarks if they have the time free. You could say that you want at least a 5x or maybe 10x improvement before it becomes worthwhile, this would then put a lower bound on the string length to make it worthwhile. Alan. On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED] wrote: Hi Alan, Alan McGovern wrote: Also, just looking at the string source a bit more closely, it has a GetCaseInsensitiveHashcode method too, so i'd assume that would need to be cached too which would mean 8 bytes would be needed. This wouldn't scale well. Fair enough. Twas just an idea. nah, don't give up too early! :) If the string is longer than n, you could allocate 4 or even 8 additional bytes at the end of the array for lazily computed hash codes. Open questions: - What's the optimal n for all common application classes mono is used for? :) It looks like it should be dynamically computed adjusted based on profiling data collected at runtime. - What's the real gain of this optimization? If n tends to be really large (say 1-4KB), what's the probability of such large strings being used as a hash key? This question can be reduced to how frequently is s.GetHashCode called, where s.Length n. In mono's class libs sources there is only one place (I know of) where large strings are potentially hashed: Sys.Web's cache management. Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
It doesn't make much sense to pre-compute the hashcode for a string whose length is not greater than the cacheline size. Anything that fits into the cacheline and is not memory bound is plenty fast enough. For the same reason, the size and hashcode of a string would need to be juxtaposed in memory to load both on first hit. Otherwise, the benefits are again wiped out due to the high cost of loading data from memory into the cache. If optimization at that level isn't interesting (b/c we talking nanoseconds in this realm), then the bound for N will go up quite quickly before a cached hashcode makes sense. And as Robert so insightfully pointed out, strings for which a pre-computed hashcode might make sense may never be used as keys. Hence, the complexity increase in th string code -- and therefore quality control headache -- introduced by it might not be worth the effort. Just some assembly hacker thoughts on the subject... :) - Steve -- Steve G. Bjorg http://wiki.mindtouch.com http://wiki.opengarden.org On Dec 1, 2007, at 10:02 AM, Alan McGovern wrote: Also, worst case scenario for a zero length string would mean a 22% increase, not a 100% increase as was said before. Alan. On Dec 1, 2007 6:01 PM, Alan McGovern [EMAIL PROTECTED] wrote: Well, 'N' could be computed by asking what percent bloat is 'OK' or it could be computed by asking 'What % speedup do we want to make this a worthwhile tradeoff'. Currently a string has these two fields, an int and a char. Add in the object overhead of (i think) 12 bytes per object, you're up to 18 bytes straight off for a zero length string. Lets assume that you're only going to precalculate the hashcode for strings greater than length 4. So, take the case of a string of length 5. This puts the size of the string at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15% extra memory usage. As string length goes up, this % goes down. I'm unsure what % speedup precalculating the hashcode would result in, but i'm sure someone could run a few benchmarks if they have the time free. You could say that you want at least a 5x or maybe 10x improvement before it becomes worthwhile, this would then put a lower bound on the string length to make it worthwhile. Alan. On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED] wrote: Hi Alan, Alan McGovern wrote: Also, just looking at the string source a bit more closely, it has a GetCaseInsensitiveHashcode method too, so i'd assume that would need to be cached too which would mean 8 bytes would be needed. This wouldn't scale well. Fair enough. Twas just an idea. nah, don't give up too early! :) If the string is longer than n, you could allocate 4 or even 8 additional bytes at the end of the array for lazily computed hash codes. Open questions: - What's the optimal n for all common application classes mono is used for? :) It looks like it should be dynamically computed adjusted based on profiling data collected at runtime. - What's the real gain of this optimization? If n tends to be really large (say 1-4KB), what's the probability of such large strings being used as a hash key? This question can be reduced to how frequently is s.GetHashCode called, where s.Length n. In mono's class libs sources there is only one place (I know of) where large strings are potentially hashed: Sys.Web's cache management. Robert ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
On Sat, 2007-12-01 at 18:01 +, Alan McGovern wrote: Well, 'N' could be computed by asking what percent bloat is 'OK' or it could be computed by asking 'What % speedup do we want to make this a worthwhile tradeoff'. Wouldn't be this a good case for desktop vs server runtime mode? Which Mono already supports (mono --desktop) but not really differentiates AFAIK. -- Regards, Mirco 'meebey' Bauer PGP-Key ID: 0xEEF946C8 FOSS Developer[EMAIL PROTECTED] http://www.meebey.net/ PEAR Developer[EMAIL PROTECTED] http://pear.php.net/ Debian Developer [EMAIL PROTECTED] http://www.debian.org/ signature.asc Description: This is a digitally signed message part ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
[Mono-dev] String.GetHashCode()
A thought struck me while i was dozing on the plane on the way home from the summit. Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? Alan. ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
4 bytes of storage for every string on the system at all times? Dunno. Add it up I guess. On Sat, 2007-12-01 at 01:23 +, Alan McGovern wrote: A thought struck me while i was dozing on the plane on the way home from the summit. Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? Alan. ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] String.GetHashCode()
Extra memory cost, which would hit all allocated strings, also those short ones. For some applications, which use millions of small strings this would be unacceptable hit. 2007/11/30, Alan McGovern [EMAIL PROTECTED]: A thought struck me while i was dozing on the plane on the way home from the summit. Since strings are immutable, shouldn't it be possible to compute the hashcode once and store it rather than recomputing it over and over again? Is there some really obvious reason that i can't think of that would make this a bad idea? Alan. ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list -- Kamil Skalski http://nazgul.omega.pl ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list