Re: [Mono-dev] String.GetHashCode on Mac

2008-12-02 Thread Bill Holmes
David,

We may have hit this before.  See
http://www.nabble.com/String.GetHashCode-Discussion.-to18496625ef1367.html#a18496625

I lost tack of this issue after that thread but maybe it will help
someone else remember.

-bill

On Mon, Dec 1, 2008 at 9:25 AM, David Suarez
[EMAIL PROTECTED] wrote:
 Hi,

 I've run into a problem with the String.GetHashCode method in macosx
 (10.5.3, intel). The point is I get the same hashcode for very different
 strings, apparently only taking into account the last 7 chars of the
 string. This happens with mono 2.0 installer.

 Consider this test case:

 using System;

 namespace hashtest
 {
class Class2
{
 private static string[] testStrings2 = new string[]
{
something to worry about,
ing to worry about,
to worry about,
orry about,
y about,
 about
};

[STAThread]
static void Main(string[] args)
{
foreach (string s in testStrings2) PrintHash(s,
 s.GetHashCode());
}

private static void PrintHash(string str, int hash)
{
Console.WriteLine(  hash [{1}] line: [{0}], str,
  hash.ToString(X8));
}
}
 }

 Running on mac, I get this result:

  hash [9DA57994] line: [something to worry about]
  hash [9DA57994] line: [ing to worry about]
  hash [9DA57994] line: [to worry about]
  hash [9DA57994] line: [orry about]
  hash [9DA57994] line: [y about]
  hash [1DA57994] line: [ about]

 It seems to me that the String.GetHashCode in String.cs is not getting
 called, because this method works fine (I tested separately). Is this
 happening to anybody else? Any clue on what code is being called on the
 mac for getting the hash code of strings?

 Thanks,

 David

 Is this happening to anybody else??

 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


[Mono-dev] String.GetHashCode on Mac

2008-12-01 Thread David Suarez
Hi,

I've run into a problem with the String.GetHashCode method in macosx 
(10.5.3, intel). The point is I get the same hashcode for very different 
strings, apparently only taking into account the last 7 chars of the 
string. This happens with mono 2.0 installer.

Consider this test case:

using System;

namespace hashtest
{
class Class2
{
 private static string[] testStrings2 = new string[]
{
something to worry about,
ing to worry about,
to worry about,
orry about,
y about,
 about
};

[STAThread]
static void Main(string[] args)
{
foreach (string s in testStrings2) PrintHash(s, 
s.GetHashCode());
}

private static void PrintHash(string str, int hash)
{
Console.WriteLine(  hash [{1}] line: [{0}], str,
  hash.ToString(X8));
}
}
}

Running on mac, I get this result:

  hash [9DA57994] line: [something to worry about]
  hash [9DA57994] line: [ing to worry about]
  hash [9DA57994] line: [to worry about]
  hash [9DA57994] line: [orry about]
  hash [9DA57994] line: [y about]
  hash [1DA57994] line: [ about]

It seems to me that the String.GetHashCode in String.cs is not getting 
called, because this method works fine (I tested separately). Is this 
happening to anybody else? Any clue on what code is being called on the 
mac for getting the hash code of strings?

Thanks,

David

Is this happening to anybody else??

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode Discussion.

2008-07-21 Thread Marek Safar
Hello Andreas,
 Imho we should revert the change, especially as MS seems to also have an
 implementation with low collisions (and in fact this will be what people are
 expecting from a hashcode without any further explanation) we should do the
 same. Otherwise this might drive some implementations into huge perf
 problems.
   
I reverted my changes for now to produce same bad/good results. However, 
I tried to port another implementation which seems to produce much more 
better results for both scenarios but it still needs more testing. I'll 
send a patch for review when I am done.

Marek
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode Discussion.

2008-07-21 Thread Alan McGovern
Hey,

With the recent talk on GetHashCode(), I was just taking a look at the code.
I decided to try my hand at seeing what would happen performancewise if i
made the function work with an int* as opposed to char*. Turns out i ended
up with something which was 50% slower, but also had 50% less collisions on
a rather large english dictionary dataset (17mb of words, probably a few
dupes).

Just in case it's of any use, i've attached it here, though i'd say there
are much better ones out there. This was purely accidental ;)


public unsafe static int GetHashCode(string s)
{
int h = 0;
int rem = s.Length  3;
fixed (char* c = s)
{
int* start = (int*)c;
int* end = start + s.Length - rem;

for (int* i = end; i = start; i--)
h = (h  5) - h + *i;

char* cc = c + s.Length - rem;
for (char* i = cc + rem; i = cc; i--)
h = (h  5) - h + *i;

return h;
}
}

On Mon, Jul 21, 2008 at 12:33 PM, Marek Safar [EMAIL PROTECTED] wrote:

 Hello Andreas,
  Imho we should revert the change, especially as MS seems to also have an
  implementation with low collisions (and in fact this will be what people
 are
  expecting from a hashcode without any further explanation) we should do
 the
  same. Otherwise this might drive some implementations into huge perf
  problems.
 
 I reverted my changes for now to produce same bad/good results. However,
 I tried to port another implementation which seems to produce much more
 better results for both scenarios but it still needs more testing. I'll
 send a patch for review when I am done.

 Marek
 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


[Mono-dev] String.GetHashCode Discussion.

2008-07-16 Thread Bill Holmes
I have been asked to move this discussion to the e-mail list from IRC.
 Basically we (my company and I) have new unit tests errors in 2.0
that did not occur at 1.9.  The errors were traced to the
String.GetHashCode change.  I had one of our interns (Mike) research
the change and I wanted to share his findings with you.  His final
document s attached.  I can also send the test case Mike used to
generate his data.  I believe it is too large to send to all of your
inboxes.

http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm

(1:28:15 PM) holmes: Have there been any discussions about r106887
which was a change to String.GetHashCode?
(1:31:54 PM) holmes: This change is causing some of our tests to fail.
 I will admit that our tests are relying on the sting hash code a bit
too much.  However after some investigation I am wondering if the new
algorithm is generating a unique enough result.  (I say that as
respectfully as I can.)
(1:34:33 PM) holmes: We conducted a short test where 58000 English
string were tested and the new code only results in 60% unique results
compared to the previous code's performance of 100%
(1:38:36 PM) holmes: We have found the current algorithm only
considers the last ~6 characters of a string, which may be the
problem.  That is the last thing I will say here I promise.
(1:42:21 PM) kumpera: marek: how have you tested the distribution of
your new string hashing algorithm?
(1:42:54 PM) marek: kumpera: I used english dictionary
(1:43:57 PM) marek: holmes: how big is your dictionary?
(1:45:05 PM) holmes: marek: 58110
(1:46:12 PM) holmes: I think the real problem is not that a word
hashes unique enough as a sentence producing a unique result.
(1:46:56 PM) holmes: the last 6 characters define the whole algorithm.
 (or at least I think so)
(1:56:50 PM) lupus: it's likely better to revert the change, string
hashing is also supposed to be synced with the implementation in C in
the runtime anyway

-bill
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode Discussion.

2008-07-16 Thread Marek Safar

Hello,

Here are some data which hopefully bring some light to this topic. I 
didn't measure uniqueness of hashcodes as I consider it less important 
as range distribution. Enclosed you can find the result for you data 
with old and new string.GetHashCode implementation.


Some numbers to look at:

** Old implementation

Highest contentions: 2985, 1155, 1137

** Current implementation

Highest contentions: 236, 217, 210

Marek



I have been asked to move this discussion to the e-mail list from IRC.
 Basically we (my company and I) have new unit tests errors in 2.0
that did not occur at 1.9.  The errors were traced to the
String.GetHashCode change.  I had one of our interns (Mike) research
the change and I wanted to share his findings with you.  His final
document s attached.  I can also send the test case Mike used to
generate his data.  I believe it is too large to send to all of your
inboxes.

http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm

(1:28:15 PM) holmes: Have there been any discussions about r106887
which was a change to String.GetHashCode?
(1:31:54 PM) holmes: This change is causing some of our tests to fail.
 I will admit that our tests are relying on the sting hash code a bit
too much.  However after some investigation I am wondering if the new
algorithm is generating a unique enough result.  (I say that as
respectfully as I can.)
(1:34:33 PM) holmes: We conducted a short test where 58000 English
string were tested and the new code only results in 60% unique results
compared to the previous code's performance of 100%
(1:38:36 PM) holmes: We have found the current algorithm only
considers the last ~6 characters of a string, which may be the
problem.  That is the last thing I will say here I promise.
(1:42:21 PM) kumpera: marek: how have you tested the distribution of
your new string hashing algorithm?
(1:42:54 PM) marek: kumpera: I used english dictionary
(1:43:57 PM) marek: holmes: how big is your dictionary?
(1:45:05 PM) holmes: marek: 58110
(1:46:12 PM) holmes: I think the real problem is not that a word
hashes unique enough as a sentence producing a unique result.
(1:46:56 PM) holmes: the last 6 characters define the whole algorithm.
 (or at least I think so)
(1:56:50 PM) lupus: it's likely better to revert the change, string
hashing is also supposed to be synced with the implementation in C in
the runtime anyway

-bill
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list

  


Range from,Range To,Old Implementation,New Implementation
-2147483647,-2143289343,35,42
-2143289343,-2139095039,44,36
-2139095039,-2134900735,47,66
-2134900735,-2130706431,98,71
-2130706431,-2126512127,32,42
-2126512127,-2122317823,85,62
-2122317823,-2118123519,51,73
-2118123519,-2113929215,44,60
-2113929215,-2109734911,44,45
-2109734911,-2105540607,26,127
-2105540607,-2101346303,43,88
-2101346303,-2097151999,42,83
-2097151999,-2092957695,52,55
-2092957695,-2088763391,31,26
-2088763391,-2084569087,39,38
-2084569087,-2080374783,34,47
-2080374783,-2076180479,31,63
-2076180479,-2071986175,38,21
-2071986175,-2067791871,40,32
-2067791871,-2063597567,28,29
-2063597567,-2059403263,51,18
-2059403263,-2055208959,33,26
-2055208959,-2051014655,29,46
-2051014655,-2046820351,26,128
-2046820351,-2042626047,35,48
-2042626047,-2038431743,43,54
-2038431743,-2034237439,26,52
-2034237439,-2030043135,39,38
-2030043135,-2025848831,44,46
-2025848831,-2021654527,45,29
-2021654527,-2017460223,31,79
-2017460223,-2013265919,39,79
-2013265919,-2009071615,65,96
-2009071615,-2004877311,61,77
-2004877311,-2000683007,69,35
-2000683007,-1996488703,58,57
-1996488703,-1992294399,58,66
-1992294399,-1988100095,25,51
-1988100095,-1983905791,38,47
-1983905791,-1979711487,30,70
-1979711487,-1975517183,38,49
-1975517183,-1971322879,29,30
-1971322879,-1967128575,32,39
-1967128575,-1962934271,83,48
-1962934271,-1958739967,43,59
-1958739967,-1954545663,26,42
-1954545663,-1950351359,26,33
-1950351359,-1946157055,26,68
-1946157055,-1941962751,26,32
-1941962751,-1937768447,24,50
-1937768447,-1933574143,38,44
-1933574143,-1929379839,33,29
-1929379839,-1925185535,49,92
-1925185535,-1920991231,35,43
-1920991231,-1916796927,31,38
-1916796927,-1912602623,30,29
-1912602623,-1908408319,33,40
-1908408319,-1904214015,22,54
-1904214015,-1900019711,37,43
-1900019711,-1895825407,92,33
-1895825407,-1891631103,41,65
-1891631103,-1887436799,72,72
-1887436799,-1883242495,70,61
-1883242495,-1879048191,52,100
-1879048191,-1874853887,47,115
-1874853887,-1870659583,19,119
-1870659583,-1866465279,58,45
-1866465279,-1862270975,39,49
-1862270975,-1858076671,44,56
-1858076671,-1853882367,63,41
-1853882367,-1849688063,53,33
-1849688063,-1845493759,42,87
-1845493759,-1841299455,30,34
-1841299455,-1837105151,32,34
-1837105151,-1832910847,48,217
-1832910847,-1828716543,41,59
-1828716543,-1824522239,26,42
-1824522239,-1820327935,30,99
-1820327935,-1816133631,24,27

Re: [Mono-dev] String.GetHashCode Discussion.

2008-07-16 Thread Zoltan Varga
 Hi,

  Where is this hashcode implementation taken from ? I don't think we
should invent new ones.

   Zoltan

2008/7/17 Marek Safar [EMAIL PROTECTED]:
 Hello,

 Here are some data which hopefully bring some light to this topic. I didn't
 measure uniqueness of hashcodes as I consider it less important as range
 distribution. Enclosed you can find the result for you data with old and new
 string.GetHashCode implementation.

 Some numbers to look at:

 ** Old implementation

 Highest contentions: 2985, 1155, 1137

 ** Current implementation

 Highest contentions: 236, 217, 210

 Marek


 I have been asked to move this discussion to the e-mail list from IRC.
  Basically we (my company and I) have new unit tests errors in 2.0
 that did not occur at 1.9.  The errors were traced to the
 String.GetHashCode change.  I had one of our interns (Mike) research
 the change and I wanted to share his findings with you.  His final
 document s attached.  I can also send the test case Mike used to
 generate his data.  I believe it is too large to send to all of your
 inboxes.

 http://docs.google.com/Doc?id=dhnnms3p_3fpmfq8dm

 (1:28:15 PM) holmes: Have there been any discussions about r106887
 which was a change to String.GetHashCode?
 (1:31:54 PM) holmes: This change is causing some of our tests to fail.
  I will admit that our tests are relying on the sting hash code a bit
 too much.  However after some investigation I am wondering if the new
 algorithm is generating a unique enough result.  (I say that as
 respectfully as I can.)
 (1:34:33 PM) holmes: We conducted a short test where 58000 English
 string were tested and the new code only results in 60% unique results
 compared to the previous code's performance of 100%
 (1:38:36 PM) holmes: We have found the current algorithm only
 considers the last ~6 characters of a string, which may be the
 problem.  That is the last thing I will say here I promise.
 (1:42:21 PM) kumpera: marek: how have you tested the distribution of
 your new string hashing algorithm?
 (1:42:54 PM) marek: kumpera: I used english dictionary
 (1:43:57 PM) marek: holmes: how big is your dictionary?
 (1:45:05 PM) holmes: marek: 58110
 (1:46:12 PM) holmes: I think the real problem is not that a word
 hashes unique enough as a sentence producing a unique result.
 (1:46:56 PM) holmes: the last 6 characters define the whole algorithm.
  (or at least I think so)
 (1:56:50 PM) lupus: it's likely better to revert the change, string
 hashing is also supposed to be synced with the implementation in C in
 the runtime anyway

 -bill
 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list




 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list


___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-03 Thread Paolo Molaro
On 12/01/07 Robert Jordan wrote:
  Since strings are immutable, shouldn't it be possible to compute the
  hashcode once and store it rather than recomputing it over and over again?
  Is there some really obvious reason that i can't think of that would make
  this a bad idea?
 
 This idea already came up a couple of years ago (2002 or so).
[...]
 Therefore, the optimization should consider the string length
 and append the hash code after the string's bytes
 only if its length is greater than some amount (1K or so).

I'll reply to this mail and try to summarize all the responses and
the constraints we need to obey.

1) we can't change the C-side MonoString struct, because it is part of
the Mono ABI
2) we should try to not increase memory usage
3) we should try to not slowdown the case were the hashcode is
calculated for the first time on a string instance
4) we can't use utf8 to hold string chars because this would break
the mscorlib/CLI ABI (see how fixed on a string is compiled by the C#
compiler, for example)
5) GetHashCode should never be called for a string that is not yet
fully built (like in StringBuilder), so there is no worry aout the
string changing after the hash code field has been set

Point 1 means that if we were to add a new field to the string it can
only be added at the end, after the character data (and the terminating
null char). This also means that if we use an int, we have to be careful
and properly align it (this also means more instructions to access this
field).

About memory usage: objects in Mono are always aligned on 8 byte
boundaries so in some cases the overhead is 50%, 33%, 25%, 20%
(for strings up to 15 chars) and sometimes it is 0. I would
say that, as a mean, the memory overhead is 15% for up to 20 chars,
5% for up to 100 chars and it doesn't matter above that.

After having written the moving GC, where fast computing of an object's
size is important, I would also try to avoid allocating the extra field
only for big strings (this would also slowdown string allocation
somewhat).

So, given the above constraints and considerations, I suggest the
following:
1) wait for the moving GC to become the default: in that case we can use
the same mechanism used to store object hash codes in the object header:
this means no extra memory usage for strings and just a small slowdown
when inserting the hashcode in the header (it needs a locked op).
Implementing this mechanism with the current GC would slowdown a bit
some common operations (lock/unlock), with the moving GC we would take
the hit anyway.
2) in alternative, use a 16 bit hash code field: the memory overhead is
smaller (5-10% for up to 20 chars, 2-3% for up to 100 chars on average),
it requires no alignment calculation so it's fast to access. The
drawback is that for very big hashtables ( 100 K entries) we'd have few
bits. A cheap way to increase bits would be to or the hash with the
first char shifted left 16 bits (but this wouldn't solve the issue for
strings all beginning with the same char). I'm sure this drawback is not
very significant, but I'm also sure someone will write a specific
benchmark to exploit this weakness and whine in some blog or on the list
about it:)

2 is very simple to implement so we could easily get some numbers (hint, hint).

lupus

-- 
-
[EMAIL PROTECTED] debian/rules
[EMAIL PROTECTED] Monkeys do it better
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Robert Jordan
Alan McGovern wrote:
 A thought struck me while i was dozing on the plane on the way home from the
 summit.
 
 Since strings are immutable, shouldn't it be possible to compute the
 hashcode once and store it rather than recomputing it over and over again?
 Is there some really obvious reason that i can't think of that would make
 this a bad idea?

This idea already came up a couple of years ago (2002 or so).

If memory serves (I was not able find the thread in the archives),
the drawback was the additional gint32 that every MonoString*
will get, regardless how small it is.

Someone (Paolo?) pointed out that adding this field will increase
the average string size by some unacceptable amount.

Therefore, the optimization should consider the string length
and append the hash code after the string's bytes
only if its length is greater than some amount (1K or so).

Robert

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Tinco Andringa

(Woops, only replied to kamil)
 
If Jerome is right and the overhead is only 4 bytes, then overhead shouldn't be 
a problem at all. The worst case size of a string would be 1 character, of 2 
bytes + something to end it with, like an int containing its length, 2 bytes, 
or a terminating character, probably 2 bytes too. Making it at least 4 bytes.  
A worst case scenario of 100% extra data, that isn't so bad right? It's way 
less than the overhead of using unicode, since you almost never use the 
enormous range of character 2-3 bytes offers you. Speaking of unicode/utf-16 
and memory, couldn't a lot of memory be spared if strings where stored as utf8 
internally, which would be converted back to utf-16 when more than 256 
different characters would be used? All you would need is a small translation 
table, and the view instructions it takes to look a char up in it. Or is this 
already being done? Just curious. Tinco
 
 Date: Fri, 30 Nov 2007 19:10:07 -0800 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: mono-devel-list@lists.ximian.com Subject: Re: [Mono-dev] 
 String.GetHashCode()  Extra memory cost, which would hit all allocated 
 strings, also those short ones. For some applications, which use millions of 
 small strings this would be unacceptable hit.  2007/11/30, Alan McGovern 
 [EMAIL PROTECTED]:  A thought struck me while i was dozing on the plane 
 on the way home from the  summit.   Since strings are immutable, 
 shouldn't it be possible to compute the  hashcode once and store it rather 
 than recomputing it over and over again?  Is there some really obvious 
 reason that i can't think of that would make  this a bad idea?   Alan. 
   ___  Mono-devel-list 
 mailing list  Mono-devel-list@lists.ximian.com  
 http://lists.ximian.com/mailman/listinfo/mono-devel-list --  
 Kamil Skalski http://nazgul.omega.pl 
 ___ Mono-devel-list mailing 
 list Mono-devel-list@lists.ximian.com 
 http://lists.ximian.com/mailman/listinfo/mono-devel-list



Express yourself instantly with MSN Messenger! MSN Messenger 
_
Express yourself instantly with MSN Messenger! Download today it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Alan McGovern
Also, just looking at the string source a bit more closely, it has a
GetCaseInsensitiveHashcode method too, so i'd assume that would need to be
cached too which would mean 8 bytes would be needed. This wouldn't scale
well.

Fair enough. Twas just an idea.

Alan.

On Dec 1, 2007 4:09 PM, Robert Jordan [EMAIL PROTECTED] wrote:

 Tinco Andringa wrote:
  (Woops, only replied to kamil)
 
  If Jerome is right and the overhead is only 4 bytes, then overhead
  shouldn't be a problem at all. The worst case size of a string would
  be 1 character, of 2 bytes + something to end it with, like an int
  containing its length, 2 bytes, or a terminating character, probably
  2 bytes too. Making it at least 4 bytes.  A worst case scenario of

 Look at a heavy string consumer: [g]mcs. The average string
 length it has to process is probably only 4-5 chars long.
 That's roundabout 12 bytes. Adding 4 bytes for the hash code
 is a huge overhead that only pays out if GetHashCode is
 called frequently, but this is definitely not a common scenario
 for most of the strings.

 Robert

 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Robert Jordan
Hi Alan,

Alan McGovern wrote:
 Also, just looking at the string source a bit more closely, it has a
 GetCaseInsensitiveHashcode method too, so i'd assume that would need to be
 cached too which would mean 8 bytes would be needed. This wouldn't scale
 well.
 
 Fair enough. Twas just an idea.

nah, don't give up too early! :) If the string is longer than n,
you could allocate 4 or even 8 additional bytes at the end of the
array for lazily computed hash codes.

Open questions:

- What's the optimal n for all common application classes
   mono is used for? :) It looks like it should be dynamically
   computed  adjusted based on profiling data collected at
   runtime.

- What's the real gain of this optimization? If n tends to
   be really large (say 1-4KB), what's the probability of
   such large strings being used as a hash key?
   This question can be reduced to how frequently is
   s.GetHashCode called, where s.Length  n.

   In mono's class libs sources there is only one place (I know
   of) where large strings are potentially hashed: Sys.Web's
   cache management.

Robert

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Alan McGovern
Well, 'N' could be computed by asking what percent bloat is 'OK' or it could
be computed by asking 'What % speedup do we want to make this a worthwhile
tradeoff'.

Currently a string has these two fields, an int and a char. Add in the
object overhead of (i think) 12 bytes per object, you're up to 18 bytes
straight off for a zero length string. Lets assume that you're only going to
precalculate the hashcode for strings greater than length 4.

So, take the case of a string of length 5. This puts the size of the string
at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15% extra memory usage. As
string length goes up, this % goes down.

I'm unsure what % speedup precalculating the hashcode would result in, but
i'm sure someone could run a few benchmarks if they have the time free. You
could say that you want at least a 5x or maybe 10x improvement before it
becomes worthwhile, this would then put a lower bound on the string length
to make it worthwhile.

Alan.

On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED] wrote:

 Hi Alan,

 Alan McGovern wrote:
  Also, just looking at the string source a bit more closely, it has a
  GetCaseInsensitiveHashcode method too, so i'd assume that would need to
 be
  cached too which would mean 8 bytes would be needed. This wouldn't scale
  well.
 
  Fair enough. Twas just an idea.

 nah, don't give up too early! :) If the string is longer than n,
 you could allocate 4 or even 8 additional bytes at the end of the
 array for lazily computed hash codes.

 Open questions:

 - What's the optimal n for all common application classes
   mono is used for? :) It looks like it should be dynamically
   computed  adjusted based on profiling data collected at
   runtime.

 - What's the real gain of this optimization? If n tends to
   be really large (say 1-4KB), what's the probability of
   such large strings being used as a hash key?
   This question can be reduced to how frequently is
   s.GetHashCode called, where s.Length  n.

   In mono's class libs sources there is only one place (I know
   of) where large strings are potentially hashed: Sys.Web's
   cache management.

 Robert

 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Alan McGovern
Also, worst case scenario for a zero length string would mean a 22%
increase, not a 100% increase as was said before.

Alan.

On Dec 1, 2007 6:01 PM, Alan McGovern [EMAIL PROTECTED] wrote:

 Well, 'N' could be computed by asking what percent bloat is 'OK' or it
 could be computed by asking 'What % speedup do we want to make this a
 worthwhile tradeoff'.

 Currently a string has these two fields, an int and a char. Add in the
 object overhead of (i think) 12 bytes per object, you're up to 18 bytes
 straight off for a zero length string. Lets assume that you're only going to
 precalculate the hashcode for strings greater than length 4.

 So, take the case of a string of length 5. This puts the size of the
 string at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15% extra memory
 usage. As string length goes up, this % goes down.

 I'm unsure what % speedup precalculating the hashcode would result in, but
 i'm sure someone could run a few benchmarks if they have the time free. You
 could say that you want at least a 5x or maybe 10x improvement before it
 becomes worthwhile, this would then put a lower bound on the string length
 to make it worthwhile.

 Alan.


 On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED] wrote:

  Hi Alan,
 
  Alan McGovern wrote:
   Also, just looking at the string source a bit more closely, it has a
   GetCaseInsensitiveHashcode method too, so i'd assume that would need
  to be
   cached too which would mean 8 bytes would be needed. This wouldn't
  scale
   well.
  
   Fair enough. Twas just an idea.
 
  nah, don't give up too early! :) If the string is longer than n,
  you could allocate 4 or even 8 additional bytes at the end of the
  array for lazily computed hash codes.
 
  Open questions:
 
  - What's the optimal n for all common application classes
mono is used for? :) It looks like it should be dynamically
computed  adjusted based on profiling data collected at
runtime.
 
  - What's the real gain of this optimization? If n tends to
be really large (say 1-4KB), what's the probability of
such large strings being used as a hash key?
This question can be reduced to how frequently is
s.GetHashCode called, where s.Length  n.
 
In mono's class libs sources there is only one place (I know
of) where large strings are potentially hashed: Sys.Web's
cache management.
 
  Robert
 
  ___
  Mono-devel-list mailing list
  Mono-devel-list@lists.ximian.com
  http://lists.ximian.com/mailman/listinfo/mono-devel-list
 


___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Steve Bjorg
It doesn't make much sense to pre-compute the hashcode for a string  
whose length is not greater than the cacheline size.  Anything that  
fits into the cacheline and is not memory bound is plenty fast enough.


For the same reason, the size and hashcode of a string would need to  
be juxtaposed in memory to load both on first hit.  Otherwise, the  
benefits are again wiped out due to the high cost of loading data  
from memory into the cache.  If optimization at that level isn't  
interesting (b/c we talking nanoseconds in this realm), then the  
bound for N will go up quite quickly before a cached hashcode makes  
sense.  And as Robert so insightfully pointed out, strings for which  
a pre-computed hashcode might make sense may never be used as keys.   
Hence, the complexity increase in th string code -- and therefore  
quality control headache -- introduced by it might not be worth the  
effort.


Just some assembly hacker thoughts on the subject... :)


- Steve

--
Steve G. Bjorg
http://wiki.mindtouch.com
http://wiki.opengarden.org


On Dec 1, 2007, at 10:02 AM, Alan McGovern wrote:

Also, worst case scenario for a zero length string would mean a 22%  
increase, not a 100% increase as was said before.


Alan.

On Dec 1, 2007 6:01 PM, Alan McGovern  [EMAIL PROTECTED]  
wrote:
Well, 'N' could be computed by asking what percent bloat is 'OK' or  
it could be computed by asking 'What % speedup do we want to make  
this a worthwhile tradeoff'.


Currently a string has these two fields, an int and a char. Add in  
the object overhead of (i think) 12 bytes per object, you're up to  
18 bytes straight off for a zero length string. Lets assume that  
you're only going to precalculate the hashcode for strings greater  
than length 4.


So, take the case of a string of length 5. This puts the size of  
the string at 18 + 5*2 = 28 bytes. Adding 4 bytes to this is 15%  
extra memory usage. As string length goes up, this % goes down.


I'm unsure what % speedup precalculating the hashcode would result  
in, but i'm sure someone could run a few benchmarks if they have  
the time free. You could say that you want at least a 5x or maybe  
10x improvement before it becomes worthwhile, this would then put a  
lower bound on the string length to make it worthwhile.


Alan.


On Dec 1, 2007 5:29 PM, Robert Jordan [EMAIL PROTECTED]  wrote:
Hi Alan,

Alan McGovern wrote:
 Also, just looking at the string source a bit more closely, it has a
 GetCaseInsensitiveHashcode method too, so i'd assume that would  
need to be
 cached too which would mean 8 bytes would be needed. This  
wouldn't scale

 well.

 Fair enough. Twas just an idea.

nah, don't give up too early! :) If the string is longer than n,
you could allocate 4 or even 8 additional bytes at the end of the
array for lazily computed hash codes.

Open questions:

- What's the optimal n for all common application classes
  mono is used for? :) It looks like it should be dynamically
  computed  adjusted based on profiling data collected at
  runtime.

- What's the real gain of this optimization? If n tends to
  be really large (say 1-4KB), what's the probability of
  such large strings being used as a hash key?
  This question can be reduced to how frequently is
  s.GetHashCode called, where s.Length  n.

  In mono's class libs sources there is only one place (I know
  of) where large strings are potentially hashed: Sys.Web's
  cache management.

Robert

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-12-01 Thread Mirco Bauer

On Sat, 2007-12-01 at 18:01 +, Alan McGovern wrote:
 Well, 'N' could be computed by asking what percent bloat is 'OK' or it
 could be computed by asking 'What % speedup do we want to make this a
 worthwhile tradeoff'.

Wouldn't be this a good case for desktop vs server runtime mode?
Which Mono already supports (mono --desktop) but not really
differentiates AFAIK.

-- 
Regards,

Mirco 'meebey' Bauer

PGP-Key ID: 0xEEF946C8

FOSS Developer[EMAIL PROTECTED]  http://www.meebey.net/
PEAR Developer[EMAIL PROTECTED] http://pear.php.net/
Debian Developer  [EMAIL PROTECTED]  http://www.debian.org/


signature.asc
Description: This is a digitally signed message part
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


[Mono-dev] String.GetHashCode()

2007-11-30 Thread Alan McGovern
A thought struck me while i was dozing on the plane on the way home from the
summit.

Since strings are immutable, shouldn't it be possible to compute the
hashcode once and store it rather than recomputing it over and over again?
Is there some really obvious reason that i can't think of that would make
this a bad idea?

Alan.
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-11-30 Thread Jerome Haltom
4 bytes of storage for every string on the system at all times? Dunno.
Add it up I guess.

On Sat, 2007-12-01 at 01:23 +, Alan McGovern wrote:
 A thought struck me while i was dozing on the plane on the way home
 from the summit.
 
 Since strings are immutable, shouldn't it be possible to compute the
 hashcode once and store it rather than recomputing it over and over
 again? Is there some really obvious reason that i can't think of that
 would make this a bad idea? 
 
 Alan.
 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list

___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list


Re: [Mono-dev] String.GetHashCode()

2007-11-30 Thread Kamil Skalski
Extra memory cost, which would hit all allocated strings, also those
short ones. For some applications, which use millions of small strings
this would be unacceptable hit.

2007/11/30, Alan McGovern [EMAIL PROTECTED]:
 A thought struck me while i was dozing on the plane on the way home from the
 summit.

 Since strings are immutable, shouldn't it be possible to compute the
 hashcode once and store it rather than recomputing it over and over again?
 Is there some really obvious reason that i can't think of that would make
 this a bad idea?

 Alan.

 ___
 Mono-devel-list mailing list
 Mono-devel-list@lists.ximian.com
 http://lists.ximian.com/mailman/listinfo/mono-devel-list




-- 
Kamil Skalski
http://nazgul.omega.pl
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list