Hello Shad, Yes, your solution is better than mine. It certainly gets rid of these annoying exceptions. I'm still new to the "run lucene tests", so I admit I've not run the entire suite every time I change something. Part of the reason is that I'm using my machine for other work in the meantime. But a significant part of the reason is pure laziness and egoism, I'll admit it.
I'll better my life, I promise. Vincent -----Original Message----- From: Shad Storhaug [mailto:[email protected]] Sent: Friday, March 24, 2017 3:21 AM To: Van Den Berghe, Vincent <[email protected]> Cc: [email protected] Subject: RE: Proposal: UnicodeUtils.ToCharArray implementation Vincent, I put this in, but was getting exceptions because the array overflowed in the Lucene.Net.Misc.Util.Fst.TestFSTsMisc.TestRandomWords() test. I ended up changing it since it was highly optimized for non-surrogate pairs, but if using surrogate pairs it was copying arrays over and over to resize the array on every iteration in a tight loop. So, to compromise, we just double the count in cases where it is 1024 chars or less and do a loop to count the actual chars to set the array size exactly if it is over that. Running all of the tests in Analysis.Common used to take 14-16 minutes, now it takes 10-11 minutes. Now it looks like this: public static char[] ToCharArray(int[] codePoints, int offset, int count) { if (count < 0) { throw new System.ArgumentException(); } int countThreashhold = 1024; // If the number of chars exceeds this, we count them instead of allocating count * 2 // LUCENENET: as a first approximation, assume each codepoint // is 2 characters (since it cannot be longer than this) int arrayLength = count * 2; // LUCENENET: if we go over the threashhold, count the number of // chars we will need so we can allocate the precise amount of memory if (count > countThreashhold) { arrayLength = 0; for (int r = offset, e = offset + count; r < e; ++r) { arrayLength += codePoints[r] < 0x010000 ? 1 : 2; } if (arrayLength < 1) { arrayLength = count * 2; } } // Initialize our array to our exact or oversized length. // It is now safe to assume we have enough space for all of the characters. char[] chars = new char[arrayLength]; int w = 0; for (int r = offset, e = offset + count; r < e; ++r) { int cp = codePoints[r]; if (cp < 0 || cp > 0x10ffff) { throw new System.ArgumentException(); } if (cp < 0x010000) { chars[w++] = (char)cp; } else { chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_)); chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_)); } } var result = new char[w]; Array.Copy(chars, result, w); return result; } Anyway, if you think we need to adjust the threshold or if you have any other suggestions, let me know. Thanks, Shad Storhaug (NightOwl888) -----Original Message----- From: Van Den Berghe, Vincent [mailto:[email protected]] Sent: Wednesday, March 22, 2017 9:59 PM To: [email protected] Subject: Proposal: UnicodeUtils.ToCharArray implementation When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows: public static char[] ToCharArray(int[] codePoints, int offset, int count) { if (count < 0) { throw new System.ArgumentException(); } // as a first approximation, assume each codepoint is 1 character char[] chars = new char[count]; int w = 0; for (int r = offset, e = offset + count; r < e; ++r) { int cp = codePoints[r]; if (cp < 0 || cp > 0x10ffff) { throw new System.ArgumentException(); } if (cp < 0x010000) { chars[w++] = (char)cp; } else { chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_)); // if we need more room, add enough to hold 2-character codepoints for the rest of // the string. Normally, this resize operation will happen at most once. if (w >= chars.Length) Array.Resize(ref chars, chars.Length + (e - r) * 2 - 1); chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_)); } } // resize to the exact length: it's slightly faster to check if the resize is needed if (w != chars.Length) Array.Resize(ref chars, w); return chars; } This avoids exception overhead and at least one heap allocation. If no code points generate 2 characters, the method doesn't allocate anything extra than the result array. Vincent
