[GitHub] [lucenenet] NightOwl888 opened a new pull request, #725: PERFORMANCE: Reduced char[] and string allocations (mostly in analysis)

GitBox Thu, 27 Oct 2022 01:10:57 -0700


NightOwl888 opened a new pull request, #725:
URL: https://github.com/apache/lucenenet/pull/725


   - `Lucene.Net.Document.CompressionTools::CompressString()`: Eliminated 
unnecessary `ToCharArray()` allocation
   - `Lucene.Net.Codecs.SimpleText.SimpleTextUtil::Write()`: Removed 
unnecessary `ToCharArray()` allocation
   - `Lucene.Net.Analysis.CharFilters.HTMLStripCharFilter`: Removed the
allocation made when parsing hexadecimal integers by using the
`J2N.Numerics.Int32.TryParse()` overloads that accept index, length, and radix.
   - Added a `CharArrayFormatter` struct so that the string passed to
`Debugging.Assert()` is only constructed (and allocated) after an assertion
fails.
   - Added a `maxStackByteLimit` system property that raises or lowers the
size threshold, in bytes, above which buffers are allocated on the heap
instead of the stack.
   - `StemmerOverrideFilter.Builder`: Added overloads of `Add()` for `char[]`
and `ICharSequence`. Added guard clauses. Modified to use `Span<char>` on the
stack when under the `maxStackByteLimit` setting.
   - `Lucene.Net.Util.UnicodeUtil`: Added an overload of `UTF16toUTF8()` for 
`Span<T>` source to `BytesRef` destination. Added documentation and guard 
clauses. Renamed `s` parameter to `source` to be consistent across all 
overloads.
   - `Lucene.Net.Analysis.Util.CharacterUtils`: Use spans and stackalloc to 
reduce heap allocations when lowercasing.
   - `Lucene.Net.Util.TestUnicodeUtil::TestUTF8toUTF32()`: Added tests for 
`ICharSequence` and `char[]` overloads, changed the original test to test 
`string` instead of `char[]`.
   - `Lucene.Net.Analysis.Util.SegmentingTokenizerBase`: Removed unnecessary 
string allocations that were added during the port due to missing APIs that are 
now available.
   - `Lucene.Net.Analysis.Ja.GraphvizFormatter`: Removed unnecessary 
`surfaceForm` variable string allocation.
   - `Lucene.Net.Analysis.In.IndicNormalizer`: Replaced static constructor with
inline `LoadScripts()` method. Moved the `scripts` field to ensure
`decompositions` is initialized first.
   - `Lucene.Net.Analysis.In.IndicNormalizer`: Refactored `ScriptData` from 
using `Dictionary<Regex, ScriptData>` to using `List<ScriptData>` which 
eliminated unnecessary hashtable lookup. Use static fields for `unknownScript` 
and `[ThreadStatic] previousScriptData` to cache the last script seen to 
optimize character script matching.
   - `Lucene.Net.Analysis.Th.ThaiWordBreaker`: Removed unnecessary string 
allocations and concatenation. Use `CharsRef` to reuse the same memory. Removed 
`Regex` and replaced with `UnicodeSet` to detect Thai code points, since the 
latter doesn't require converting to a string to detect a match.
   - `Lucene.Net.Analysis.Ga.IrishLowerCaseFilter`: Use stack and spans to 
reduce allocations and improve throughput.
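
Several of the items above rely on the same pattern: use a stack buffer when the input is small and fall back to the heap when it is not. The following is a minimal sketch of that pattern, not the code from this pull request. The class name `StackOrHeapExample`, the method `ToLowerInvariantCopy()`, and the constant `MaxStackCharLimit` are hypothetical and stand in for whatever the `maxStackByteLimit` setting controls in the actual implementation.

```csharp
using System;
using System.Buffers;

public static class StackOrHeapExample
{
    // Hypothetical threshold (in chars) for this sketch only; the PR
    // describes a configurable maxStackByteLimit property instead.
    private const int MaxStackCharLimit = 256;

    public static string ToLowerInvariantCopy(ReadOnlySpan<char> source)
    {
        char[] rented = null;

        // Small inputs use a fixed-size stack buffer; larger inputs rent
        // an array from the shared pool so no new array is allocated.
        Span<char> buffer = source.Length <= MaxStackCharLimit
            ? stackalloc char[MaxStackCharLimit]
            : (rented = ArrayPool<char>.Shared.Rent(source.Length));
        try
        {
            // Lowercases into the destination span and returns the number
            // of chars written, without allocating an intermediate string.
            int written = source.ToLowerInvariant(buffer);
            return new string(buffer.Slice(0, written));
        }
        finally
        {
            // Return the pooled array only if the heap path was taken.
            if (rented != null)
            {
                ArrayPool<char>.Shared.Return(rented);
            }
        }
    }
}
```

In this sketch the only allocation on the small-input path is the final result string. A real analysis filter would typically write back into its existing term buffer rather than create a string at all.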


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

