NightOwl888 opened a new issue, #1153:
URL: https://github.com/apache/lucenenet/issues/1153

   ### Is there an existing issue for this?
   
   - [x] I have searched the existing issues
   
   ### Task description
   
   In `BigramDictionary` and `WordDictionary` of `Lucene.Net.Analysis.SmartCn`, the code that loads from files uses `ByteBuffer` to read `int` values as little endian. It was done this way in Java because the C-derived code uses little-endian byte order, while Java's default is big endian. In .NET, however, the default byte order is little endian, so we can eliminate these extra allocations by wrapping the `FileStream` in a `BinaryReader` and calling `BinaryReader.ReadInt32()`. This removes the `intBuffer` allocation as well as the allocation from `ByteBuffer.Wrap()` in each loop iteration, but the main benefit is decoupling from `ByteBuffer`.
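   
   A minimal sketch of the pattern (the class, method, and variable names here are illustrative only, not the actual `BigramDictionary`/`WordDictionary` members, and the real loading code reads headers and tables rather than a flat block of ints):
   
   ```c#
   using System.IO;
   
   internal static class DictionaryLoadingSketch
   {
       // Hypothetical helper: wrap the FileStream in a BinaryReader and read
       // little-endian ints directly, with no per-iteration byte[]/ByteBuffer
       // allocations.
       public static int[] ReadInt32Block(string fileName, int count)
       {
           using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
           using (var reader = new BinaryReader(fs))
           {
               var result = new int[count];
               for (int i = 0; i < count; i++)
               {
                   // BinaryReader.ReadInt32() reads 4 bytes in little-endian order,
                   // which matches the byte order of the dictionary files.
                   result[i] = reader.ReadInt32();
               }
               return result;
           }
       }
   }
   ```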
   
   In addition, in `BigramDictionary`, we can eliminate the calls to 
`tmpword.ToCharArray()` by using `ReadOnlySpan<char>`.
   
   ```c#
   char[] carray = tmpword.ToCharArray();
   ```
   
   can be changed to:
   
   ```c#
   ReadOnlySpan<char> carray = tmpword.AsSpan();
   ```
   
   We should also change the `carray` parameter to `ReadOnlySpan<char>` in the `Hash1()`, `Hash2()`, `GetItemIndex()`, `GetBigramItemIndex()` and `GetFrequency()` methods. No other changes to those methods should be required, as shown in the sketch below.
   
   > Note: the `wordItem_charArrayTable` in `WordDictionary` could be changed to use `ReadOnlyMemory<char>`, but I don't think it is worth the added complexity, and it would diverge significantly from upstream. References to all of the backing strings would have to be maintained to keep them in scope.
   
   ### Tests
   
   Note that this code loads from disk and has only been tested manually in the past. The comments at the link below describe how to test it:
   
   https://github.com/apache/lucenenet/blob/a0578d6ea2c17c06be925adb15acd3c64d5fc824/src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs#L106-L163
   
   It would be great if we could automate the tests, but doing so will require learning enough about the dictionary file format to create a small dictionary file to load, so we don't add multiple MB of data to the test project. A few KB is all that is required to ensure the data loads from disk correctly. We would also need to devise a way to determine that loading was successful, which will require some analysis and exploration.
   
   > Note also that the Kuromoji project loads dictionary data in a similar 
manner and also doesn't have automated tests.

