Re: [PR] feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests [lucenenet]

via GitHub Sat, 18 Oct 2025 10:45:10 -0700


NightOwl888 commented on code in PR #1154:
URL: https://github.com/apache/lucenenet/pull/1154#discussion_r2407501929



##########
src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs:
##########
@@ -254,80 +254,83 @@ private void Load(string dictRoot)
         /// <summary>
         /// Load the datafile into this <see cref="BigramDictionary"/>
         /// </summary>
-        /// <param name="dctFilePath">dctFilePath path to the Bigramdictionary (bigramdict.dct)</param>
+        /// <param name="dctFilePath">Path to the Bigramdictionary (bigramdict.dct)</param>
         /// <exception cref="IOException">If there is a low-level I/O error</exception>
         public virtual void LoadFromFile(string dctFilePath)
         {
-            int i, cnt, length, total = 0;
             // The file only counted 6763 Chinese characters plus 5 reserved slots 3756~3760.
             // The 3756th is used (as a header) to store information.
-            int[]
-            buffer = new int[3];
-            byte[] intBuffer = new byte[4];
-            string tmpword;
+
+            // LUCENENET: Removed buffer and intBuffer arrays since BinaryReader handles reading values directly in a more type-safe and readable way.
+            // LUCENENET specific - refactored constants for clarity
+            const int HEADER_POSITION = 3755;
+            const int MAX_VALID_LENGTH = 1000;
+
             //using (RandomAccessFile dctFile = new RandomAccessFile(dctFilePath, "r"))
             using var dctFile = new FileStream(dctFilePath, FileMode.Open, FileAccess.Read);
+            using var reader = new BinaryReader(dctFile);
 
             // GB2312 characters 0 - 6768
-            for (i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + CHAR_NUM_IN_FILE; i++)
+            for (int i = GB2312_FIRST_CHAR; i < GB2312_FIRST_CHAR + CHAR_NUM_IN_FILE; i++)
             {
-                string currentStr = GetCCByGB2312Id(i);
-                // if (i == 5231)
-                // System.out.println(i);
 
-                dctFile.Read(intBuffer, 0, intBuffer.Length);
-                // the dictionary was developed for C, and byte order must be converted to work with Java
-                cnt = ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();
+                string currentStr = GetCCByGB2312Id(i); 
+                int cnt;
+                try
+                {
+                   cnt = reader.ReadInt32();  // LUCENENET: Use BinaryReader methods instead of ByteBuffer
+                }
+                catch (EndOfStreamException)

Review Comment:
   I think I am going to have to reconsider this decision.
   
   If the user supplies invalid input, we need them to be informed with an 
exception. This information is too valuable to swallow.
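   
   To illustrate the behavior described above, here is a minimal sketch, not the actual change. `DictionaryCountReader` and `ReadCounts` are hypothetical names that do not exist in Lucene.NET; the only real API relied on is `BinaryReader.ReadInt32()`, which throws `EndOfStreamException` when fewer than 4 bytes remain.
   
   ```csharp
   using System;
   using System.IO;
   using System.Text;

   // Hypothetical helper for illustration only; not part of Lucene.NET.
   public static class DictionaryCountReader
   {
       // Reads `count` little-endian 32-bit integers from the stream.
       // There is deliberately no try/catch: if the input is truncated,
       // EndOfStreamException propagates to the caller instead of being swallowed.
       public static int[] ReadCounts(Stream input, int count)
       {
           // leaveOpen: true so the caller keeps ownership of the stream.
           using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);
           var result = new int[count];
           for (int i = 0; i < count; i++)
           {
               result[i] = reader.ReadInt32();
           }
           return result;
       }
   }
   ```
   
   With a truncated stream, the caller sees the exception rather than a silently partial dictionary, which is the signal a user with invalid input needs.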
   
   Aside from that, we are not actually doing a valid load test, because this 
catch block gives the loader an "out".
   
   Given that the file from Lucene is 2MB and our test file is 500KB, I don't 
think it is such a big deal to include the entire file for the test. So, let's 
adjust the tests to use the original input from Lucene so we are comparing 
apples to apples.
   
   
[custom-dictionary-input.zip](https://github.com/user-attachments/files/22727275/custom-dictionary-input.zip)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
