mikemccand commented on code in PR #13239:
URL: https://github.com/apache/lucene/pull/13239#discussion_r1546413938


##########
lucene/analysis/common/src/test/org/apache/lucene/analysis/email/TLDs.txt:
##########
@@ -1,6 +1,5 @@
 # Generated from IANA TLD Database (gradlew generateTlds).
 aaa
 aarp
-abarth

Review Comment:
   Curious that all these TLDs are deleted...



##########
lucene/analysis/icu/src/data/uax29/Default.rbbi:
##########
@@ -17,6 +17,12 @@
 # This file is from ICU (with some small modifications, to avoid CJK dictionary break,
 # and status code change related to that)
 #
+# To update this file: grab rule file corresponding to your ICU version, e.g.:
+#   https://github.com/unicode-org/icu/blob/release-74-2/icu4c/source/data/brkitr/rules/word.txt
+#
+# * Prevent dictionary break: disable final rule that would chain han and kana
+# * For kana rules, change from 400 to 300 status (since there's no dictionary break)

Review Comment:
   Thank you for adding these instructions!
   
   Where do we use this `Default.rbbi`?  Is it for users to pass to `ICUTokenizer` to run (our slightly modified) UAX #29 tokenization?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

