Re: [PR] OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) (opennlp)

via GitHub Thu, 25 Jun 2026 11:16:12 -0700


krickert commented on PR #1110:
URL: https://github.com/apache/opennlp/pull/1110#issuecomment-4802809191


   @rzo1 Both addressed (tip `f2d1d8cc`).
   
   **Loader symmetry.** I kept the two loaders deliberately different, 
documented why, and closed the test gap. The difference is real rather than an 
oversight: `WordBreakProperty.txt` always has a `code ; property` shape, so a 
missing `;` is corruption and fails loud. `ExtendedPictographic.txt` is a 
*filtered single-property* file (only `Extended_Pictographic`, with the 
property column stripped), so a line with no `;` is the normal, well-formed 
case and the code points are taken whole. Forcing it to fail on a missing `;` 
would reject valid data; I added a comment on `ExtendedPictographic.parse` 
spelling that out. For the test asymmetry: `ExtendedPictographic.parse` is now 
package-visible and has a malformed-data test (`parseFailsLoudOnMalformedHex`) 
asserting that a non-hex code-point column fails loud with an 
`IllegalArgumentException` naming the resource, the same fail-loud contract 
`WordBreakProperty` already had.
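
   For readers following along, the two contracts can be sketched roughly as 
below. This is an illustration only, not the code in the PR: the class 
`LoaderContractSketch` and the methods `parseTwoColumn`, `parseSingleProperty`, 
`parseRange` and `stripComment` are hypothetical names, and the `#` comment 
handling and `XXXX..YYYY` range syntax are assumed from the usual UCD file 
layout.

```java
import java.util.ArrayList;
import java.util.List;

public class LoaderContractSketch {

    /** Parses "XXXX" or "XXXX..YYYY" (hex) into an inclusive {start, end} pair. */
    static int[] parseRange(String field, String resource) {
        String[] ends = field.trim().split("\\.\\.");
        try {
            int start = Integer.parseInt(ends[0].trim(), 16);
            int end = ends.length > 1 ? Integer.parseInt(ends[1].trim(), 16) : start;
            return new int[] {start, end};
        } catch (NumberFormatException e) {
            // Fail loud and name the resource, for both loaders.
            throw new IllegalArgumentException(
                "Malformed code point '" + field + "' in " + resource, e);
        }
    }

    /** Drops a trailing '#' comment (assumed UCD convention) and trims. */
    static String stripComment(String line) {
        int hash = line.indexOf('#');
        return (hash >= 0 ? line.substring(0, hash) : line).trim();
    }

    /** "code ; property" shape: a missing ';' is treated as corruption. */
    static List<int[]> parseTwoColumn(List<String> lines, String resource) {
        List<int[]> out = new ArrayList<>();
        for (String raw : lines) {
            String line = stripComment(raw);
            if (line.isEmpty()) {
                continue;
            }
            int semi = line.indexOf(';');
            if (semi < 0) {
                throw new IllegalArgumentException(
                    "Missing ';' in " + resource + ": " + raw);
            }
            out.add(parseRange(line.substring(0, semi), resource));
        }
        return out;
    }

    /** Filtered single-property shape: a line without ';' is well-formed. */
    static List<int[]> parseSingleProperty(List<String> lines, String resource) {
        List<int[]> out = new ArrayList<>();
        for (String raw : lines) {
            String line = stripComment(raw);
            if (line.isEmpty()) {
                continue;
            }
            int semi = line.indexOf(';');
            // No ';' means the whole line is the code-point field.
            String field = semi < 0 ? line : line.substring(0, semi);
            out.add(parseRange(field, resource));
        }
        return out;
    }
}
```

   In this sketch both loaders share the hex-parsing failure path, so the only 
asymmetry left is how a missing `;` is treated.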
   
   **`WordType.of` leading-script heuristic.** Added a note on `WordType.of`: 
the script category is taken from the first script code point in the range. UAX 
#29 word segments are single-script in practice, so for an unusual mixed-script 
run `WordType.of` reports the leading script rather than making a 
per-character determination.
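
   A minimal sketch of what a leading-script lookup can look like, using the 
JDK's `Character.UnicodeScript` as a stand-in. This is not the PR's `WordType` 
implementation: `LeadingScriptSketch` and `leadingScript` are hypothetical 
names, and skipping `COMMON`, `INHERITED` and `UNKNOWN` code points is an 
assumption about what "first script code point" means.

```java
import java.lang.Character.UnicodeScript;

public class LeadingScriptSketch {

    /**
     * Returns the script of the first code point in [start, end) that has a
     * specific script, or COMMON if the range has none.
     */
    static UnicodeScript leadingScript(CharSequence text, int start, int end) {
        int i = start;
        while (i < end) {
            int cp = Character.codePointAt(text, i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: digits, punctuation and combining marks do not
            // decide the script, so keep scanning past them.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                return script;
            }
            i += Character.charCount(cp);
        }
        return UnicodeScript.COMMON;
    }
}
```

   For a mixed-script run such as Latin letters followed by a Cyrillic letter, 
this returns the Latin script and ignores the rest, which is the trade-off the 
note describes.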
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
