krickert commented on PR #1101:
URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4754147085
Nice! Thanks for the fast feedback — I knew there was a lot packed in here
and totally expected
the questions. This code had caffeine behind it plus some LLM-assisted
review passes, but I read every token it produced and I own all of it.
I'll work through everything after I digest, but here's where my head's at
on the big ones:
## Splitting it up — I'd vote two stacks, not stacking onto the other PRs
Agreed it's too much for one review. Rather than stack onto the in-flight
PRs (this stuff is
fairly self-contained and would just tangle with those), I'd propose
splitting **this** along its
natural dependency seam into two stacked PRs:
**Stack 1 — Unicode normalization foundation** (the original OPENNLP-1850
scope):
`CharClass` / `CodePointSet`, the `White_Space` / `Dash` sets, the
`CharSequenceNormalizer` family
(NFC, NFKC, whitespace, dash, case-fold, accent-fold, quotes, digits, …),
`NormalizedText` +
`OffsetMap`, `TextNormalizer` / `TextAnalyzer`, plus the DL
`InferenceOptions` whitespace/dash
opt-ins. No segmentation, no perf engine — lowest risk, lands first.
**Stack 2 — UAX #29 tokenizer + the advanced layers** (builds on Stack 1):
`WordSegmenter` / `WordTokenizer` / `WordType` and the boundary engine (this
is where the perf
work and the transition/lookup tables live), the layered `Term` model,
confusable folding
(UTS #39), and the per-language profiles — all of which sit on top of Stack
1's normalizers + the
tokenizer.
That contains the perf/lookup-table conversation entirely in Stack 2 and
lets the foundation land
clean. Is that ok?
## "Take the lookup tables out of the API" — yep, 100% agreed
Good catch and an easy yes. `opennlp-api` should stay a thin contract
module. I'll push the
data/engine-bearing classes (`CharClass`, `CodePointSet`,
`UnicodeWhitespace`, `UnicodeDash`) down
into `opennlp-runtime` and leave only the contracts in the API —
`CharSequenceNormalizer`, and the
lightweight value types (`NormalizedText` / `OffsetMap`) if we still want
them reachable from the
DL side, since those carry zero tables. The UAX #29 tables and the
confusables data already live in
`opennlp-runtime`, so nothing table-shaped is left in the API once this
moves.
Key point though: this is a **module-placement** fix, not an API redesign.
The interface shape
itself I think is sound (more below) - we're just relocating the
implementation.
## The speed overlap, and the shape
At first as designed I discovered performance concerns which drove a lot of
the approach, so let me just put the numbers down. Measured with the project's
own JMH harness (2 forks, warmed up), against **Lucene 10.3.2**
`StandardTokenizer` and today's OpenNLP `SimpleTokenizer`, on a Latin corpus
(Mchars/s):
| tokenizer | Mchars/s | vs OpenNLP today | vs
Lucene 10 |
| --------------------------------- | -------- | ---------------- |
------------ |
| OpenNLP `SimpleTokenizer` (today) | 282 | 1.00× | 0.88×
|
| Lucene 10.3.2 `StandardTokenizer` | 320 | 1.13× | 1.00×
|
| **new — boundary scan** | **388** | **1.38×** |
**1.21×** |
| **new — streaming tokenize** | **335** | **1.19×** |
**1.05×** |
(On the CJK/emoji-heavy corpus the boundary scan is ~parity with Lucene and
still ~1.7× over `SimpleTokenizer`.)
Two things I'd flag:
- OpenNLP's tokenizer is currently *behind* Lucene; this puts it *ahead* —
while also being
current-Unicode-correct (17.0) and offset-preserving, which
`StandardTokenizer` isn't doing for
free.
- The boundary output is byte-identical before and after the perf work —
1944/1944 on the official
UAX #29 conformance suite the whole way through. The speed comes from a
transition table that's
*derived from* the readable rule cascade at class-load, so the rules stay
the source of truth and
the table can't drift from them.
On the overlap worry: this is tokenizer-algorithm perf, which is orthogonal
to the ME thread-safety / allocation work — different code, complementary in
spirit. Since there's no collision I think it's OK.
And on shape: the public surface (the `Tokenizer` interface impl, the
streaming `TokenHandler`, the `Term` / `Dimension` layering) is independent of
the engine internals — we could return or even swap the boundary engine without
touching the interface. So I'd like to keep the interface as-is and treat the
engine/tables as the thing we iterate on. Is that ok?
The Lucene head-to-head benchmark isn't meant to be a competition - but a
baselike. I can add it to the benchmark though but don't want to complicate
the build. No problem checking it in temporarily if you want to check it out
though.
Lookin' forward to some feedback. Thanks again for the quick turnaround!
This was honestly a lot of fun.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]