Re: [PR] [feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer [doris]

via GitHub Thu, 02 Jul 2026 00:08:55 -0700


Ryan19929 commented on PR #64667:
URL: https://github.com/apache/doris/pull/64667#issuecomment-4863101175


   > Since this is modeled after Lucene Kuromoji, have you checked how the 
Doris implementation performs in practice? A small benchmark for indexing 
throughput would be helpful.
   
   ## [Non-blocking] Viterbi hot path allocates per byte position; +16% 
measured with a small change
   
   here is what I measured (Release build, single pipeline task, sql cache off, 
50×1MB natural text, `sum(length(TOKENIZE(...)))`, best of 4):
   
   | parser                     | corpus | 50MB time | throughput      |
   |----------------------------|--------|-----------|-----------------|
   | kuromoji (this PR)         | ja     | 19.3 s    | 2.6 MB/s        |
   | kuromoji (prototype below) | ja     | 16.2 s    | 3.1 MB/s (+16%) |
   | icu                        | ja     | 11.2 s    | 4.5 MB/s        |
   | ik                         | zh     | 11.0 s    | 4.5 MB/s        |
   | chinese                    | zh     | 6.7 s     | 7.5 MB/s        |
   
   To be fair, the other rows are not apples-to-apples baselines: icu does much 
lighter work on Japanese than a full lattice/Viterbi morphological analysis, 
and ik/chinese run on a Chinese corpus, so some gap is expected and inherent to 
what kuromoji does. I'm only including them as a rough sense of scale — 
kuromoji is the slowest builtin analyzer but stays within the same order of 
magnitude, so I don't see this as blocking.
   
   That said, a profile shows a good chunk of the time goes to avoidable heap 
allocations, so there are two cheap wins in `KuromojiViterbi::segment()`:
   
   - `std::vector<std::vector<int>> ending_at(n + 1)` 
(`kuromoji_viterbi.cpp:124`): one vector object per document byte (~1M 
constructions for a 1MB doc) plus one heap allocation per reachable position.
   - `matches` declared inside the per-position loop 
(`kuromoji_viterbi.cpp:170`): one malloc/free per position; 
`common_prefix_search()` already clears it, so it can be hoisted.
   
   Prototype of exactly these two changes: 19.3s → 16.2s, byte-identical 
tokenizer output on the 50MB corpus. The `<` → `<=` flip preserves the original 
tie-break (chain iterates newest-first, original vector oldest-first).
   
   ```diff
   --- a/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
   +++ b/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
   @@ -121,18 +121,25 @@ void KuromojiViterbi::segment(std::string_view text, 
std::vector<KuromojiMorphem
        }
    
        std::vector<VNode> nodes;
   -    std::vector<std::vector<int>> ending_at(n + 1); // node indices ending 
at each byte position
   +    // Intrusive per-end-position chain: end_head[e] is the most recent 
node index
   +    // ending at byte position e, end_next[i] links to the previous one. 
This avoids
   +    // allocating n+1 std::vector objects per document.
   +    std::vector<int32_t> end_head(n + 1, -1);
   +    std::vector<int32_t> end_next;
    
        // BOS (index 0): ends at position 0, context id 0, zero cost.
        nodes.push_back(VNode {0, 0, 0, 0, 0, false, 0, 0, -1});
   -    ending_at[0].push_back(0);
   +    end_next.push_back(-1);
   +    end_head[0] = 0;
    
        // Add a node and relax it against all nodes ending at its start 
position.
        auto add_node = [&](uint32_t s, uint32_t e, int16_t lid, int16_t rid, 
int16_t wcost, bool known,
                            uint32_t wid) {
            int64_t best = KMJ_INF;
            int best_prev = -1;
   -        for (int pe : ending_at[s]) {
   +        // Chain is iterated newest-first; "<=" keeps the oldest node on 
cost ties,
   +        // matching the original insertion-order "<" selection exactly.
   +        for (int pe = end_head[s]; pe >= 0; pe = end_next[pe]) {
                const VNode& pv = nodes[static_cast<std::size_t>(pe)];
                if (pv.total_cost >= KMJ_INF) {
                    continue;
   @@ -140,7 +147,7 @@ void KuromojiViterbi::segment(std::string_view text, 
std::vector<KuromojiMorphem
                const int64_t c =
                        pv.total_cost + 
_dict.connection_cost(static_cast<uint32_t>(pv.right_id),
                                                              
static_cast<uint32_t>(lid));
   -            if (c < best) {
   +            if (c <= best) {
                    best = c;
                    best_prev = pe;
                }
   @@ -154,12 +161,15 @@ void KuromojiViterbi::segment(std::string_view text, 
std::vector<KuromojiMorphem
            const auto idx = static_cast<int>(nodes.size());
            nodes.push_back(
                    VNode {s, e, lid, rid, wcost, known, wid, best + wcost + 
penalty, best_prev});
   -        ending_at[e].push_back(idx);
   +        end_next.push_back(end_head[e]);
   +        end_head[e] = idx;
        };
    
        uint32_t pos = 0;
   +    // Reused across positions; common_prefix_search clears it on entry.
   +    std::vector<KuromojiDictionary::PrefixMatch> matches;
        while (pos < n) {
   -        if (ending_at[pos].empty()) {
   +        if (end_head[pos] < 0) {
                pos += decode_utf8(text, pos).len; // unreachable boundary; skip
                continue;
            }
   @@ -167,7 +177,6 @@ void KuromojiViterbi::segment(std::string_view text, 
std::vector<KuromojiMorphem
            const auto before = nodes.size();
    
            // System-dictionary words (common-prefix search).
   -        std::vector<KuromojiDictionary::PrefixMatch> matches;
            _dict.common_prefix_search(text.data() + pos, n - pos, &matches);
            bool any_known = false;
            for (const auto& mt : matches) {
   @@ -219,14 +228,14 @@ void KuromojiViterbi::segment(std::string_view text, 
std::vector<KuromojiMorphem
        // EOS: best node ending at n connected to the EOS context (id 0).
        int64_t best = KMJ_INF;
        int best_prev = -1;
   -    for (int pe : ending_at[n]) {
   +    for (int pe = end_head[n]; pe >= 0; pe = end_next[pe]) {
            const VNode& pv = nodes[static_cast<std::size_t>(pe)];
            if (pv.total_cost >= KMJ_INF) {
                continue;
            }
            const int64_t c =
                    pv.total_cost + 
_dict.connection_cost(static_cast<uint32_t>(pv.right_id), 0);
   -        if (c < best) {
   +        if (c <= best) {
                best = c;
                best_prev = pe;
            }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer [doris]

Reply via email to