aglinxinyuan opened a new pull request, #5658:
URL: https://github.com/apache/texera/pull/5658

   ### What changes were proposed in this PR?
   
   Pin the behavior of the Lucene `Analyzer` that the keyword-search operator uses when the user opts into case-sensitive matching. This analyzer skips the lowercasing step that `StandardAnalyzer` applies, so a regression here would silently turn case-sensitive search into case-insensitive search. No production-code changes.
   
   | Spec | Source class | Tests |
   | --- | --- | --- |
   | `CaseSensitiveAnalyzerSpec` | `CaseSensitiveAnalyzer` | 13 |
   
   Spec file name follows the `<srcClassName>Spec.scala` one-to-one convention.
   
   **Behavior pinned**
   
   | Surface | Contract |
   | --- | --- |
   | Mixed-case input | every emitted token preserves its original case |
   | All-uppercase / all-lowercase tokens | preserved (no normalization in either direction) |
   | Single-space splitting | tokens are separated cleanly |
   | Tabs and newlines | also split tokens |
   | Collapsed whitespace runs | no empty tokens emitted |
   | Embedded punctuation (`abc,def`) | stays one token (`WhitespaceTokenizer` only splits on whitespace) |
   | Sentence-final punctuation (`Hello, world!`) | stays attached (`Hello,`, `world!`) |
   | Empty input | no tokens |
   | Pure-whitespace input | no tokens |
   | `StopFilter` with `CharArraySet.EMPTY_SET` | English stop words (`the` / `and` / `a`) are NOT removed (vs `StandardAnalyzer`'s default behavior) |
   | Different field names | same tokenization (field-name independent) |
   | Successive `tokenStream` calls | each gets its own independent stream |
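
   As an illustration of the contract in the table above, here is a plain-Scala model of whitespace-only tokenization. It is a sketch and not the `CaseSensitiveAnalyzer` source: `WhitespaceModel` and `tokenize` are hypothetical names, and the model assumes tokens are split on `Character.isWhitespace` with no other filtering.

   ```scala
   import scala.collection.mutable.ListBuffer

   // Hypothetical stand-in for the tokenization contract; it does not use Lucene.
   object WhitespaceModel {
     // Split on runs of whitespace, drop empty pieces, leave case and punctuation untouched.
     def tokenize(text: String): List[String] = {
       val tokens = ListBuffer.empty[String]
       val current = new StringBuilder
       for (ch <- text) {
         if (Character.isWhitespace(ch)) {
           // A whitespace character closes the current token, if there is one.
           if (current.nonEmpty) {
             tokens += current.toString
             current.clear()
           }
         } else {
           current += ch
         }
       }
       // Flush the last token when the input does not end in whitespace.
       if (current.nonEmpty) tokens += current.toString
       tokens.toList
     }
   }
   ```

   Under this model, `WhitespaceModel.tokenize("Hello, world!")` keeps the punctuation attached to each token, and an empty or whitespace-only string yields an empty list.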
   
   The harness uses the canonical Lucene `reset → incrementToken → end → close` lifecycle and collects `CharTermAttribute` values into a buffer, the same pattern any future analyzer spec in this codebase should follow.
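
   A minimal sketch of that lifecycle, assuming Lucene 5 or later with the analysis-common module on the classpath. `TokenCollector` and `collect` are hypothetical names rather than the helper in the spec, and `WhitespaceAnalyzer` stands in for `CaseSensitiveAnalyzer`, whose constructor is not shown in this description.

   ```scala
   import org.apache.lucene.analysis.Analyzer
   import org.apache.lucene.analysis.core.WhitespaceAnalyzer
   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
   import scala.collection.mutable.ListBuffer

   // Hypothetical helper; the name and default field are assumptions for illustration.
   object TokenCollector {
     def collect(analyzer: Analyzer, text: String, field: String = "content"): List[String] = {
       val stream = analyzer.tokenStream(field, text)
       // Register the term attribute before reset() so it is populated on each token.
       val term = stream.addAttribute(classOf[CharTermAttribute])
       val buffer = ListBuffer.empty[String]
       try {
         stream.reset()
         while (stream.incrementToken()) buffer += term.toString
         stream.end()
       } finally {
         // close() must run even if consumption fails, so the analyzer can reuse the stream.
         stream.close()
       }
       buffer.toList
     }
   }

   // Usage with the stand-in analyzer:
   // TokenCollector.collect(new WhitespaceAnalyzer(), "Hello, World!")
   ```

   Closing the stream in `finally` matters because Lucene reuses token stream components per thread, and an unclosed stream makes the next `tokenStream` call on the same analyzer fail.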
   
   ### Any related issues, documentation, discussions?
   
   Closes #5654.
   
   ### How was this PR tested?
   
   Pure unit-test addition; verified locally with:
   
   - `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.keywordSearch.CaseSensitiveAnalyzerSpec"` (13 tests, all green)
   - `sbt scalafmtCheckAll` (clean)
   - CI to confirm
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7 [1M context])


