[PR] OPENNLP-1833 - gRPC document analysis service with embeddings and chunking (opennlp-sandbox)

via GitHub Wed, 10 Jun 2026 06:57:29 -0700


krickert opened a new pull request, #493:
URL: https://github.com/apache/opennlp-sandbox/pull/493


   See https://issues.apache.org/jira/browse/OPENNLP-1833
   
   Draft for early feedback on the gRPC expansion. This branch restructures the
   opennlp-grpc sandbox module around a v1 document-centric API and builds it
   out into a working analysis server.
   
   ## API / contract
   - AnalyzeDocument RPC over a v1 proto contract (opennlp_document, opennlp_pipeline, opennlp_service)
   - Explicit offset encoding selection (UTF-8 byte / UTF-16 / code point) with server-side mapping
   - AnalysisProfile / AnalysisOptions with strict validation: unsupported steps, backends and
     models are rejected with meaningful gRPC status codes instead of being silently ignored
   - Model bundle catalog (ListModelBundles) reflecting actually loaded components
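
The offset encoding selection above can be illustrated with a minimal sketch. The class and method names below are hypothetical and not taken from the PR; the sketch only assumes that the server holds text as a Java `String` (UTF-16) and that offsets fall on code point boundaries.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of the PR. */
final class OffsetMappingSketch {
    private OffsetMappingSketch() {}

    /** Number of Unicode code points that precede the given UTF-16 offset. */
    static int toCodePointOffset(String text, int utf16Offset) {
        return text.codePointCount(0, utf16Offset);
    }

    /**
     * Number of UTF-8 bytes that precede the given UTF-16 offset.
     * Assumes the offset does not split a surrogate pair.
     */
    static int toUtf8ByteOffset(String text, int utf16Offset) {
        return text.substring(0, utf16Offset)
                   .getBytes(StandardCharsets.UTF_8)
                   .length;
    }
}
```

Each call scans the prefix, so a server mapping many spans per document would more likely build a lookup table once per document; the sketch leaves that out for brevity.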
   
   ## Embeddings
   - EmbeddingProvider abstraction with ONNX Runtime CPU and CUDA implementations behind a
     strict factory (model.embedder.backend=onnx|cuda)
   - Models declared per id in server config, loaded eagerly at startup, dimension read from
     ONNX session metadata
   - Standard single-segment BERT encoding (mask=1, types=0), OOV-to-UNK mapping, truncation
     at 512 wordpieces, deterministic native resource management
   - GPU build via -Dgpu swaps onnxruntime for onnxruntime_gpu so the CPU and CUDA runtimes
     never coexist on the classpath
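
The single-segment BERT encoding described above might look roughly like the following sketch. All names here are hypothetical, and it assumes the 512 limit counts the `[CLS]` and `[SEP]` tokens; the PR does not say whether the limit applies before or after the special tokens are added.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of single-segment BERT inputs; not the PR's code. */
final class BertInputSketch {
    static final int MAX_LEN = 512; // assumed to include [CLS] and [SEP]

    /** The three tensors a typical BERT ONNX export expects. */
    static final class Encoded {
        final long[] inputIds;
        final long[] attentionMask;
        final long[] tokenTypeIds;

        Encoded(long[] inputIds, long[] attentionMask, long[] tokenTypeIds) {
            this.inputIds = inputIds;
            this.attentionMask = attentionMask;
            this.tokenTypeIds = tokenTypeIds;
        }
    }

    static Encoded encode(List<String> wordpieces, Map<String, Integer> vocab) {
        int cls = vocab.get("[CLS]");
        int sep = vocab.get("[SEP]");
        int unk = vocab.get("[UNK]");

        // Truncate the wordpieces so the full sequence fits in MAX_LEN.
        int body = Math.min(wordpieces.size(), MAX_LEN - 2);
        int n = body + 2;

        long[] ids = new long[n];
        ids[0] = cls;
        for (int i = 0; i < body; i++) {
            // Out-of-vocabulary pieces map to [UNK].
            ids[i + 1] = vocab.getOrDefault(wordpieces.get(i), unk);
        }
        ids[n - 1] = sep;

        long[] mask = new long[n];
        Arrays.fill(mask, 1L);        // mask=1: no padding in a single sequence
        long[] types = new long[n];   // types=0: Java zero-initializes the array

        return new Encoded(ids, mask, types);
    }
}
```

Feeding these arrays to ONNX Runtime and closing the native tensors deterministically is omitted here, since the exact provider code is not shown in the PR description.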
   
   ## Chunking (RAG-style segmentation)
   - sentence, token-window and semantic algorithms via chunk_embed_configs and PIPELINE_STEP_CHUNK
   - Per-chunk embeddings with multiple models per group, plus group statistics
   - Semantic chunking on consecutive-sentence cosine similarity with percentile or fixed
     thresholds and min/max chunk size constraints
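
As a rough illustration of the semantic chunking idea above, the sketch below splits a document wherever the cosine similarity between consecutive sentence embeddings drops below a threshold. The names are hypothetical, the percentile uses the nearest-rank method as an assumption, and the min/max chunk size constraints are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of similarity-based chunk boundaries; not the PR's code. */
final class SemanticChunkSketch {
    private SemanticChunkSketch() {}

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            na += (double) a[i] * a[i];
            nb += (double) b[i] * b[i];
        }
        if (na == 0 || nb == 0) {
            return 0; // treat zero vectors as dissimilar
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Similarity between sentence i and sentence i + 1, for each i. */
    static double[] consecutiveSimilarities(float[][] embeddings) {
        double[] sims = new double[Math.max(0, embeddings.length - 1)];
        for (int i = 0; i < sims.length; i++) {
            sims[i] = cosine(embeddings[i], embeddings[i + 1]);
        }
        return sims;
    }

    /** Nearest-rank percentile (p in 0..100); one way to derive a threshold. */
    static double percentile(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** Chunks as {startSentence, endSentenceExclusive} index pairs. */
    static List<int[]> split(float[][] embeddings, double threshold) {
        List<int[]> chunks = new ArrayList<>();
        if (embeddings.length == 0) {
            return chunks;
        }
        double[] sims = consecutiveSimilarities(embeddings);
        int start = 0;
        for (int i = 0; i < sims.length; i++) {
            if (sims[i] < threshold) { // similarity drop marks a boundary
                chunks.add(new int[] {start, i + 1});
                start = i + 1;
            }
        }
        chunks.add(new int[] {start, embeddings.length});
        return chunks;
    }
}
```

A fixed threshold can be passed to `split` directly, while a percentile threshold would first be computed from `consecutiveSimilarities` with `percentile`.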
   
   ## Testing
   - 41 unit tests covering providers, factory, chunkers, offset mapping and analyzer policies,
     green on both CPU and GPU build flavors
   - End-to-end verified against the public all-MiniLM-L6-v2 ONNX export: server embeddings are
     bit-identical to the corrected SentenceVectorsDL output (see OPENNLP-1836)
   
   ## Known follow-ups
   - Classpath model discovery (opennlp-models-*) breaks inside the shaded jar because each
     model jar ships a root-level model.properties; the shaded server currently needs explicit
     model.sentence_detector.path / model.tokenizer.path config
   - OpenVINO and remote/composite providers are future work behind the same provider interface

