krickert opened a new pull request, #493: URL: https://github.com/apache/opennlp-sandbox/pull/493
See https://issues.apache.org/jira/browse/OPENNLP-1833

Draft for early feedback on the gRPC expansion. This branch restructures the opennlp-grpc sandbox module around a v1 document-centric API and builds it out into a working analysis server.

## API / contract

- AnalyzeDocument RPC over a v1 proto contract (opennlp_document, opennlp_pipeline, opennlp_service)
- Explicit offset encoding selection (UTF-8 byte / UTF-16 / code point) with server-side mapping
- AnalysisProfile / AnalysisOptions with strict validation: unsupported steps, backends and models are rejected with meaningful gRPC status codes instead of being silently ignored
- Model bundle catalog (ListModelBundles) reflecting actually loaded components

## Embeddings

- EmbeddingProvider abstraction with ONNX Runtime CPU and CUDA implementations behind a strict factory (model.embedder.backend=onnx|cuda)
- Models declared per id in server config, loaded eagerly at startup, dimension read from ONNX session metadata
- Standard single-segment BERT encoding (mask=1, types=0), OOV-to-UNK mapping, truncation at 512 wordpieces, deterministic native resource management
- GPU build via -Dgpu swaps onnxruntime for onnxruntime_gpu so the CPU and CUDA runtimes never coexist on the classpath

## Chunking (RAG-style segmentation)

- sentence, token-window and semantic algorithms via chunk_embed_configs and PIPELINE_STEP_CHUNK
- Per-chunk embeddings with multiple models per group, plus group statistics
- Semantic chunking on consecutive-sentence cosine similarity with percentile or fixed thresholds and min/max chunk size constraints

## Testing

- 41 unit tests covering providers, factory, chunkers, offset mapping and analyzer policies, green on both CPU and GPU build flavors
- End-to-end verified against the public all-MiniLM-L6-v2 ONNX export: server embeddings are bit-identical to the corrected SentenceVectorsDL output (see OPENNLP-1836)

## Known follow-ups

- Classpath model discovery (opennlp-models-*) breaks inside the shaded jar because each model jar ships a root-level model.properties; the shaded server currently needs explicit model.sentence_detector.path / model.tokenizer.path config
- OpenVINO and remote/composite providers are future work behind the same provider interface

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
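For readers following the Chunking section, here is a minimal sketch of what semantic chunking on consecutive-sentence cosine similarity with a percentile threshold and min/max chunk sizes could look like. It is not the code in this PR: the class name `SemanticChunkSketch`, the method names, the nearest-rank percentile and the rule "break where similarity is at or below the threshold" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and break policy are assumptions, not the PR's code.
class SemanticChunkSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Nearest-rank percentile (p in 0..100) of a non-empty array. */
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))];
    }

    /**
     * Groups sentences into chunks, returned as {startInclusive, endExclusive}
     * sentence-index pairs. A chunk is closed when it reaches maxSize, or when
     * the similarity to the next sentence is at or below the percentile
     * threshold and the chunk already holds at least minSize sentences.
     */
    static List<int[]> chunk(float[][] emb, double pct, int minSize, int maxSize) {
        List<int[]> chunks = new ArrayList<>();
        int n = emb.length;
        if (n == 0) {
            return chunks;
        }
        double[] sims = new double[n - 1];
        for (int i = 0; i < n - 1; i++) {
            sims[i] = cosine(emb[i], emb[i + 1]);
        }
        double threshold = n > 1 ? percentile(sims, pct) : 0.0;
        int start = 0;
        for (int i = 1; i < n; i++) {
            int size = i - start;
            boolean lowSimilarity = sims[i - 1] <= threshold;
            if (size >= maxSize || (lowSimilarity && size >= minSize)) {
                chunks.add(new int[] {start, i});
                start = i;
            }
        }
        chunks.add(new int[] {start, n});
        return chunks;
    }

    public static void main(String[] args) {
        // Toy 2-d "embeddings": two sentences on one topic, two on another.
        float[][] emb = {{1f, 0f}, {1f, 0f}, {0f, 1f}, {0f, 1f}};
        for (int[] c : chunk(emb, 25.0, 1, 10)) {
            System.out.println("[" + c[0] + ", " + c[1] + ")");
        }
    }
}
```

A fixed threshold would replace the `percentile` call with a constant. The percentile variant adapts the cut-off to each document's own similarity distribution.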

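The offset encoding bullet under API / contract (UTF-8 byte / UTF-16 / code point) can be illustrated with a small conversion helper. This is a sketch under stated assumptions, not the server's mapping code: the class `OffsetMappingSketch` and its method names are hypothetical, the text is assumed to be well-formed UTF-16 (no unpaired surrogates), and offsets are assumed to fall on code point boundaries.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the mapping implementation from the PR.
class OffsetMappingSketch {

    /** Converts a UTF-16 code unit offset to a code point offset. */
    static int utf16ToCodePoint(String text, int utf16Offset) {
        return text.codePointCount(0, utf16Offset);
    }

    /** Converts a code point offset back to a UTF-16 code unit offset. */
    static int codePointToUtf16(String text, int codePointOffset) {
        return text.offsetByCodePoints(0, codePointOffset);
    }

    /** Converts a UTF-16 code unit offset to a UTF-8 byte offset. */
    static int utf16ToUtf8(String text, int utf16Offset) {
        int bytes = 0;
        int i = 0;
        while (i < utf16Offset) {
            int cp = text.codePointAt(i);
            // UTF-8 length of a code point: 1, 2, 3 or 4 bytes.
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        // 'a', e-acute (2 UTF-8 bytes), an emoji (surrogate pair, 4 UTF-8 bytes), 'b'
        String text = "a\u00e9\uD83D\uDE00b";
        int utf16 = text.indexOf('b');
        System.out.println("utf16=" + utf16
                + " codePoint=" + utf16ToCodePoint(text, utf16)
                + " utf8=" + utf16ToUtf8(text, utf16));
        System.out.println("total utf8 bytes=" + text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Java strings are UTF-16 internally, so a server written in Java would most likely compute spans in UTF-16 units and convert at the API boundary for clients that index by bytes or code points.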