Great! I'll get done what I can tomorrow. I've already done a couple of implementations of this on other projects, but the goal here is only to get a model going that we can mark as experimental.
I've taken all the feedback and addressed it. I'll commit an initial design and shape, then tomorrow I'll work on a simple impl.

Kristian

On Sat, Jun 6, 2026 at 1:16 PM Richard Zowalla wrote:

> Hi all,
>
> +1 from me on this direction. A document-centric, language-neutral
> contract is a clear step up from chaining the three string-based sandbox
> RPCs, and Martin's suggestion of a neutral opennlp-core Document interface
> (kept separate from the protobuf wire type) seems like the right way to
> avoid letting gRPC details leak into the core API.
>
> On the build question (3): strong +1 for staying with Maven. The whole
> project is Maven, protobuf-maven-plugin covers proto generation cleanly,
> and introducing Gradle just for this module would add a second build system
> to maintain for no real benefit.
>
> A few gaps I noticed in the proposal that might be worth resolving before
> the proto gets locked in Phase 2:
>
> 1. Embeddings aren't actually in the proto. Kristian's email leads with
> GPU embeddings (CUDA/OpenVINO hot-swap) as a primary goal, but the sketched
> proto has no embedding message and no embedding PipelineStep, and
> SentenceVectorsDL isn't represented. Either embeddings are v1, in which case
> OpenNlpDocument needs a vector field and PipelineStep needs an entry, or
> they're deferred and the non-goals should say so. Right now it reads as a
> primary goal that the contract doesn't cover.
>
> 2. Chunking is listed as a gap but not added. The "Gap" section calls out
> NER/chunking/embeddings as missing, but the proto only adds NER. Chunking
> is a standard OpenNLP tool, so it's worth an explicit v1-or-deferred
> decision.
>
> 3. ModelBundleRef is underspecified. A bare bundle_id gives clients no way
> to discover which bundles/profiles exist or what languages and steps they
> support. GetServiceInfo returns profile IDs but no bundle metadata.
> IMHO, it might be worth having it enumerate bundles with their supported
> steps/languages.
>
> 4. Partial-failure semantics are undefined. ProcessingDiagnostic exists,
> but it's not stated whether a failed step fails the whole AnalyzeDocument
> call or returns a partial document with an ERROR diagnostic. That affects
> every client, so worth nailing down early.
>
> 5. clear_adaptive_data vs. the stateless contract. That option implies
> adaptive state carried across calls, but the contract is described as 1:1
> stateless documents. Worth clarifying what adaptive data means in this
> model.
>
> None of these block starting in the sandbox - they're Phase 1/2
> proto-shape questions. Overall the direction has a lot of potential.
>
> Regards
> Richard
>
> > On 05.06.2026 at 15:13, Martin Wiesner wrote:
> >
> > Hi Kristian,
> >
> > thx for the initiative which I'd like to support hereby. I've been 'off
> > in nature' for some days recently and thus my answer is delayed.
> >
> > A document centric approach is well-motivated in the Jira. For reasons
> > of simplicity (and neutrality) we could add an opennlp-core api interface
> > 'Document'.
> > This would allow us (a) to model what a document is composed of, and (b)
> > for other components to (re-)implement it by related requirements / ideas,
> > such as outlined in OPENNLP-1833 by you ("OpenNLPDocument",
> > "AnalyzeDocument").
> >
> > If you want a core-api addition, say for 'Document' or the like, keep in
> > mind we can integrate it with the next 3.0.0-M4.
> > If this is not required / necessary in the first place: that is also
> > fine - we can refactor / extract later on.
> > Currently, as it stands, we're planning to cut a release at the end of
> > June or early July. If you want to start things by
> >
> > Working first, in the opennlp-sandbox and evolving the current state
> > seems reasonable, target being the core project in future cycles.
> >
> > Proposed package naming is fine from my pov, cf. JIRA issue.
> >
> > My views on your questions in the JIRA description:
> >
> > ad (1): go for retain in legacy pkg
> > ad (2): can imagine both paths, more likely is 3.1.x - as it feels 3.0.x
> > is at the door soon (over or at the end of the summer 2026).
> > ad (3): stay with Maven (plz) if this is possible. Personally (!), not a
> > big fan of Gradle… - personally speaking here, no strong opinion
> >
> > Happy about others' comments.
> >
> > Thanks for the ideas and precise outline of 'em. The direction has a lot
> > of potential.
> >
> > Best
> > Martin | mawiesne
> >
> >
> >> On 22.05.2026 at 12:27, Kristian Rickert wrote:
> >>
> >> Hi OpenNLP devs,
> >>
> >> I've opened OPENNLP-1833 to propose evolving the opennlp-sandbox gRPC
> >> POC into ASF-native modules with a canonical OpenNlpDocument message and
> >> a primary AnalyzeDocument RPC (org.apache.opennlp.grpc.v1).
> >>
> >> JIRA: https://issues.apache.org/jira/browse/OPENNLP-1833
> >>
> >> Background: OpenNLP today is primarily in-process (API, CLI, UIMA).
> >> The sandbox POC (opennlp-grpc) exposes three separate string-based
> >> services; the ticket proposes a unified document contract and
> >> server-side pipeline orchestration.
> >>
> >> My primary goal is to integrate other language libraries through a gRPC
> >> contract. This will allow the server to work with OpenNLP. OpenNLP can
> >> use the client stubs to get data from the server, and the server would
> >> also use OpenNLP to expose the API to other languages.
> >>
> >> To be more specific: I'd like to introduce options that also utilize the
> >> GPU more directly for embeddings: CUDA for NVIDIA cards and OpenVINO for
> >> Intel cards. This would create a middle interface that can hot-swap on
> >> the server side. Of course, these interfaces would also be their own
> >> builds.
> >>
> >> I'm planning to work on this in phases as outlined in the ticket:
> >>
> >> - Phase 0/1: community RFC + design doc / full .proto definitions
> >> - Phase 2+: implementation (will work on this while we discuss phase 1,
> >>   but open for changes)
> >>
> >> I'd appreciate feedback on a few points called out in the JIRA ticket.
> >>
> >> I can get a prototype up within a couple of weeks.
> >>
> >> Sandbox reference:
> >>
> >> https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion
> >>
> >> I'll post design updates and any draft .proto / docs to the ticket.
> >> Comments on the JIRA or replies to this thread are welcome, although
> >> JIRA is preferred.
> >>
> >> Thanks,
> >> Kristian
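
To make Richard's point 4 (partial-failure semantics) concrete, here is a rough plain-Java sketch of the two options on the table: fail the whole call, or return a partial document carrying an ERROR diagnostic. Nothing in it comes from the draft proto or from OpenNLP itself. `PartialFailureSketch`, `Step`, `Diagnostic`, `AnalyzedDocument`, `Severity` and the `failFast` flag are hypothetical names used for illustration only, and the document type is a neutral stand-in in the spirit of Martin's core 'Document' idea rather than a protobuf wire type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PartialFailureSketch {

    /** Hypothetical severity levels; the thread only mentions an ERROR diagnostic. */
    enum Severity { INFO, ERROR }

    /** Hypothetical stand-in for ProcessingDiagnostic (not the real message shape). */
    record Diagnostic(String step, Severity severity, String message) {}

    /** Hypothetical neutral document: text, per-step annotations, diagnostics. */
    record AnalyzedDocument(String text,
                            Map<String, List<String>> annotations,
                            List<Diagnostic> diagnostics) {
        boolean hasErrors() {
            return diagnostics.stream().anyMatch(d -> d.severity() == Severity.ERROR);
        }
    }

    /** A named pipeline step mapping the raw text to a list of annotation strings. */
    record Step(String name, Function<String, List<String>> fn) {}

    /**
     * Runs the steps in order. With failFast=true the first failing step aborts
     * the whole call; with failFast=false the failure is recorded as an ERROR
     * diagnostic and the remaining steps still run (partial document).
     */
    public static AnalyzedDocument analyze(String text, List<Step> steps, boolean failFast) {
        Map<String, List<String>> annotations = new LinkedHashMap<>();
        List<Diagnostic> diagnostics = new ArrayList<>();
        for (Step step : steps) {
            try {
                annotations.put(step.name(), step.fn().apply(text));
            } catch (RuntimeException e) {
                if (failFast) {
                    throw new IllegalStateException("Step " + step.name() + " failed", e);
                }
                diagnostics.add(new Diagnostic(step.name(), Severity.ERROR,
                        String.valueOf(e.getMessage())));
            }
        }
        return new AnalyzedDocument(text, annotations, diagnostics);
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
                new Step("TOKENIZE", t -> List.of(t.split("\\s+"))),
                // Simulated failure, e.g. a model that could not be loaded.
                new Step("NER", t -> { throw new IllegalStateException("model not loaded"); }));

        AnalyzedDocument doc = analyze("OpenNLP speaks gRPC", steps, false);
        System.out.println("annotations: " + doc.annotations());
        System.out.println("diagnostics: " + doc.diagnostics());
        System.out.println("has errors:  " + doc.hasErrors());
    }
}
```

Whichever behaviour ends up being the default, the sketch suggests clients would need a cheap check along the lines of `hasErrors()` so that a partial document is not mistaken for a complete one.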
