Re: [RFC] OPENNLP-1833: Document-centric gRPC API — seeking feedback

Kristian Rickert Sat, 06 Jun 2026 13:56:56 -0700

Great!

I'll get done what I can tomorrow. I've already done a couple of
implementations of this on other projects, but for now the aim is only to
get a model going that we can mark as experimental.


I've taken all the feedback and addressed it - I'll commit an initial
design and shape.  Then tomorrow I'll work on a simple impl.

Kristian

On Sat, Jun 6, 2026 at 1:16 PM Richard Zowalla <[email protected]> wrote:

> Hi all,
>
> +1 from me on this direction. A document-centric, language-neutral
> contract is a clear step up from chaining the three string-based sandbox
> RPCs, and Martin's suggestion of a neutral opennlp-core Document interface
> (kept separate from the protobuf wire type) seems like the right way to
> avoid letting gRPC details leak into the core API.
>
> On the build question (3): strong +1 for staying with Maven. The whole
> project is Maven, protobuf-maven-plugin covers proto generation cleanly,
> and introducing Gradle just for this module would add a second build system
> to maintain for no real benefit.
>
> A few gaps I noticed in the proposal that might be worth resolving before
> the proto gets locked in Phase 2:
>
> 1. Embeddings aren't actually in the proto. Kristian's email leads with
> GPU embeddings (CUDA/OpenVINO hot-swap) as a primary goal, but the sketched
> proto has no embedding message and no embedding PipelineStep, and
> SentenceVectorsDL isn't represented. Either embeddings are v1, in which case
> OpenNlpDocument needs a vector field and PipelineStep needs an entry, or
> they're deferred and the non-goals should say so. Right now it reads as a
> primary goal that the contract doesn't cover.
>
> 2. Chunking is listed as a gap but not added. The "Gap" section calls out
> NER/chunking/embeddings as missing, but the proto only adds NER. Chunking
> is a standard OpenNLP tool, so it's worth an explicit v1-or-deferred
> decision.
>
> 3. ModelBundleRef is underspecified. A bare bundle_id gives clients no way
> to discover which bundles/profiles exist or what languages and steps they
> support. GetServiceInfo returns profile IDs but no bundle metadata. IMHO,
> it might be worth having it enumerate bundles with their supported
> steps/languages.
>
> 4. Partial-failure semantics are undefined. ProcessingDiagnostic exists,
> but it's not stated whether a failed step fails the whole AnalyzeDocument
> call or returns a partial document with an ERROR diagnostic. That affects
> every client, so worth nailing down early.
>
> 5. clear_adaptive_data vs. the stateless contract. That option implies
> adaptive state carried across calls, but the contract is described as 1:1
> stateless documents. Worth clarifying what adaptive data means in this
> model.
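
The two readings of point 4 can be sketched in Java. This is only an
illustration under stated assumptions: FailurePolicy, Result, Diagnostic and
analyze are hypothetical names that do not appear in the proposed proto, and
pipeline steps are modelled as plain functions rather than OpenNLP components.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: every name here is hypothetical, not part of the OPENNLP-1833 proto. */
public class PartialFailureSketch {

    /** The two candidate behaviours when a pipeline step fails. */
    enum FailurePolicy { FAIL_FAST, RETURN_PARTIAL }

    /** Stand-in for a ProcessingDiagnostic entry with ERROR severity. */
    static final class Diagnostic {
        final String step;
        final String message;

        Diagnostic(String step, String message) {
            this.step = step;
            this.message = message;
        }
    }

    /** Stand-in for the analyzed document: per-step output plus diagnostics. */
    static final class Result {
        final Map<String, String> annotations = new LinkedHashMap<>();
        final List<Diagnostic> diagnostics = new ArrayList<>();
    }

    /** Runs the steps in insertion order and applies the chosen failure policy. */
    static Result analyze(String text,
                          Map<String, Function<String, String>> steps,
                          FailurePolicy policy) {
        Result result = new Result();
        for (Map.Entry<String, Function<String, String>> step : steps.entrySet()) {
            try {
                result.annotations.put(step.getKey(), step.getValue().apply(text));
            } catch (RuntimeException e) {
                if (policy == FailurePolicy.FAIL_FAST) {
                    // Option A: one failed step fails the whole call.
                    throw new IllegalStateException("step failed: " + step.getKey(), e);
                }
                // Option B: keep going and record an error diagnostic for this step.
                result.diagnostics.add(
                        new Diagnostic(step.getKey(), String.valueOf(e.getMessage())));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Function<String, String>> steps = new LinkedHashMap<>();
        steps.put("tokenize", t -> String.join("|", t.split(" ")));
        steps.put("ner", t -> { throw new IllegalArgumentException("model not loaded"); });
        steps.put("sentence", t -> t.trim());

        Result partial = analyze("Hello OpenNLP", steps, FailurePolicy.RETURN_PARTIAL);
        System.out.println(partial.annotations.keySet()
                + " errors=" + partial.diagnostics.size());
    }
}
```

With RETURN_PARTIAL the caller gets the annotations of the steps that
succeeded and has to inspect the diagnostics; with FAIL_FAST the caller
gets no document at all. Whichever is chosen, clients need to know which
one to code against.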
>
> None of these block starting in the sandbox - they're Phase 1/2
> proto-shape questions. Overall the direction has a lot of potential.
>
> Regards
> Richard
>
> > On 05.06.2026 at 15:13, Martin Wiesner <[email protected]> wrote:
> >
> > Hi Kristian,
> >
> > Thanks for the initiative, which I'd like to support. I've been 'off in
> > nature' for a few days recently, hence my delayed answer.
> >
> > A document-centric approach is well motivated in the Jira. For reasons of
> > simplicity (and neutrality) we could add an opennlp-core API interface
> > 'Document'.
> > This would allow (a) us to model what a document is composed of, and (b)
> > other components to (re-)implement it according to related requirements or
> > ideas, such as those you outline in OPENNLP-1833 ("OpenNlpDocument",
> > "AnalyzeDocument").
> >
> > If you want a core API addition, say for 'Document' or the like, keep in
> > mind we can integrate it with the next 3.0.0-M4.
> > If this is not required in the first place, that is also fine - we can
> > refactor / extract later on.
> > As it stands, we're planning to cut a release at the end of June or in
> > early July. If you want to start things by
> >
> > Working first in the opennlp-sandbox and evolving the current state seems
> > reasonable, with the core project as the target in future cycles.
> >
> > Proposed package naming is fine from my pov, cf. JIRA issue.
> >
> > My views on your questions in the JIRA description:
> >
> > ad (1): go for retaining it in the legacy package
> > ad (2): I can imagine both paths; 3.1.x is more likely, as it feels like
> > 3.0.x is at the door (during or at the end of summer 2026).
> > ad (3): stay with Maven (please) if possible. Personally (!), I'm not a
> > big fan of Gradle… but that's a personal view, no strong opinion.
> >
> > Happy to hear others' comments.
> >
> > Thanks for the ideas and the precise outline of them. The direction has a
> > lot of potential.
> >
> > Best
> > Martin | mawiesne
> >
> >
> >> On 22.05.2026 at 12:27, Kristian Rickert <[email protected]> wrote:
> >>
> >> Hi OpenNLP devs,
> >>
> >> I've opened OPENNLP-1833 to propose evolving the opennlp-sandbox gRPC
> >> POC into ASF-native modules with a canonical OpenNlpDocument message and
> >> a primary AnalyzeDocument RPC (org.apache.opennlp.grpc.v1).
> >>
> >> JIRA: https://issues.apache.org/jira/browse/OPENNLP-1833
> >>
> >> Background: OpenNLP today is primarily in-process (API, CLI, UIMA).
> >> The sandbox POC (opennlp-grpc) exposes three separate string-based
> >> services; the ticket proposes a unified document contract and
> >> server-side pipeline orchestration.
> >>
> >> My primary goal is to integrate other language libraries through a gRPC
> >> contract.  This will allow the server to work with OpenNLP.  OpenNLP can
> >> use the client stubs to get data from the server, and the server would
> >> also use OpenNLP to expose the API to other languages.
> >>
> >> To be more specific: I'd like to introduce options that also utilize the
> >> GPU more directly for embeddings: CUDA for NVIDIA cards and OpenVINO for
> >> Intel cards.  This would create a middle interface that can hot-swap on
> >> the server side.  Of course, these interfaces would also be their own
> >> builds.
> >>
> >> I'm planning to work on this in phases as outlined in the ticket:
> >>
> >>  - Phase 0/1: community RFC + design doc / full .proto definitions
> >>  - Phase 2+: implementation (will work on this while we discuss phase 1,
> >>  but open for changes)
> >>
> >> I'd appreciate feedback on a few points called out in the JIRA ticket.
> >>
> >> I can get a prototype up within a couple of weeks.
> >>
> >> Sandbox reference:
> >>
> >> https://github.com/apache/opennlp-sandbox/tree/OPENNLP-1833-grpc-expansion
> >>
> >> I'll post design updates and any draft .proto / docs to the ticket.
> >> Comments on the JIRA or replies to this thread are welcome, although
> >> JIRA is preferred.
> >>
> >> Thanks,
> >> Kristian
> >
>
>
