[ https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118714#comment-15118714 ]
Karl Wright commented on CONNECTORS-1270:
-----------------------------------------
Hmm, looking at the code, there's also this problem:
{code}
byte[] bytes = IOUtils.toByteArray(document.getBinaryStream());
...
// reset original stream
docCopy.setBinary(new ByteArrayInputStream(bytes), bytes.length);
{code}
This is generally unacceptable for a production connector: the entire document
is loaded into memory here, and that won't work, because memory consumption
has to be bounded. Unfortunately, looking at the OpenNLP SentenceDetector API,
there isn't any support for streaming:
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html
About the only thing we can reasonably do is a "rolling buffer" approach: page
in a chunk of the document (e.g. 64K), do sentence detection on it, then chuck
the first 3/4 and page in another 64K. Sentence detection then runs over the
retained tail of the previous chunk plus the new chunk, and any sentences at the
start of each subsequent chunk that overlap ones already detected get chucked.
I'm not worried about run-on sentences here.
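To make the rolling-buffer idea concrete, here is a minimal sketch of a simplified variant. Instead of discarding a fixed 3/4 of each chunk and de-duplicating overlapping sentences, it holds back only the last (possibly cut-off) sentence of each window and re-detects it once more text has been read. The class name, the SpanDetector interface, and the naive punctuation-based detector are all hypothetical stand-ins for illustration; in a real connector the detector would presumably wrap something like OpenNLP's sentPosDetect, which returns sentence spans for a string. A sentence longer than the buffer is simply flushed as-is.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: bounded-memory sentence detection over a character stream. */
final class RollingSentenceReader {

    /** Hypothetical detector abstraction: returns {start, end} offsets per sentence. */
    interface SpanDetector {
        List<int[]> detect(CharSequence text);
    }

    /** Naive demo detector (an assumption, not OpenNLP): ends a sentence at '.', '!' or '?'. */
    static final SpanDetector NAIVE = text -> {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                spans.add(new int[] {start, i + 1});
                start = i + 1;
            }
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()}); // trailing, possibly incomplete
        }
        return spans;
    };

    /** Reads at most bufferSize chars at a time and passes each detected sentence to sink. */
    static void process(Reader in, int bufferSize, SpanDetector detector, Consumer<String> sink)
            throws IOException {
        char[] buf = new char[bufferSize];
        int filled = 0;
        boolean eof = false;
        while (!eof || filled > 0) {
            // Top up the buffer until it is full or the stream ends.
            while (!eof && filled < buf.length) {
                int n = in.read(buf, filled, buf.length - filled);
                if (n < 0) eof = true; else filled += n;
            }
            if (filled == 0) break;
            String window = new String(buf, 0, filled);
            List<int[]> spans = detector.detect(window);
            // Before EOF, hold back the last span: it may be cut off by the buffer edge.
            int emitCount = eof ? spans.size() : spans.size() - 1;
            int consumed = 0;
            for (int i = 0; i < emitCount; i++) {
                int[] s = spans.get(i);
                emit(window.substring(s[0], s[1]), sink);
                consumed = s[1];
            }
            if (eof) {
                consumed = filled;           // nothing more will arrive
            } else if (consumed == 0) {
                emit(window, sink);          // run-on sentence: flush the whole window
                consumed = filled;
            }
            // Slide the unconsumed tail to the front of the buffer.
            System.arraycopy(buf, consumed, buf, 0, filled - consumed);
            filled -= consumed;
        }
    }

    private static void emit(String s, Consumer<String> sink) {
        String t = s.trim();
        if (!t.isEmpty()) sink.accept(t);
    }

    /** Convenience wrapper for trying the sketch on an in-memory string. */
    static List<String> sentencesOf(String text, int bufferSize) {
        List<String> out = new ArrayList<>();
        try {
            process(new StringReader(text), bufferSize, NAIVE, out::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling RollingSentenceReader.sentencesOf("One. Two is longer. Three!", 16) walks the text in 16-character windows and never holds more than one window in memory. The trade-off against the fixed 3/4 discard is that every window boundary costs one re-detection of the held-back sentence, but no overlap bookkeeping is needed.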
As for the need to replay the content stream, the best way to do that is to
create a duplicate of the original RepositoryDocument object using the standard
MCF support for that, add the new metadata, and close it off after it has been
handed downstream.
Both proposed sets of changes seem critical to me, and they're not terribly
easy either. I will tackle the actual flow change (so documents are never
loaded completely into memory), but there is still a lot to do even when
that's done.
> Import OpenNLP connector into trunk
> -----------------------------------
>
> Key: CONNECTORS-1270
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
> Project: ManifoldCF
> Issue Type: Task
> Reporter: Karl Wright
> Assignee: Rafa Haro
> Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on github. Need to import it into
> MCF, first to a branch, then to trunk.