[ 
https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118714#comment-15118714
 ] 

Karl Wright commented on CONNECTORS-1270:
-----------------------------------------

Hmm, looking at the code, there's also this problem:

{code}
    byte[] bytes = IOUtils.toByteArray(document.getBinaryStream());
...
    // reset original stream
    docCopy.setBinary(new ByteArrayInputStream(bytes), bytes.length);
{code}

This is generally unacceptable for a production connector: it loads the entire 
document into memory, and a connector's memory consumption has to be bounded.  
Unfortunately, looking at the OpenNLP SentenceDetector API, there isn't any 
support for streaming:

https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html

About the only thing we can reasonably do is a "rolling buffer" approach: page 
in a chunk of the document (e.g. 64K), run sentence detection on it, then 
discard the first 3/4 and page in another 64K, so that detection runs across 
the retained tail of the previous chunk plus the new chunk.  Any sentences at 
the start of a subsequent pass that were already detected in the previous pass 
get discarded.  I'm not worried about run-on sentences here.
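
To make the idea concrete, here is a rough sketch of a rolling buffer, not the 
connector's actual code.  It is a simplified variant: rather than discarding a 
fixed 3/4 and deduplicating overlapping sentences, it carries the unfinished 
tail of each window into the next read, and force-flushes the tail once it 
reaches the chunk size so memory stays bounded (run-ons just get split).  The 
class name, the BoundaryDetector interface, and the NAIVE detector are all 
hypothetical stand-ins; a real version would call the OpenNLP detector on the 
window text instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in MCF or OpenNLP.
final class RollingSentenceSplitter {

    /** Stand-in for a real detector: exclusive end offset of each sentence found in text. */
    interface BoundaryDetector {
        List<Integer> sentenceEnds(CharSequence text);
    }

    /** Naive detector for illustration only: every '.', '!' or '?' ends a sentence. */
    static final BoundaryDetector NAIVE = text -> {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                ends.add(i + 1);
            }
        }
        return ends;
    };

    /** Splits the stream into sentences while holding fewer than 2 * chunkSize chars at once. */
    static List<String> split(Reader in, int chunkSize, BoundaryDetector detector) {
        List<String> sentences = new ArrayList<>();
        StringBuilder window = new StringBuilder();
        char[] chunk = new char[chunkSize];
        boolean eof = false;
        try {
            while (!eof) {
                // Page in the next chunk and append it to the carried-over tail.
                int n = in.read(chunk, 0, chunkSize);
                if (n < 0) {
                    eof = true;
                } else {
                    window.append(chunk, 0, n);
                }
                // Emit every sentence that is complete within the current window.
                int consumed = 0;
                for (int end : detector.sentenceEnds(window)) {
                    addIfNotBlank(sentences, window.substring(consumed, end));
                    consumed = end;
                }
                window.delete(0, consumed);
                // At end of input, or when the tail is a run-on, flush it as-is.
                if (eof || window.length() >= chunkSize) {
                    addIfNotBlank(sentences, window.toString());
                    window.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sentences;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }
}
```

For example, RollingSentenceSplitter.split(new StringReader("One. Two. Three."), 
8, RollingSentenceSplitter.NAIVE) reads the text in 8-character chunks and never 
holds the whole input at once.  With a real detector the last span in each 
window may be a truncated sentence, which is why the tail is held back until 
more input (or end of stream) arrives.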

As for the need to replay the content stream, the best way to do that is to 
create a duplicate of the original RepositoryDocument object using the standard 
MCF support for that, add the new metadata, and close it off after it has been 
handed downstream.

Both proposed sets of changes seem critical to me, and neither is terribly 
easy.  I will tackle the actual flow change (so documents are never held 
completely in memory), but there is still a lot to do even when that's done.


> Import OpenNLP connector into trunk
> -----------------------------------
>
>                 Key: CONNECTORS-1270
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
>             Project: ManifoldCF
>          Issue Type: Task
>            Reporter: Karl Wright
>            Assignee: Rafa Haro
>             Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on github.  Need to import it into 
> MCF, first to a branch, then to trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
