[ 
https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119102#comment-15119102
 ] 

Karl Wright commented on CONNECTORS-1270:
-----------------------------------------

[~rafaharo]: It sounds ridiculous to worry about large documents until you 
realize that some people ingest millions of documents and would be hard 
pressed to find the few among them that contain hundreds of megabytes of 
text.  We really can't assume that it is safe to load a document of that 
size into memory.

The approach I took with the Tika Extractor was to load the document into 
memory if it was smaller than a certain size, and to write it to disk 
otherwise.  I've used the same strategy here.
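
For illustration only, the memory-or-disk strategy could be sketched roughly 
as below.  This is not the Tika Extractor code; the class name SpillBuffer, 
the maxInMemoryBytes parameter and the temp-file naming are all hypothetical, 
and only standard java.io / java.nio.file calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: keeps a document in memory if small, else on disk. */
final class SpillBuffer implements AutoCloseable {
  private final byte[] memory; // non-null when the document fit in memory
  private final Path file;     // non-null when the document was spilled to disk
  private final long size;     // total number of bytes read

  private SpillBuffer(byte[] memory, Path file, long size) {
    this.memory = memory;
    this.file = file;
    this.size = size;
  }

  /** Reads the whole stream, spilling to a temp file once the limit is exceeded. */
  static SpillBuffer read(InputStream in, int maxInMemoryBytes) throws IOException {
    ByteArrayOutputStream head = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
      if (head.size() + n > maxInMemoryBytes) {
        // Limit exceeded: move what we have to disk and stream the rest there.
        Path tmp = Files.createTempFile("doc", ".tmp");
        long total = head.size();
        try (OutputStream out = Files.newOutputStream(tmp)) {
          head.writeTo(out);
          do {
            out.write(chunk, 0, n);
            total += n;
          } while ((n = in.read(chunk)) != -1);
        }
        return new SpillBuffer(null, tmp, total);
      }
      head.write(chunk, 0, n);
    }
    return new SpillBuffer(head.toByteArray(), null, head.size());
  }

  boolean isInMemory() { return memory != null; }

  long size() { return size; }

  /** Opens a fresh stream over the buffered content, wherever it lives. */
  InputStream open() throws IOException {
    return memory != null ? new ByteArrayInputStream(memory) : Files.newInputStream(file);
  }

  /** Deletes the temp file, if one was created. */
  @Override
  public void close() throws IOException {
    if (file != null) {
      Files.deleteIfExists(file);
    }
  }
}
```

With this shape, heap use for buffering is bounded by the limit plus one 
chunk, no matter how large the incoming document is.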

As for the NLP processing, we have a choice: either (1) process only the first 
N characters, or (2) process using a rolling buffer, and try to algorithmically 
remove the sentence fragments that the buffer boundaries create.  Right now, 
I'm leaning towards (1).  I should have that done by the end of the day.
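
As a rough sketch of option (1), assuming the document text is available as 
a java.io.Reader: read at most N characters and hand only that prefix to the 
NLP step.  The names PrefixReader, readPrefix and maxChars are hypothetical 
and not taken from the connector.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper for option (1): keep only the first N characters. */
final class PrefixReader {
  /**
   * Reads up to maxChars characters from the reader and returns them.
   * Anything beyond the limit is left unread. The cut is by char count,
   * so it may fall mid-sentence (or between a surrogate pair).
   */
  static String readPrefix(Reader reader, int maxChars) throws IOException {
    char[] buf = new char[maxChars];
    int filled = 0;
    while (filled < maxChars) {
      int n = reader.read(buf, filled, maxChars - filled);
      if (n == -1) {
        break; // document is shorter than the limit
      }
      filled += n;
    }
    return new String(buf, 0, filled);
  }
}
```

This keeps memory bounded by N, at the cost of ignoring everything after the 
cutoff and possibly ending on a sentence fragment.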



> Import OpenNLP connector into trunk
> -----------------------------------
>
>                 Key: CONNECTORS-1270
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
>             Project: ManifoldCF
>          Issue Type: Task
>            Reporter: Karl Wright
>            Assignee: Rafa Haro
>             Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on github.  Need to import it into 
> MCF, first to a branch, then to trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
