[
https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119102#comment-15119102
]
Karl Wright commented on CONNECTORS-1270:
-----------------------------------------
[~rafaharo]: It sounds ridiculous to worry about large documents until you
realize that some people ingest millions of documents and would be hard
pressed to find the few among them that contain hundreds of megabytes of
text. We really can't assume that it is safe to load a document that large
into memory.
The approach I took with the Tika Extractor was to load the document into
memory if it was less than a certain size; otherwise it goes to disk.
I've used the same strategy here.
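A minimal sketch of that strategy might look like the following. The class name, the threshold value, and the constructor signature are all illustrative assumptions, not the actual ManifoldCF API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: documents below a size threshold are held in a
// byte array; anything larger is spilled to a temporary file on disk.
public class SpillBuffer {
    // Threshold is an assumed value for illustration only.
    static final int IN_MEMORY_LIMIT = 65536;

    private byte[] inMemory; // non-null when the document fits in memory
    private Path onDisk;     // non-null when the document was spilled

    public SpillBuffer(InputStream source, long declaredLength) throws IOException {
        if (declaredLength >= 0 && declaredLength <= IN_MEMORY_LIMIT) {
            inMemory = source.readAllBytes();
        } else {
            onDisk = Files.createTempFile("doc-", ".tmp");
            Files.copy(source, onDisk, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public boolean isInMemory() {
        return inMemory != null;
    }

    // Callers always read through a stream, so downstream code does not
    // care whether the bytes came from memory or disk.
    public InputStream openStream() throws IOException {
        return isInMemory() ? new ByteArrayInputStream(inMemory)
                            : Files.newInputStream(onDisk);
    }

    public void close() throws IOException {
        if (onDisk != null) {
            Files.deleteIfExists(onDisk);
        }
    }
}
```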
As for the NLP processing, we have a choice: either (1) process only the first
N characters, or (2) process using a rolling buffer, and try to algorithmically
remove any sentence fragments introduced at the buffer boundaries.
Right now, I'm leaning towards (1). I should have that done by the end of the
day.
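Option (1) could be sketched roughly as below: take the first N characters, then back up to the last likely sentence boundary so the NLP stage is not handed a trailing fragment. The method name and the boundary heuristic are assumptions for illustration:

```java
// Hypothetical sketch of option (1): truncate the text to at most
// maxChars characters, trimming back to the last sentence-ending
// punctuation so no partial sentence reaches the NLP models.
public class TextWindow {
    public static String firstNChars(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        String head = text.substring(0, maxChars);
        // Find the last likely sentence terminator in the window.
        int cut = Math.max(head.lastIndexOf('.'),
                  Math.max(head.lastIndexOf('!'), head.lastIndexOf('?')));
        // Keep up to and including the terminator; if none was found,
        // fall back to the raw truncation.
        return cut >= 0 ? head.substring(0, cut + 1) : head;
    }
}
```

Option (2) would instead slide such a window across the whole document, which is why it has to detect and discard the fragments that each window boundary creates.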
> Import OpenNLP connector into trunk
> -----------------------------------
>
> Key: CONNECTORS-1270
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
> Project: ManifoldCF
> Issue Type: Task
> Reporter: Karl Wright
> Assignee: Rafa Haro
> Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on github. Need to import it into
> MCF, first to a branch, then to trunk.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)