Re: Technical question on repo connector dev

2019-10-04 Thread Karl Wright
Hi Julien,

The checkDocumentNeedsReindexing() method is meant to be called inside
processDocuments() for the specific document you are checking.  So, if the
document identifier is a URI, you can convert that URI to a set of JSON
documents, but you will probably want to put the actual data for each
document in carrydown information.  You will also need to create some kind
of non-URI document ID.

Karl
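
The pattern described above can be sketched as follows. This is a minimal,
self-contained illustration and not ManifoldCF code: the Activities class is
an in-memory stand-in with simplified signatures, and the Item shape, the
"item:" ID prefix, and the example.invalid URI are all assumptions made for
the sketch. A seed URI only queues its children and hands their data over as
carrydown; a child is checked against its version and ingested straight from
the carrydown data, with no further fetch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: in-memory stand-ins, not the real ManifoldCF interfaces. */
final class CarrydownSketch {

  /** One JSON object from a seed's array (hypothetical shape). */
  static final class Item {
    final String id, version, body;
    Item(String id, String version, String body) {
      this.id = id; this.version = version; this.body = body;
    }
  }

  /** Simplified stand-in for the activities object given to processDocuments(). */
  static final class Activities {
    final List<String> queue = new ArrayList<>();                 // IDs waiting to be processed
    final Map<String, Item> carrydown = new HashMap<>();          // data passed from seed to child
    final Map<String, String> indexedVersions = new HashMap<>();  // last ingested version per ID
    final List<String> ingested = new ArrayList<>();              // what this crawl ingested

    void addDocumentReference(String childId, Item data) {
      carrydown.put(childId, data);
      queue.add(childId);
    }

    boolean checkDocumentNeedsReindexing(String id, String version) {
      return !version.equals(indexedVersions.get(id));
    }

    void ingest(String id, String version, String body) {
      indexedVersions.put(id, version);
      ingested.add(id + "@" + version);
    }
  }

  static final String CHILD_PREFIX = "item:";  // hypothetical non-URI ID scheme

  static void processDocument(String docId, Map<String, List<Item>> api, Activities act) {
    if (!docId.startsWith(CHILD_PREFIX)) {
      // Seed URI: read its JSON array and queue one child per object.
      for (Item item : api.getOrDefault(docId, Collections.<Item>emptyList())) {
        act.addDocumentReference(CHILD_PREFIX + item.id, item);
      }
      return;
    }
    // Child document: everything needed is already in the carrydown data.
    Item item = act.carrydown.get(docId);
    if (item != null && act.checkDocumentNeedsReindexing(docId, item.version)) {
      act.ingest(docId, item.version, item.body);
    }
  }

  /** Runs one crawl over the seeds and returns the documents ingested by it. */
  static List<String> crawl(List<String> seeds, Map<String, List<Item>> api, Activities act) {
    act.ingested.clear();
    act.queue.addAll(seeds);
    while (!act.queue.isEmpty()) {
      processDocument(act.queue.remove(0), api, act);
    }
    return new ArrayList<>(act.ingested);
  }

  public static void main(String[] args) {
    String seed = "https://example.invalid/feed/1";  // placeholder URI
    Map<String, List<Item>> api = new HashMap<>();
    api.put(seed, Arrays.asList(new Item("a", "1", "alpha"), new Item("b", "1", "beta")));
    Activities act = new Activities();
    System.out.println(crawl(Arrays.asList(seed), api, act));  // [item:a@1, item:b@1]
    api.put(seed, Arrays.asList(new Item("a", "2", "alpha v2"), new Item("b", "1", "beta")));
    System.out.println(crawl(Arrays.asList(seed), api, act));  // [item:a@2]
  }
}
```

In this sketch the child documents are still handed to the processing step,
but that step is cheap: it only compares versions and ingests from data it
already holds.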


On Fri, Oct 4, 2019 at 1:36 PM  wrote:

> Hi,
>
>
>
> I am facing a simple technical case that I am not sure how to deal with,
> concerning the development of a repository connector.
>
>
>
> I want to develop a repo connector using the ADD_CHANGE_DELETE model that
> will normally add seed documents, and each seed document will produce
> several documents.
> The problem is that each produced document from a seed doc is instantly
> ingest-able and does not need to be processed.
>
>
>
> The use case here is that the addSeedDocuments method will call an API that
> will provide several URIs (seeds).
>
> In the processDocuments method, each URI provides a JSON array containing
> JSON objects and those JSON objects are meant to become repository
> documents and ingested.
> So the logic would be to use the activities.addDocumentReference for each
> JSON object before I can use the activities.checkDocumentNeedsReindexing
> (each JSON object has an id and a version field) and then ingest the
> document. But by doing this, I am afraid that the processDocuments method
> will be called with those newly referenced docs while they do not need to
> be processed.
>
>
>
> Any suggestion about how to deal with this use case is welcome.
>
>
>
> Thanks,
> Julien
>
>


Technical question on repo connector dev

2019-10-04 Thread julien.massiera
Hi, 

 

I am facing a simple technical case that I am not sure how to deal with,
concerning the development of a repository connector. 

 

I want to develop a repo connector using the ADD_CHANGE_DELETE model that
will normally add seed documents, and each seed document will produce
several documents. 
The problem is that each produced document from a seed doc is instantly
ingest-able and does not need to be processed.

 

The use case here is that the addSeedDocuments method will call an API that
will provide several URIs (seeds).

In the processDocuments method, each URI provides a JSON array containing
JSON objects and those JSON objects are meant to become repository documents
and ingested. 
So the logic would be to use the activities.addDocumentReference for each
JSON object before I can use the activities.checkDocumentNeedsReindexing
(each JSON object has an id and a version field) and then ingest the
document. But by doing this, I am afraid that the processDocuments method
will be called with those newly referenced docs while they do not need to be
processed.

 

Any suggestion about how to deal with this use case is welcome. 

 

Thanks,
Julien



[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Donald Van den Driessche (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944414#comment-16944414
 ] 

Donald Van den Driessche commented on CONNECTORS-1625:
--

We are processing this PDF as the one and only document.

It's ManifoldCF 2.12. We tried parsing the file locally with Tika 1.18 
and 1.22, and both succeeded.

We've set the heap space to 3G and then 5G, and we still get the same error.

I've read somewhere that disk space might be used during processing. But since 
the file is only 21 MB, I don't see how much disk space it could need.

 

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
>                 Key: CONNECTORS-1625
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: abd-serotec-antibodies-uk.pdf
>
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an 
> out-of-memory error.
> When just parsing it through Tika 1.18, no issues are found.
> Can anyone look into it?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944386#comment-16944386
 ] 

Karl Wright commented on CONNECTORS-1625:
-

Also, FWIW, the default Java memory sizes on the example are not guaranteed to 
allow processing of N simultaneous Tika extractions (one per worker thread) of 
the sort that require more memory.  Memory sizes allocated to the JVM are 
settable in the start-options files, and the first thing you want to do is 
increase those values to see if the problem goes away for you.
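
One way to confirm that a changed heap setting actually took effect, and to
get a feel for how thin the heap is spread across worker threads, is a small
check like the one below. It is only a sketch: the class name HeapBudget and
the default of 30 worker threads are assumptions for illustration, and the
per-thread figure is a naive even split that ignores everything else living
on the heap. Run it with the same memory options as the ManifoldCF process.

```java
/** Sketch: report the JVM heap ceiling and a naive per-worker-thread share. */
final class HeapBudget {
  private static final long MIB = 1024L * 1024L;

  /** Even split of the heap across worker threads, in whole MiB. */
  static long perThreadMiB(long maxHeapBytes, int workerThreads) {
    if (workerThreads <= 0) {
      throw new IllegalArgumentException("workerThreads must be positive");
    }
    return (maxHeapBytes / MIB) / workerThreads;
  }

  public static void main(String[] args) {
    // Assumed thread count; pass the value configured for your deployment.
    int workerThreads = args.length > 0 ? Integer.parseInt(args[0]) : 30;
    // maxMemory() reflects the -Xmx setting the JVM was started with.
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.println("Max heap (MiB): " + maxHeap / MIB);
    System.out.println("Naive per-thread share (MiB): " + perThreadMiB(maxHeap, workerThreads));
  }
}
```

For example, a 3 GiB heap split evenly over 30 threads leaves about 102 MiB
per thread, which a single memory-hungry extraction can exceed.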




[jira] [Comment Edited] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944385#comment-16944385
 ] 

Karl Wright edited comment on CONNECTORS-1625 at 10/4/19 10:29 AM:
---

What version of Manifold is this?  2.12 is pretty old by Tika standards.  We 
pretty much upgrade Tika continuously at this point and if it's not the current 
version you are running old Tika code.




was (Author: kwri...@metacarta.com):
What version of Manifold is this?
We pretty much upgrade Tika continuously at this point and if it's not the 
current version you are running old Tika code.





[jira] [Assigned] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory

2019-10-04 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1625:
---

Assignee: Karl Wright
