Re: Technical question on repo connector dev
Hi Julien,

The checkDocumentNeedsReindexing() method is meant to be called from inside processDocuments() for the specific document you are checking. You can convert your URI into a set of JSON documents if the document identifier is a URI, but you will probably want to put the actual data for each document into carrydown information. You will also need to create some kind of non-URI document ID for each of them.

Karl
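To make the suggestion concrete, here is a minimal, self-contained sketch of that scheme: each JSON object found under a seed URI gets a synthetic non-URI identifier, and a version comparison (mirroring what checkDocumentNeedsReindexing does for you) decides which component documents actually need ingestion. All names here (SeedExpansionSketch, componentId, idsToIngest, Doc) are illustrative stand-ins, not part of the ManifoldCF API.

```java
import java.util.*;

/**
 * Sketch of the scheme described above: each JSON object behind a seed URI
 * becomes its own repository document with a synthetic, non-URI identifier,
 * and only objects whose version changed are re-ingested.
 */
public class SeedExpansionSketch {

  /** One JSON object from the seed's array: an id, a version, and its payload. */
  static final class Doc {
    final String id, version, payload;
    Doc(String id, String version, String payload) {
      this.id = id; this.version = version; this.payload = payload;
    }
  }

  /** Synthesize a non-URI document identifier from the seed URI and the JSON id. */
  static String componentId(String seedUri, String jsonId) {
    return seedUri + "::" + jsonId;
  }

  /** Version check analogous to checkDocumentNeedsReindexing:
      reindex when nothing is stored yet or the stored version differs. */
  static boolean needsReindexing(String storedVersion, String newVersion) {
    return storedVersion == null || !storedVersion.equals(newVersion);
  }

  /**
   * Given the JSON objects behind one seed and the versions already known
   * to the framework, return the component ids that need ingestion.
   */
  static List<String> idsToIngest(String seedUri, List<Doc> docs,
                                  Map<String, String> knownVersions) {
    List<String> out = new ArrayList<>();
    for (Doc d : docs) {
      String id = componentId(seedUri, d.id);
      if (needsReindexing(knownVersions.get(id), d.version))
        out.add(id);
    }
    return out;
  }
}
```

With IDs built this way, the seed URI and the per-object data can travel together as carrydown information, so the component documents never need a separate fetch pass of their own.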
Technical question on repo connector dev
Hi,

I am facing a simple technical case that I am not sure how to deal with, concerning the development of a repository connector.

I want to develop a repository connector using the ADD_CHANGE_DELETE model that will normally add seed documents, and each seed document will produce several documents. The problem is that each document produced from a seed is immediately ingestable and does not need to be processed.

The use case is this: the addSeedDocuments method calls an API that provides several URIs (the seeds). In the processDocuments method, each URI yields a JSON array of JSON objects, and those JSON objects are meant to become repository documents and be ingested. The logic would be to use activities.addDocumentReference for each JSON object, so that I can then use activities.checkDocumentNeedsReindexing (each JSON object has an id and a version field) and ingest the document. But by doing this, I am afraid that processDocuments will be called for those newly referenced documents even though they do not need to be processed.

Any suggestion about how to deal with this use case is welcome.

Thanks,
Julien
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944414#comment-16944414 ]

Donald Van den Driessche commented on CONNECTORS-1625:
-------------------------------------------------------

We are running this PDF as the one and only document. It's ManifoldCF 2.12.

We tried to parse it through Tika locally, with Tika 1.18 and 1.22, and both succeeded.

We've set the heap space to 3G and to 5G and still see the same issue. I've now read somewhere that disk space might be used, but since the file is only 21 MB, I don't see how much disk space could be involved.

> When processing a specific PDF Manifold goes out of memory
> ----------------------------------------------------------
>
>                 Key: CONNECTORS-1625
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1625
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.12
>            Reporter: Donald Van den Driessche
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: abd-serotec-antibodies-uk.pdf
>
> When processing the attached file with ManifoldCF 2.12, we keep getting an out-of-memory error. When just parsing it through Tika 1.18, no issues are found. Can anyone look into it?
> Thanks in advance!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944386#comment-16944386 ]

Karl Wright commented on CONNECTORS-1625:
------------------------------------------

Also, FWIW, the default Java memory sizes in the example are not guaranteed to allow N simultaneous Tika extractions (one per worker thread) of the sort that require more memory. The memory sizes allocated to the JVM are settable in the start-options files, and the first thing you want to do is increase those values and see whether the problem goes away for you.
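For the single-process example, those JVM options live in the start-options files read by the start scripts. The file names and values below reflect a stock 2.x layout and are assumptions about your install; the options go one per line, and -Xmx should be raised to cover your worker-thread count times the largest expected extraction:

```
# example/start-options.env.unix  (example/start-options.env.win on Windows)
# One JVM option per line; increase the heap ceiling for large Tika extractions.
-Xms1024m
-Xmx4096m
```

After editing the file, restart the example process so the new JVM options take effect.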
[jira] [Comment Edited] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944385#comment-16944385 ]

Karl Wright edited comment on CONNECTORS-1625 at 10/4/19 10:29 AM:
--------------------------------------------------------------------

What version of Manifold is this? 2.12 is pretty old by Tika standards. We upgrade Tika pretty much continuously at this point, and if you're not on the current version you are running old Tika code.

was (Author: kwri...@metacarta.com):
What version of Manifold is this? We upgrade Tika pretty much continuously at this point, and if you're not on the current version you are running old Tika code.
[jira] [Assigned] (CONNECTORS-1625) When processing a specific PDF Manifold goes out of memory
[ https://issues.apache.org/jira/browse/CONNECTORS-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1625:
----------------------------------------

    Assignee: Karl Wright