Re: sharepoint crawler documents limit

2019-12-20 Thread Karl Wright
The code seems correct and many people are using it without encountering this problem. There may be another SharePoint configuration parameter you also need to look at somewhere. Karl On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia wrote: > > Hi Karl, > On sharepoint the list view

Re: sharepoint crawler documents limit

2019-12-20 Thread Jorge Alonso Garcia
Hi Karl, On SharePoint the list view threshold is 150,000, but we only receive 20,000 from MCF [image: image.png] Jorge Alonso Garcia On Thu, Dec 19, 2019 at 19:19, Karl Wright () wrote: > If the job finished without error it implies that the number of documents > returned from this

Re: Manifoldcf server Error

2019-12-20 Thread Markus Schuch
Hi Priya, the container in which you are trying to interactively execute a command is no longer running. It is not possible to execute commands in stopped containers. The logger issues might be related to missing file system permissions, but that's a wild guess. Is there a "Caused by" part in the

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi All, When I try to execute a bash command inside the ManifoldCF container, I get an error. [image: image.png] And when checking the logs with sudo docker logs: 2019-12-19 18:09:05,848 Job start thread ERROR Unable to write to stream logs/manifoldcf.log for appender MyFile 2019-12-19 18:09:05,848 Seeding

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi Markus, Many thanks for your reply! I tried this approach to reproduce the scenario in a different environment, but the error I listed above occurs when I am crawling intranet sites that are accessible over a remote server. Also, I have used Transformation connectors:-Allow

Re: Manifoldcf server Error

2019-12-20 Thread Markus Schuch
Hi Priya, in my experience, I would focus on the OutOfMemoryError (OOME). 8 GB can be enough, but it doesn't have to be. First I would check whether the JVM is really getting the desired heap size. The dockerized environment makes that a little harder to find out, since you need to get access to the
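
A minimal sketch of the heap check suggested above, assuming you can run a small Java class with the same JVM options that ManifoldCF is started with. `HeapCheck` and `maxHeapMb` are illustrative names and not part of ManifoldCF; `Runtime.maxMemory()` is the standard JDK call.

```java
// Illustrative helper, not part of ManifoldCF: reports the maximum heap
// the running JVM will attempt to use, which reflects the effective -Xmx.
class HeapCheck {
    // Maximum heap in mebibytes, as reported by the JVM itself.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```

Run with the same `-Xmx` setting as the crawler, the printed value should be close to the configured heap; it can be somewhat lower depending on the garbage collector in use.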

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi Markus, The heap size is defined as 8 GB. In the ManifoldCF start-options-unix file, the Xmx etc. parameters are set to 8192 MB. It seems to be an issue with memory, and also with ManifoldCF communicating with the database. Do you explicitly define somewhere a connection timer when to

Re: Manifoldcf server Error

2019-12-20 Thread Markus Schuch
Hi Priya, your ManifoldCF JVM suffers from high garbage collection pressure: java.lang.OutOfMemoryError: GC overhead limit exceeded What is your current heap size? Without knowing that, I suggest increasing the heap size (java -Xmx...). Cheers, Markus On 20.12.2019 at 09:02, Priya

Re: sharepoint crawler documents limit

2019-12-19 Thread Jorge Alonso Garcia
Hi, The job finishes OK (several times) but always with these 2 documents; for some reason the loop only executes twice. Jorge Alonso Garcia On Thu, Dec 19, 2019 at 18:14, Karl Wright () wrote: > If they are all in one document, then you'd be running this code: > > >> > int

Re: sharepoint crawler documents limit

2019-12-19 Thread Karl Wright
If they are all in one document, then you'd be running this code: >> int startingIndex = 0; int amtToRequest = 1; while (true) { com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult itemsResult =
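
The quoted code is cut off in this preview. As a generic illustration of a paged fetch loop of this shape (not the SharePoint connector's actual code), the sketch below requests pages until the source returns nothing more. `PagedFetch`, `PagedSource`, and `fetchAll` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Generic paged-fetch sketch; PagedSource stands in for a remote list service.
class PagedFetch {
    // Hypothetical page source: returns up to 'count' items starting at 'start'.
    interface PagedSource {
        List<String> fetch(int start, int count);
    }

    // Requests pages until one comes back empty, collecting every item.
    static List<String> fetchAll(PagedSource source, int pageSize) {
        List<String> all = new ArrayList<>();
        int startingIndex = 0;
        while (true) {
            List<String> page = source.fetch(startingIndex, pageSize);
            if (page.isEmpty()) {
                break; // the source reports nothing further
            }
            all.addAll(page);
            startingIndex += page.size();
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            items.add("doc" + i);
        }
        // In-memory source that serves slices of the list above.
        PagedSource source = (start, count) ->
            items.subList(Math.min(start, items.size()),
                          Math.min(start + count, items.size()));
        System.out.println(fetchAll(source, 10).size()); // prints 25
    }
}
```

With a loop like this, the client collects only what the server hands back page by page, so a server-side cap would end the loop early without any client-side error.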

Re: sharepoint crawler documents limit

2019-12-19 Thread Karl Wright
If you are using the MCF plugin, and selecting the appropriate version of Sharepoint in the connection configuration, there is no hard limit I'm aware of for any Sharepoint job. We have lots of other people using SharePoint and nobody has reported this ever before. If your SharePoint connection

Re: sharepoint crawler documents limit

2019-12-19 Thread Jorge Alonso Garcia
Hi, The UI shows 20,002 documents (in a first phase it shows 10,001, and after some processing time it rises to 20,002). It looks like a hard limit; there are more files on SharePoint matching the criteria used. Jorge Alonso Garcia On Thu, Dec 19, 2019 at 16:05, Karl Wright () wrote: > Hi Jorge, > >

Re: sharepoint crawler documents limit

2019-12-19 Thread Karl Wright
Hi Jorge, When you run the job, do you see more than 20,000 documents as part of it? Do you see *exactly* 20,000 documents as part of it? Unless you are seeing a hard number like that in the UI for that job on the job status page, I doubt very much that the problem is a numerical limitation in

Re: sharepoint crawler documents limit

2019-12-19 Thread Jorge Alonso Garcia
Hi Karl, We have installed the SharePoint plugin, and can properly access http:/server/ _vti_bin/MCPermissions.asmx [image: image.png] SharePoint has more than 20,000 documents, but when we execute the job it only extracts these 20,000. How can I check where the issue is? Regards Jorge Alonso Garcia

Re: sharepoint crawler documents limit

2019-12-19 Thread Karl Wright
By "stop at 20,000" do you mean that it finds more than 20,000 but stops crawling at that time? Or what exactly do you mean here? FWIW, the behavior you describe sounds like you may not have installed the SharePoint plugin and may have selected a version of SharePoint that is inappropriate. All

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Karl Wright
Found the problem: needed to update a pom dependency. Everything passes now. Karl On Tue, Dec 17, 2019 at 8:07 PM Karl Wright wrote: > I just created a plugin directory at > https://svn.apache.org/repos/asf/manifoldcf/integration/solr-8.x/trunk . > Code committed there builds but it doesn't

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Karl Wright
I just created a plugin directory at https://svn.apache.org/repos/asf/manifoldcf/integration/solr-8.x/trunk . Code committed there builds but it doesn't test properly because of the following exception: >> [ERROR] Failed to execute goal

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Jörn Franke
Here you find it: https://issues.apache.org/jira/browse/CONNECTORS-1629 I will try it out this year I hope. I will try it though with Solr 8.3.1 and will take into account https://issues.apache.org/jira/browse/CONNECTORS-1586 On Tue, Dec 17, 2019 at 1:09 PM Karl Wright wrote: > Please do! >

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Karl Wright
Please do! Karl On Tue, Dec 17, 2019 at 7:06 AM Jörn Franke wrote: > Thanks a lot Karl for your feedback. Do you mind if I create a Jira where > I report on the progress? > > On 17.12.2019 at 12:22, Karl Wright wrote: > > > Well, you can certainly attempt this simply enough then if you

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Jörn Franke
Thanks a lot Karl for your feedback. Do you mind if I create a Jira where I report on the progress? > On 17.12.2019 at 12:22, Karl Wright wrote: > > > Well, you can certainly attempt this simply enough then if you build from > source. I'd prefer that you validate the approach before we

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-17 Thread Karl Wright
Well, you can certainly attempt this simply enough then if you build from source. I'd prefer that you validate the approach before we make permanent commits. Please let me know what works and what doesn't. Karl On Tue, Dec 17, 2019 at 1:22 AM Jörn Franke wrote: > I agree. > The delegation

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-16 Thread Jörn Franke
I agree. The delegation part is not relevant for me. I also do not believe it makes sense at the ETL level. I still think we need to add the one line of code that allows the use of Kerberos (second line in the example). > On 17.12.2019 at 01:35, Karl Wright wrote: > > > Hi Jorn, > > The code

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-16 Thread Karl Wright
Hi Jorn, The code referenced cannot be set up differently from connection to connection so there is no point in having this be anything other than global. In that case you can point at the config file with -D=value and it will do the same thing as setting a system property. The token delegation

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-16 Thread Jörn Franke
Thanks a lot for the quick reply. Actually it is here: https://lucene.apache.org/solr/guide/8_3/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr It is also available in the previous versions of Solr. I wonder how easy it would be to add a configuration to the Manifold UI to

Re: Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-16 Thread Karl Wright
The Solr Output Connector uses a patched HttpComponents/HttpClient for communication with the various Solr Cloud replicas, along with custom versions of some of the SolrJ classes which allow multipart posts to work. Other than that it's standard SolrJ. Whatever SolrJ needs to work with Kerberos,

Solr Output Connector: SolrCloud with Kerberos / Zookeeper with Kerberos

2019-12-16 Thread Jörn Franke
Hallo, does the Solr Output Connector support SolrCloud with Kerberos authentication and Zookeeper with Kerberos authentication? If so, how can this be configured? If it is not supported, is there an "easy" way to integrate this? From a development perspective the Kerberos Authentication with

Put A Document to arbitrary API

2019-12-09 Thread Kayak28
Hello, Manifold CF Community Members: I would like to put a document to a web API that accepts JSON documents from MCF. So, for example, in a directory, a lot of JSONs are stored. I want MCF to read the JSON from the directory ( with FileSystem Repository Connection) and send the document to API

Re: About Manifold CF API

2019-11-28 Thread Karl Wright
Hi Kaya, The best way to form proper JSON is to create a job with the UI and export its JSON, and use that as a model. Thanks, Karl On Thu, Nov 28, 2019 at 3:05 AM Kayak28 wrote: > Hello, Community Members: > > I have a question about the form of JSON when I call a job-creation API. > I would

About Manifold CF API

2019-11-28 Thread Kayak28
Hello, Community Members: I have a question about the form of JSON when I call a job-creation API. I would like to use the following API. jobs POST Create a job {"job":**} {"job_id":**} *OR* {"error":**} The URL I should send with curl POST command is:

Re: Continues Job

2019-11-26 Thread Sreejith Variyath
Thanks Karl. On Tue, Nov 26, 2019 at 1:39 PM Karl Wright wrote: > No, just changing the job characteristics will NOT cause the incremental > behavior to be erased. > > Karl > > > On Mon, Nov 25, 2019 at 10:20 PM Sreejith Variyath < > sreejith.variy...@tarams.com> wrote: > >> Yes. I understood.

Re: Continues Job

2019-11-26 Thread Karl Wright
No, just changing the job characteristics will NOT cause the incremental behavior to be erased. Karl On Mon, Nov 25, 2019 at 10:20 PM Sreejith Variyath < sreejith.variy...@tarams.com> wrote: > Yes. I understood. Thanks Karl. > > I have another question. If I update job type from

Re: Continues Job

2019-11-25 Thread Sreejith Variyath
Yes, I understand. Thanks Karl. I have another question. If I update the job type from TYPE_SPECIFIED to TYPE_CONTINUOUS, will the document versioning reset and the job pick up all the documents again? On Tue, Nov 26, 2019, 05:12 Karl Wright wrote: > One of the characteristics of continuous

Re: Continues Job

2019-11-25 Thread Karl Wright
One of the characteristics of continuous jobs is that they call addSeedDocuments multiple times on a single job run. The job run never ends, so this is how the job picks up documents for the infinitely-running job. That's just the way it works. Have you read the book? Karl On Mon, Nov 25,

Continues Job

2019-11-25 Thread SREEJITH va
Hi Everyone, I am trying to set up a job that has a JDBC repository connector, one transformation connector, and a custom output connector. I want this job to run in two modes. - Sample Mode: This is a sample migration mode. The job will pick 10 documents and migrate them to output

Re: Manifoldcf version conflict

2019-11-19 Thread Karl Wright
I was incorrect. The value comes from one of the properties: Karl On Tue, Nov 19, 2019 at 6:16 AM Priya Arora wrote: > I am using docker commands to install manifoldcf inside docker container. > So what I understand is that mcf downloads latest crawler-ui.war files in > the web

Re: Manifoldcf version conflict

2019-11-19 Thread Priya Arora
I am using docker commands to install ManifoldCF inside a docker container. So what I understand is that MCF downloads the latest crawler-ui.war files into the web folder (that is what I checked on the local system). Do I need to check somewhere else? [image: image.png] On Tue, Nov 19, 2019 at 4:40 PM

Re: Manifoldcf version conflict

2019-11-19 Thread Karl Wright
That version comes directly from the ant build version that was used to compile the UI. What version of crawler-ui.war do you have? Karl On Tue, Nov 19, 2019 at 5:50 AM Priya Arora wrote: > Hi All, > > I have upgraded manifoldcf version on the server to version 2.14, I > re-confirmed it via

Manifoldcf version conflict

2019-11-19 Thread Priya Arora
Hi All, I have upgraded ManifoldCF on the server to version 2.14, and I re-confirmed via the docker build command that it is downloading version 2.14 only. [image: image.png] But when I start up ManifoldCF, the version shows up as 2.10, as shown in the screenshot above. Is there

Re: Specifications of HopFilters "Keep unreachable documents"

2019-11-13 Thread Kayak28
Hello, Mr. Karl, Mr. Issei, and Community members. I have a similar issue with Mr.Issei. Here is a sample website structure that I want to crawl with MCF. index.html -link to -> sample1.html -link to-> sample2.html I made this sample website to explore the behavior of "Hop count mode." The

Re: Windows shares connector-Error

2019-11-10 Thread Karl Wright
Can you do the following: >> C:\wip\mcf\trunk>dir lib\less* Volume in drive C is Windows Volume Serial Number is F4D8-E4E0 Directory of C:\wip\mcf\trunk\lib 09/06/2019 02:52 PM 1,304,630 less4j-1.17.2.jar 1 File(s) 1,304,630 bytes 0 Dir(s)

Re: Specifications of HopFilters "Keep unreachable documents"

2019-11-08 Thread Issei Nishigata
Hi Karl, Thank you for the quick response. It seems that I have completely misunderstood the specifications, so it'd be helpful if you could show specific examples for each Hop count mode. Is my understanding below correct? - "keep unreachable documents, for now" and "... forever" is the

Re: Windows shares connector-Error

2019-11-08 Thread Karl Wright
(1) Download source distribution and lib distribution (2) Unpack and follow directions for placing lib folder in place (3) Run 'ant make-deps' to download the correct version of jcifs (4) Run "ant build" to make a distribution that includes proprietary examples (5) Use the proprietary example you

Re: Windows shares connector-Error

2019-11-08 Thread Priya Arora
Even this didn't work. Does this (ManifoldCF version 2.14) also have something to do with the Java version? If so, my JAVA_HOME points to Java version 8. Can you suggest something? On Fri, Nov 8, 2019 at 4:16 PM Sreejith Variyath < sreejith.variy...@tarams.com> wrote: > place the jcifs.jar into the

Re: Windows shares connector-Error

2019-11-08 Thread Sreejith Variyath
place the jcifs.jar into the *connector-lib-proprietary* directory On Fri, Nov 8, 2019 at 2:38 PM Priya Arora wrote: > Hi All > > I installed the 2.14 version of manifoldcf , then uncommented the line in > connectors.xml file "" , but when I try to start with(java- jar start.jar) gives error: >

Windows shares connector-Error

2019-11-08 Thread Priya Arora
Hi All, I installed version 2.14 of ManifoldCF, then uncommented the line in the connectors.xml file "", but when I try to start it with (java -jar start.jar) it gives an error: I also checked that mcf-jcifs-connector.jar is present in connector-lib. Do I need to do something else as well? Here is the

Re: Illegal transaction ID/parent transaction ID

2019-11-07 Thread SREEJITH va
Ok thanks Karl. On Fri, Nov 8, 2019, 02:20 Karl Wright wrote: > Have you tried deploying the combined war on tomcat instead? > > I honestly do not know what is wrong but if the combined war works you > have something to compare/contrast against. > > Karl > > > On Thu, Nov 7, 2019 at 2:45 PM

Re: Illegal transaction ID/parent transaction ID

2019-11-07 Thread Karl Wright
Have you tried deploying the combined war on tomcat instead? I honestly do not know what is wrong but if the combined war works you have something to compare/contrast against. Karl On Thu, Nov 7, 2019 at 2:45 PM SREEJITH va wrote: > Thanks Karl, Here is quick summary on how I embedded

Re: Illegal transaction ID/parent transaction ID

2019-11-07 Thread SREEJITH va
Thanks Karl, Here is quick summary on how I embedded Manifold in my application. - All the required manifold jar dependencies are in pom. - The properties.xml is served through org.apache.manifoldcf.configfile settings in catalina.properties - There is an application ready Lister

Re: Illegal transaction ID/parent transaction ID

2019-11-07 Thread Karl Wright
How are you embedding ManifoldCF in your application? What looks like is happening is that thread contexts are being lost somehow. ManifoldCF uses thread contexts to keep track of worker thread-local information, and it appears that you are calling into ManifoldCF code assuming that (for

Illegal transaction ID/parent transaction ID

2019-11-07 Thread SREEJITH va
Hi All, I have a Spring-based application in which Manifold is embedded and running in Tomcat. At some point I am getting the exceptions below. Any lead on why this is happening would be greatly appreciated. One scenario in which I can see this in my logs is while shutting down Tomcat. And if it

Re: Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Many thanks!! On Wed, Nov 6, 2019 at 12:14 PM SREEJITH va wrote: > Yes. Exactly. > > On Wed, Nov 6, 2019 at 12:04 PM Priya Arora wrote: > >> Hi Sir, >> >> >> >> This connector code I am looking for , does JCIF connector is the same ? >> [image: image.png] >> >> Thanks >> Priya >> >> On Wed,

Re: Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Hi Sir, This is the connector code I am looking for; is the JCIFS connector the same? [image: image.png] Thanks Priya On Wed, Nov 6, 2019 at 11:58 AM SREEJITH va wrote: > I think you are searching for JCIFS connector. Its the windows repository > connector. Its in manifoldcf\connectors\jcifs > >

Re: Window shares-Repository Type Source Code

2019-11-05 Thread SREEJITH va
I think you are searching for the JCIFS connector. It's the Windows repository connector. It's in manifoldcf\connectors\jcifs On Wed, Nov 6, 2019 at 11:19 AM Priya Arora wrote: > Hi, > > I need to implement "Window Shares" type Repository connection type and > needs to access and understand code

Re: Manifoldcf - Job Deletion Process

2019-11-05 Thread Priya Arora
When I created a new job and followed the lifecycle/execution of an identifier, it didn't start the deletion process. There was no change in the job configuration or in the database start-up and configuration. On Sat, Nov 2, 2019 at 12:41 AM Priya Arora wrote: > No, I am not deleting

Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Hi, I need to implement a "Window Shares" type repository connection and need to access and understand the code first. But I am unable to find its code at path:-

Re: Manifoldcf - Job Deletion Process

2019-11-01 Thread Priya Arora
No, I am not deleting the job after it runs. Its status gets updated to ‘Done’ after all processing. The process involves indexation of documents, and just before the job ends the deletion process is executed. The sequence is fetch etc., indexation, extract other processes, deletion, then job

Re: Manifoldcf - Job Deletion Process

2019-10-30 Thread Karl Wright
Ok, so pick ONE of these identifiers. What I want to see is the entire lifecycle of the ONE identifier. That includes what the Web Connection logs as well as what the indexation logs. Ideally I'd like to see: - job start and end - web connection events - indexing events I'd like to see these

Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
Indexation screenshot is as below. [image: image.png] On Tue, Oct 29, 2019 at 7:57 PM Karl Wright wrote: > I need both ingestion and deletion. > Karl > > > On Tue, Oct 29, 2019 at 8:09 AM Priya Arora wrote: > >> History is shown as below as it does not indicates any error. >> [image: 12.JPG]

Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Karl Wright
I need both ingestion and deletion. Karl On Tue, Oct 29, 2019 at 8:09 AM Priya Arora wrote: > History is shown as below as it does not indicates any error. > [image: 12.JPG] > > Thanks > Priya > > On Tue, Oct 29, 2019 at 5:02 PM Karl Wright wrote: > >> What does the history say about these

Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
The history is shown below; it does not indicate any error. [image: 12.JPG] Thanks Priya On Tue, Oct 29, 2019 at 5:02 PM Karl Wright wrote: > What does the history say about these documents? > Karl > > On Tue, Oct 29, 2019 at 6:53 AM Priya Arora wrote: > >> >> it may be that (a) they

Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Karl Wright
What does the history say about these documents? Karl On Tue, Oct 29, 2019 at 6:53 AM Priya Arora wrote: > > it may be that (a) they weren't found, or (b) that the document > specification in the job changed and they are no longer included in the job. > > URL's that were deleted are valid

Re: Manifold with OpenJDK

2019-10-18 Thread SREEJITH va
Hi, JAVA_HOME is set to /usr/lib/jvm/java-8-openjdk-amd64 On Fri, Oct 18, 2019 at 11:26 AM Priya Arora wrote: > Hi Sreejith, > > Can you please let me know,,JAVA_HOME variable value you have set. > > Thanks > Priya > > On Thu, Oct 17, 2019 at 12:17 PM SREEJITH va > wrote: > >> Hi, We use

Re: Manifold with OpenJDK

2019-10-17 Thread Priya Arora
Hi Sreejith, Can you please let me know the JAVA_HOME variable value you have set? Thanks Priya On Thu, Oct 17, 2019 at 12:17 PM SREEJITH va wrote: > Hi, We use manifold with openjdk version "1.8.0_222" in Centos and did not > faced any issues. > > On Thu, Oct 17, 2019 at 12:04 PM Bisonti Mario

Re: Manifold with OpenJDK

2019-10-17 Thread SREEJITH va
Hi, We use Manifold with openjdk version "1.8.0_222" on CentOS and have not faced any issues. On Thu, Oct 17, 2019 at 12:04 PM Bisonti Mario wrote: > Hallo, I use Ubuntu 18.04.02 LTS with: > openjdk version "11.0.4" 2019-07-16 > > > > And I have no issue with ManifoldCF > > > > Mario > > > >

Re: Manifold with OpenJDK

2019-10-17 Thread Bisonti Mario
Hello, I use Ubuntu 18.04.02 LTS with: openjdk version "11.0.4" 2019-07-16 And I have no issue with ManifoldCF Mario From: Markus Schuch Sent: Thursday, October 17, 2019 07:35 To: user@manifoldcf.apache.org; Praveen Bejji Subject: Re: Manifold with OpenJDK Hi Praveen, we use openjdk 8 in

Re: Manifold with OpenJDK

2019-10-16 Thread Markus Schuch
Hi Praveen, we use openjdk 8 in dockered red hat linux for 2 years now and didn't have problems with it. We had one minor issue when we migrated: the image processing capabilities of openjdk are somehow different from Oracle JDK. One of our connectors creates image thumbnails and on openjdk

Re: Manifold with OpenJDK

2019-10-16 Thread Karl Wright
I use it this way all the time. Karl On Wed, Oct 16, 2019 at 11:32 AM Praveen Bejji wrote: > Hi, > > We are planning on using ManifoldCF with Open JDK 1.8 on Linux server. > Can you please let us know if there are any known issues/challenges on > using ManifldCF with Open JDK? > > > Thanks, >

Manifold with OpenJDK

2019-10-16 Thread Praveen Bejji
Hi, We are planning on using ManifoldCF with OpenJDK 1.8 on a Linux server. Can you please let us know if there are any known issues/challenges with using ManifoldCF with OpenJDK? Thanks, Praveen

Re: Box connector

2019-10-12 Thread Karl Wright
If there is such a connector, I don't know about it. Hopefully we'll find out soon if somebody has developed one on the outside they're willing to contribute or make available. Karl On Fri, Oct 11, 2019 at 2:17 PM SREEJITH va wrote: > Hi, I am working on a document migration project, which

Box connector

2019-10-11 Thread SREEJITH va
Hi, I am working on a document migration project that requires migrating documents to the Box ( https://www.box.com/) system. Is there an existing output connector for Box, or any development in progress?

Re: Job Multiple Outputs

2019-09-10 Thread julien.massiera
Thanks for your answer Karl. I was unsure about that concerning the output connections but it is still the same pipeline after all. Original message From: Karl Wright Date: 10/09/2019 20:08 (GMT+01:00) To: user@manifoldcf.apache.org Subject: Re: Job Multiple Outputs Hi

Re: Job Multiple Outputs

2019-09-10 Thread Karl Wright
Hi Julien, You must understand that a job with a complex pipeline is really not running N independent jobs; it's running ONE job. Every document is processed through the pipeline only once. The pipeline may have faster components and slower components; doesn't matter; the document takes the sum

Re: Job Multiple Outputs

2019-09-10 Thread Julien Massiera
Ok, so to be sure I understood what you are saying: suppose a job with two output connections and one of the outputs is twice as fast as the other at indexing documents. At a given time t, both outputs will have indexed the same number of documents, no matter if one output is

Re: Job Multiple Outputs

2019-09-10 Thread Karl Wright
The output connection contract is that a request to index is made to the connector, and the connector returns when it is done. When there are multiple output connections, these are each handed a copy of the document, one after the other, and told to index it. This is all done by one worker
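
A minimal sketch of the contract described above, assuming each output connection is reduced to a single blocking call. `SequentialOutputs`, `Output`, and `process` are hypothetical names for illustration and are not ManifoldCF classes.

```java
import java.util.ArrayList;
import java.util.List;

class SequentialOutputs {
    // Hypothetical stand-in for an output connection: index() returns when done.
    interface Output {
        void index(String document);
    }

    // One worker hands the same document to each output in turn, so the time
    // spent on a document is the sum of the time spent in every output.
    static void process(String document, List<Output> outputs) {
        for (Output output : outputs) {
            output.index(document);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Output> outputs = new ArrayList<>();
        outputs.add(doc -> log.add("fast:" + doc));
        outputs.add(doc -> log.add("slow:" + doc));
        for (String doc : new String[] {"a", "b"}) {
            process(doc, outputs);
        }
        System.out.println(log); // prints [fast:a, slow:a, fast:b, slow:b]
    }
}
```

In this model the faster output never gets ahead of the slower one: document "b" is not offered to either output until both have finished with "a".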

Job Multiple Outputs

2019-09-10 Thread Julien Massiera
Hi, I would like to have an explanation about the behavior of a job when several outputs are configured. My main question is : for each output, how is the docs ingestion managed ? More precisely, are the ingest processes synchronized or not ? (in other words, is the ingestion of the next

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl, yes, this helps. The webpage is now ingested after Tika extraction and I only have to include the mime type text/html in the Solr output connection. Many thanks. Cheers Markus On 23.08.2019 at 13:45, Karl Wright wrote: > Created a ticket: CONNECTORS-1621. Added a fix. Please let

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if it resolves the problem for you. Thanks, Karl On Fri, Aug 23, 2019 at 7:33 AM Karl Wright wrote: > Hi Markus, > > You are correct. > This code was added as part of > https://issues.apache.org/jira/browse/CONNECTORS-1482 .

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
Hi Markus, You are correct. This code was added as part of https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code that was added does look at the content mime type. The reason that the mime type is not modified in the document being passed to Solr by Tika is because we want Solr to

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
I already have "update" in the handler field. One can see that in the gist link i posted and it is not working. The HttpPoster of the SolrConnector takes RepositoryDocument.getMimeType() and checks the mime type against the hardcoded plain text mime type list, if solr cell mode (extracting

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Karl Wright
There are two possible ways to configure Tika with Solr. First way: Tika extractor + Solr update handler Second way: no Tika extractor + Solr update/extract handler For the first way, the Solr Connector completely ignores any "accepted mime types" you set for it, and only accepts text/plain. For

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-23 Thread Markus Schuch
Hi Karl, what do i have to do to make tika declare the extracted plain text with mime type text/plain in my setup? As i said, i have a tika extractor in place: Pipeline: 1) Webcrawler Connector (Repository Connection) 2) Tika Extractor (Transformation) 3) Solr Connector (Output

Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

2019-08-22 Thread Karl Wright
Hi Markus, If you use the straight update handler, with no Tika filter, then the Solr Connector by design restricts input to textual documents. We can perhaps broaden that to web pages but then you will be indexing HTML tags as well and I rather doubt that's what you want. If you run Tika

Re: Solr Repository Connector

2019-08-14 Thread Dileepa Jayakody
Hi Rafa, Thank you for your valuable suggestions. On Tue, Aug 13, 2019 at 5:25 PM Rafa Haro wrote: > Hi Dileepa, > > IMHO, Furkan's approach makes the most sense here. As Olivier pointed out, > to retrieve the original content from a Lucene based index, all the fields > you are interested in

Re: Solr Repository Connector

2019-08-06 Thread Dileepa Jayakody
Hi All, Thank you for your replies. @Furkan, Olivier, thanks for the pointers. I will check the approach of the Solr repository connector as per given references. @Olivier if you can contribute the Solr repo-connector you are working on, to MCF that will be awesome! Will be looking forward to an

Re: Solr Repository Connector

2019-08-05 Thread Olivier Tavard
Hello, We are currently working on this kind of repository connector for a customer. We plan to give the code to the MCF project if the customer lets us do it legally. We will know it at the end of the month or at the beginning of next month. In order to have this working, all the fields of

Re: Solr Repository Connector

2019-08-05 Thread Furkan KAMACI
Hi Dileepa, Writing a custom repository connector can let you achieve your goal. Read and directly write to an output connector. You should check your requirements, i.e. which data sources you will connect to. MCF may spare you huge integration pains compared to many other ETL tools in your case. On

Re: Solr Repository Connector

2019-08-05 Thread Karl Wright
I would strongly suggest going directly to the repositories rather than the Solr index, where possible, as the source for the documents you are indexing. This is MCF's standard use case. It is meant to handle disparate repositories all going into a single output. Effort is made in every

Re: Solr Repository Connector

2019-08-05 Thread Dileepa Jayakody
Hi Karl and all, In my use-case, one of the data-sources is an already populated Solr index which is an e-commerce web-site data index (customers, products & services). Apart from the Solr Index, I need to ingest several other heterogeneous data-sources such as PostgreSQL databases, CRM data etc

Re: Solr Repository Connector

2019-08-05 Thread Karl Wright
If you are trying to extract data from a Solr index, I know of no way to do that. Karl On Mon, Aug 5, 2019 at 9:08 AM Dileepa Jayakody wrote: > Hi All, > > Thanks for your replies. > I'm looking for a repository connector. I've used the Solr output > connector before. But now what I need is to

Re: Solr Repository Connector

2019-08-05 Thread Dileepa Jayakody
Hi All, Thanks for your replies. I'm looking for a repository connector. I've used the Solr output connector before. But now what I need is to connect to a solr index as a repository and retrieve the documents from there. So I need a Solr repository connector. @Karl I will look at the Solr

Re: Solr Repository Connector

2019-08-05 Thread Cihad Guzel
Hi Dileepa, You can check the list of all MCF connectors at https://manifoldcf.apache.org/release/release-2.13/en_US/included-connectors.html MCF has a Solr Output Connector. It is not a repository connector. If you want to use Solr as a repository, you should write a new repository connector.

Re: Solr Repository Connector

2019-08-05 Thread Karl Wright
If you use Solr Cloud, ManifoldCF's Solr Connector should work for you. Karl On Mon, Aug 5, 2019 at 6:18 AM Dileepa Jayakody wrote: > Hi All, > > I'm working on a project which needs to implement a federated search > solution with heterogeneous data repositories. One repository is a Solr >

Solr Repository Connector

2019-08-05 Thread Dileepa Jayakody
Hi All, I'm working on a project which needs to implement a federated search solution with heterogeneous data repositories. One repository is a Solr index. I would like to use ManifoldCF as the data ingestion engine in this project as I have worked with MCF before. Does ManifoldCF have a Solr

How to re-index a part of documents ?

2019-07-30 Thread SAUNIER Maxence
Hello Karl, What exactly can I modify in the database to tell ManifoldCF to reindex certain documents? (Rows and table) I would like it to revisit some documents according to certain scripted conditions. Thank you

Re: Reg. Manifold Indexing performance

2019-07-17 Thread Karl Wright
Hi Praveen, If there is a broken query plan, it will show up in the ManifoldCF log; any query that takes more than 60 seconds to run gets dumped and explained. So it should be possible to rule that out with low effort. The kind of situation I have seen with very large document jobs is that

Reg. Manifold Indexing performance

2019-07-17 Thread Praveen Bejji
Hi, We are trying to index close to one million documents using the Documentum connector. Indexing is working fine, but we see a drop in indexing performance after the first day. The connector is able to index 21k/hr on the first day, but it drops to 10k/hr after 24-28 hours. Although we don't see any errors

Re: Reg. unstable Manifold instance

2019-07-17 Thread Praveen Bejji
Thanks Karl. Yes, I do agree that looking at the Jetty logs should give us some clue. I will check and get back on this. On Tue, Jul 16, 2019 at 2:31 PM Karl Wright wrote: > Hosting on a different app server is something you could easily do. Or, > since this takes many months before it

Re: Reg. unstable Manifold instance

2019-07-16 Thread Karl Wright
Hosting on a different app server is something you could easily do. Or, since this takes many months before it appears, you might just live with it. But first, there should be access logs the Jetty writes to. It should be possible for you to see what's happening from those logs if you can find

Re: Reg. unstable Manifold instance

2019-07-16 Thread Praveen Bejji
@Michael, There are no errors in the logs. The app just goes down abruptly. @Karl, Assuming that the Jetty server has some issue, what do you suggest? Is hosting Manifold on some other server (say Tomcat) an alternative? On Mon, Jul 15, 2019 at 9:04 AM Michael Cizmar wrote: > Are there

Re: Documentum connection not working

2019-07-16 Thread Bisonti Mario
Hello. Thanks, I didn’t read the documentation about the sidecar Documentum process. Thanks a lot. From: Karl Wright Sent: Tuesday, July 16, 2019 13:20 To: user@manifoldcf.apache.org Subject: Re: Documentum connection not working Are you running the documentum connector sidecar processes?
