Jetty Config changes

2024-03-11 Thread ritika jain
Hi All, When Manifoldcf starts with start.jar, it creates an entry in the system's tmp folder, but the entry is not automatically cleaned up when the server/manifold stops. On my live server I am using a dockerised environment, where we have a mechanism that restarts manifold whenever required; every time

Re: Manifoldcf -XML parsing error: Character reference "" is an invalid XML character.

2022-12-22 Thread ritika jain
Can anybody provide any clue on this? It would be of great help. On Thu, Dec 22, 2022 at 5:33 PM ritika jain wrote: > Hi all, > > I am using Manifoldcf 2.21 version with Windows shares connector and > Output as Elastic. > I am facing this error while clicking "List all jobs

Manifoldcf -XML parsing error: Character reference "" is an invalid XML character.

2022-12-22 Thread ritika jain
Hi all, I am using Manifoldcf version 2.21 with the Windows shares connector and Elastic as output. I am facing this error while clicking "List all jobs". In Manifoldcf, jobs are run/created in such a way that our API creates a manifold job object and thus creates/starts a job in manifold

Frequent error while window shares job

2022-08-22 Thread ritika jain
Hi All, I have a Windows shares job to crawl files from a samba server; it's a huge job that crawls millions of documents (about 10 million). While running the job, we encounter two types of errors very frequently. 1) WARN 2022-08-19T17:17:05,175 (Worker thread '7') - JCIFS: Possibly transient exception

Job Service Interruption- and stops

2022-04-29 Thread ritika jain
Hi All, With the Windows shares connector, on the server I am getting this exception, and due to repeated service interruptions the *job stops.* Error: Repeated service interruptions - failure processing document: The process cannot access the file because it is being used by another process. How we

Log4j Update Doubt

2022-03-15 Thread ritika jain
Hi, How does Manifoldcf use the log4j files in the bin directory/distribution? If this is the location "D:\\Manifoldcf\apache-manifoldcf-2.14\lib", that is the lib folder only (for physical file presence). Also, if the log4j dependency issue has been resolved and the version 2.15 or higher is updated, then

Manifoldcf freezes and sit idle

2022-01-31 Thread ritika jain
Hi, I am using Manifoldcf 2.14, the web connector, and Elastic as output. I have observed that after running continuously for a certain period, the job freezes and does not process anything. Simple history shows nothing after a certain process, and it's not just one job; it has been observed for 3 different jobs

Log4j dependency

2021-12-14 Thread ritika jain
Hi All, How does manifold.cf use log4j? When I checked the pom.xml of the ES connector, it is shown as an *exclusion* of the maven dependency. [image: image.png] But when checked in the project's downloaded dependencies, it shows it being used and downloaded. [image: image.png] How does manifold use log4j

Two profiles of manifoldcf

2021-12-03 Thread ritika jain
Hi All, Can we create two different username/password pairs for the crawler UI of Manifoldcf? I tried configuring two user profiles in properties.xml, but it's not working. Is there a way to do that? Thanks Ritika

Manifoldcf background process

2021-11-17 Thread ritika jain
Hi All, I would like to understand the background process of Manifoldcf Windows shares jobs, and how it processes the paths mentioned in the job's configuration. I am creating a dynamic job via the API using PHP which will pick up approximately 70k documents and a dynamic job with 70k of different paths

Re: Manifold Job process issue

2021-11-15 Thread ritika jain
eue. > Karl > > > On Tue, Nov 9, 2021 at 7:08 AM ritika jain > wrote: > >> I have checked, there is only one hour time difference between docker >> container and docker host >> >> On Tue, Nov 9, 2021 at 4:41 PM Karl Wright wrote: >> >>> If

Re: Manifold Job process issue

2021-11-09 Thread ritika jain
F uses > that to manage throttling etc. I don't know if that is the correct > explanation but it's the only thing I can think of. > > Karl > > > On Tue, Nov 9, 2021 at 4:56 AM ritika jain > wrote: > >> >> Hi All, >> >> I am using window share

Manifold Job process issue

2021-11-09 Thread ritika jain
Hi All, I am using the Windows shares connector, Manifoldcf 2.14 and ES as output. I have configured a job to process 60k documents. Also, these documents are new and do not have corresponding values in the DB and ES index. So ideally it should process/index the documents as soon as the job starts.

Re: Duplicate key error

2021-10-27 Thread ritika jain
So, can it be left as it is? It is preventing the job from completing, and the job is stopping. On Tue, Oct 26, 2021 at 8:40 PM Karl Wright wrote: > That's a database bug. All of our underlying databases have some bugs of > this kind. > > Karl > > > On Tue, Oct 26, 2021 a

[no subject]

2021-10-26 Thread ritika jain
Hi All, We are using Manifoldcf 2.14 with the Web connector and ES connector. After the job has been running for a certain time (jobs ingest lakhs of documents), we got this error on PROD. Can anybody suggest what could be the problem? PRODUCTION MANIFOLD ERROR: Error: ERROR: duplicate key value

Windows Shares job-Limit on defining no of paths

2021-10-25 Thread ritika jain
Hi, Is there any limit on the number of paths we can define in a job using Windows Shares as the repository and ES as the output? Thanks

Null Pointer Exception

2021-10-25 Thread ritika jain
Hi, I am getting Null pointer exceptions while creating a job programmatically via PHP. Can anybody suggest the reason for this? Error 500 Server Error HTTP ERROR 500 Problem accessing /mcf-api-service/json/jobs. Reason: Server Error Caused by: java.lang.NullPointerException at

Tika Parser Issue

2021-09-07 Thread ritika jain
Hi All, I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika dependencies in Manifoldcf version 2.14. I am getting some issues while parsing *PDF* files. Some strange characters appeared; I also tried changing the Tika jar files to versions 1.24 and 1.27 (it didn't even extract files correctly).

Re: Query:JCIFS connector

2021-08-23 Thread ritika jain
Does anybody have a clue on this? On Fri, Aug 20, 2021 at 12:33 PM ritika jain wrote: > Hi All, > > I am having a query , is there any way using which we can mention > subdirectories' path in the file spec of Window shares connector. > > Like my requirement is to mention Mos

Query:JCIFS connector

2021-08-20 Thread ritika jain
Hi All, I have a query: is there any way to mention subdirectories' paths in the file spec of the Windows shares connector? My requirement is to mention the topmost folder in the hierarchy on top, as shown in the screenshot, and in the file spec the requirement is to mention the file name

Job Deletion query

2021-08-12 Thread ritika jain
Hi All, When we delete a job in Manifoldcf, does it also delete the documents indexed via that job from the Elastic index? I understand that when a job is deleted from the Manifoldcf interface it will delete all the documents referenced via that job from postgres. But why is it deleted from

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
Seems to be working now!!! Thanks a lot !!! On Wed, Aug 11, 2021 at 6:22 PM ritika jain wrote: > Hi , > > Yes this works only the difference is when a single file is ingested we > are having ingested one as C:/Users/Dell/Desktop/abc.txt/.-with a UNWANTED > slash at end > >

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
file name. > > Karl > > > On Wed, Aug 11, 2021 at 2:14 AM ritika jain > wrote: > >> *Dynamic Job * >> >> {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description"

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
ute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\

Window shares dynamic Job issue

2021-08-10 Thread ritika jain
Hi All, I am using the Windows shares connector in Manifoldcf version 2.14 and Elastic as output. I have created a dynamic Manifoldcf job API via which a job will be created in Manifoldcf with an inclusions list and path; only a particular file path is to be mentioned. Example file path:-

Re: Manifoldcf Redirection process

2021-05-28 Thread ritika jain
> > Is the process the same when fetch/process status code returned is 302 ? >>> When running a job with web crawler and ES output connector >>> >> can anybody have a clue about this

Re: Manifoldcf Redirection process

2021-05-20 Thread ritika jain
L onto the document > queue. > When it gets to the new URL, it processes it like any other. > > Karl > > > On Wed, May 19, 2021 at 8:32 AM ritika jain > wrote: > >> Hi >> >> I want to understand the process of "How does manifold.cf handles
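
The behaviour described in this reply (the redirect target is put onto the document queue and later processed like any other URL) can be illustrated with a toy queue-based crawler. This is only a sketch under stated assumptions, not ManifoldCF's code: `FAKE_SITE`, `fake_fetch` and `crawl` are hypothetical names, and the HTTP layer is replaced by a small dictionary so the example runs without a network.

```python
from collections import deque

# Hypothetical stand-in for real HTTP fetches:
# URL -> (status code, redirect target or None).
FAKE_SITE = {
    "http://example.com/old": (301, "http://example.com/new"),
    "http://example.com/new": (200, None),
}


def fake_fetch(url):
    """Return (status, location) for a URL; unknown URLs answer 404."""
    return FAKE_SITE.get(url, (404, None))


def crawl(seeds, fetch=fake_fetch):
    """Return the list of URLs that end up indexed.

    A 301/302 response is not indexed itself; its target is appended to
    the queue and handled like any other document when its turn comes.
    """
    queue = deque(seeds)
    seen = set(seeds)  # guards against redirect loops and duplicates
    indexed = []
    while queue:
        url = queue.popleft()
        status, location = fetch(url)
        if status in (301, 302) and location:
            if location not in seen:
                seen.add(location)
                queue.append(location)
        elif status == 200:
            indexed.append(url)
    return indexed


if __name__ == "__main__":
    print(crawl(["http://example.com/old"]))
```

In this toy model only the final URL is indexed; whether the original URL's index entry is removed in a real crawl depends on the connector and is not shown here.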

Manifoldcf Redirection process

2021-05-19 Thread ritika jain
Hi, I want to understand how manifold.cf handles URL redirection in the case of the web crawler connector. If there is a page redirect (through a 301) to another URL, then the next crawl will detect the redirect, index the new (final) URL, and display it in the search results.

Re: Interrupted while acquiring credits

2021-05-14 Thread ritika jain
Karl Wright wrote: > ' > > *JCIFS: Possibly transient exception detected on attempt 1 while getting > share security'Yes, it is going to retry.* > > *Karl* > > On Fri, May 14, 2021 at 1:45 AM ritika jain > wrote: > >> Hi, >> I am using Windows shares conn

Interrupted while acquiring credits

2021-05-13 Thread ritika jain
Hi, I am using the Windows shares connector in Manifoldcf 2.14, with the ElasticSearch connector as the output connector and Tika and Metadata adjuster as transformation connectors. I am trying to crawl files from an SMB server on a 64 GB server, and the start options file of Manifoldcf is given 32GB of memory

Manifoldcf Deletion Process

2021-03-30 Thread ritika jain
Hi All, I want to understand the Manifoldcf deletion process, i.e. in which cases the Deletion process (as seen in Simple History) executes. One case, as far as I know, is whenever the seed URL of a particular job is changed. What are all the cases in which the Deletion process runs? My

Re: Job Content Length issue

2021-02-17 Thread ritika jain
loaded into memory. That is why we require you to fill in a Solr field > on those kind of output connections that limits the number of bytes. > > Karl > > > On Tue, Feb 16, 2021 at 8:45 AM ritika jain > wrote: > >> >> >> Hi users >> >> >> I a

Job Content Length issue

2021-02-16 Thread ritika jain
Hi users, I am using the Manifoldcf 2.14 Fileshare connector to crawl files from an SMB server which has millions to billions of records to process and crawl. Total system memory is 64 GB, of which 32 GB is allotted in the start options file of manifold. We have some larger files to crawl around

content length tab

2021-02-15 Thread ritika jain
Hi Users, Can anybody tell me whether this should be filled in as bytes or kilobytes? The "Content Length" tab looks like this: [image: Windows Share Job, Content Length tab] If the value is filled in as 100, will this be 100 bytes, 100 kilobytes, or 100 MB? Thanks Ritika

Re: Indexation Not OK

2020-12-31 Thread ritika jain
Elastic search output connector with some custom changes for some fields On Thursday, December 31, 2020, Karl Wright wrote: > Hi, > Can you let us know what you are using for the output connector? > Thanks, > Karl > > > On Thu, Dec 31, 2020 at 8:24 AM ritika jain > wr

Indexation Not OK

2020-12-31 Thread ritika jain
Hi, I am using Manifoldcf 2.14 and the JCIFS connector to ingest some billions of records into Elasticsearch. I am facing an issue in which, when the job has run for some time, successful indexation happens, but after some time Manifoldcf loops over the records and indexation is not getting OK. [image: image.png]

Mentioning/Connect to more than one Server

2020-09-02 Thread ritika jain
Hi all, Is there a way to connect to/mention more than one server in the server location (URL) field of the Elasticsearch output connector? [image: image.png] Thanks Ritika

Re: WebCrawler Connector code

2020-07-07 Thread ritika jain
ion and abstract method that > would provide a shim for most connectors. > > Karl > > > > > > > On Mon, Jul 6, 2020 at 8:52 AM ritika jain > wrote: > >> Hi All, >> >> I have confusion regarding WebCrawler connector code.My requirement

WebCrawler Connector code

2020-07-06 Thread ritika jain
Hi All, I have some confusion regarding the WebCrawler connector code. My requirement is to abort a job whenever a seed-corresponding site is down or returning some 5xx response codes. So I have used the jobManager errorAbort method for this in the addSeedDocuments method of Webcrawlerconnector.java,

Window shares job-Error ERROR: invalid byte sequence for encoding "UTF8": 0x00

2020-06-03 Thread ritika jain
Hi All, I am using the Windows shares connector, with ES as the output connector and Postgres as the database, in Manifoldcf 2.14. The job is to crawl almost 20 lakh records. When I checked the logs I got this error:- * Worker thread aborting and restarting due to database connection reset: Database exception:

Re: Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-21 Thread ritika jain
connecting? (a network issue) Thanks Ritika On Tue, May 19, 2020 at 2:39 PM Karl Wright wrote: > I commented in the ticket you created. > Thanks, > Karl > > On Tue, May 19, 2020 at 3:07 AM ritika jain > wrote: > >> Hi All, >> >> I am configured Units job (Ma

Error: Repeated service interruptions - failure processing document: Failed to acquire credits in time

2020-05-19 Thread ritika jain
Hi All, I have configured a Units job (Manifoldcf 2.14, ES 7.6.2 and postgres 9.6.10) on a server to access files from a samba SMBv3 server, and loaded jcifs-ng-2.1.2.jar into the lib folder of Manifoldcf. After ingesting some records into the index, I got this error in the logs: Unrecognized

Re: Extraction and storing parent URL while crawling

2020-05-11 Thread ritika jain
Hello Users, Can anybody please respond to this? It would be highly appreciated. On Fri, Apr 3, 2020 at 2:28 PM ritika jain wrote: > Hi All, > I am using Manifoldcf 2.14 to crawl data from a website using Web as Repo > connector and Elastic Search as output connector, > I want

Re: Crawling / Indexation Query

2020-05-07 Thread ritika jain
those documents will be > removed from the index. > > But you can override the robots behavior in the document specification or > configuration, I believe. > > Karl > > > On Thu, May 7, 2020 at 6:27 AM ritika jain > wrote: > >> Hi All, >> >> Can any

Crawling / Indexation Query

2020-05-07 Thread ritika jain
Hi All, Can anybody explain: if a URL was indexed, and afterwards a noindex tag was added, will that URL then be deleted from the index when it is visited again by the crawler? Say a URL previously had the meta tag that requires indexation and was present in the Elastic index, but then afterwards

ES 7.6.2

2020-05-07 Thread ritika jain
Hi, Can anybody please tell me whether Manifoldcf version 2.14 is compatible with Elasticsearch version 7.6.2, as it requires Java 11? Thanks Ritika

Re: Illegal Seed URL

2020-05-06 Thread ritika jain
ou could get that. Have you started > manifoldcf in debug mode? If so, what’s the output just before that > statement in the logs? > > > > -- > > Michael Cizmar > > > > *From: *ritika jain > *Reply-To: *"user@manifoldcf.apache.org" >

Illegal Seed URL

2020-05-05 Thread ritika jain
Hi All, I am using Manifoldcf 2.14 with the Web crawler as repository and Elasticsearch as output. I have specified a seed URL which is valid, as it opens successfully in a browser. Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en

Extraction and storing parent URL while crawling

2020-04-03 Thread ritika jain
Hi All, I am using Manifoldcf 2.14 to crawl data from a website using Web as the repository connector and Elasticsearch as the output connector. I want to get some knowledge about the crawling framework/hierarchy used by the web crawler. As far as I understand, the crawling of the URLs works in the

Way to extract api object from existing job

2020-02-27 Thread ritika jain
Hi, Is there any way to extract the Manifoldcf API-based JSON object of a job created manually from the interface? For example: Job_public has been created in the Manifoldcf user interface (version 2.14); how can we extract all of its configuration as an API-based object as suggested on (
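
One possible approach to the question above is to read the job back through the JSON API. The following is a minimal sketch, not a confirmed recipe: the `/mcf-api-service/json/jobs` path appears elsewhere in this archive, but the per-job form `jobs/<job_id>`, the base URL `http://localhost:8345`, and the helper names `job_url` and `fetch_job` are assumptions that should be checked against the ManifoldCF API documentation for the installed version. Authentication, if the deployment requires it, is not handled.

```python
import json
import urllib.request

# Path taken from the error message quoted elsewhere in this archive.
API_JOBS_PATH = "/mcf-api-service/json/jobs"


def job_url(base_url, job_id=None):
    """Build the jobs URL; with job_id, address a single job (assumed form)."""
    base = base_url.rstrip("/") + API_JOBS_PATH
    return base if job_id is None else f"{base}/{job_id}"


def fetch_job(base_url, job_id):
    """GET the job definition and return it as parsed JSON.

    Assumes the API answers an unauthenticated GET with a JSON body.
    """
    with urllib.request.urlopen(job_url(base_url, job_id)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Base URL and job ID are placeholders for illustration only.
    print(job_url("http://localhost:8345", "1628595470228"))
```

If the call succeeds, the returned object should have the same `job` / `_children_` shape as the dynamic job JSON quoted in the "Window shares dynamic Job issue" thread, so it could serve as a template for jobs created via the API.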

Extraction of related links

2020-02-12 Thread ritika jain
Hi All, I am using Manifoldcf 2.12, Repository as Web connector and Output as ES. As per requirement now, I want to save all related sub-links of a particular document Identifier(at a time). For example :-DocumentId::- www.xyz.com, so I would like to extract all related sublinks say:-