Hi All,
When ManifoldCF starts with start.jar, it creates an entry in the system's tmp
folder, but the entry is not cleaned up automatically when the server/ManifoldCF
stops.
On my live server I am using a dockerised environment, where we have a
mechanism that restarts ManifoldCF whenever required, every time
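One possible workaround for the leftover tmp entries, sketched here under
assumptions: run a small cleanup step before each restart that deletes stale
matching entries from the tmp folder. The `jetty-*` pattern, the one-hour
threshold and the name `remove_stale_entries` are hypothetical and not taken
from ManifoldCF; check what the leftover entries are actually called on your
system, and run this only while ManifoldCF is stopped.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def remove_stale_entries(tmp_dir, pattern, max_age_seconds):
    """Delete entries in tmp_dir that match pattern and are older than max_age_seconds.

    Returns the sorted names of the entries that were removed.
    """
    cutoff = time.time() - max_age_seconds
    removed = []
    for entry in Path(tmp_dir).glob(pattern):
        if entry.stat().st_mtime > cutoff:
            continue  # modified recently, possibly still in use
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)


if __name__ == "__main__":
    # Demo in a throwaway directory; the entry names are made up.
    with tempfile.TemporaryDirectory() as demo:
        stale = Path(demo) / "jetty-old"
        stale.mkdir()
        fresh = Path(demo) / "jetty-new"
        fresh.mkdir()
        an_hour_ago = time.time() - 3600
        os.utime(stale, (an_hour_ago, an_hour_ago))  # pretend it is an hour old
        print(remove_stale_entries(demo, "jetty-*", 600))
```

In a Docker setup this could run as a step in the container's entrypoint
before start.jar is launched, so that each restart begins with a clean tmp
folder.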
Can anybody provide any clue on this? It would be of great help.
On Thu, Dec 22, 2022 at 5:33 PM ritika jain
wrote:
> Hi all,
>
> I am using Manifoldcf 2.21 version with Windows shares connector and
> Output as Elastic.
> I am facing this error while clicking "List all jobs
Hi all,
I am using ManifoldCF version 2.21 with the Windows shares connector and Elastic
as the output.
I am facing this error while clicking "List all jobs". In ManifoldCF, jobs
are run/created in such a way that our API creates a ManifoldCF job
object and thereby creates/starts a job in ManifoldCF
Hi All,
I have a Windows shares job that crawls files from a Samba server. It is a huge
job, crawling millions of documents (about 10 million). While running the job, we
encounter two types of errors very frequently.
1) WARN 2022-08-19T17:17:05,175 (Worker thread '7') - JCIFS: Possibly
transient exception
Hi All,
With the Windows shares connector, I am getting this exception on the server,
and due to repeated service interruptions the *job stops.*
Error: Repeated service interruptions - failure processing document: The
process cannot access the file because it is being used by another process.
How we
Hi,
How does ManifoldCF use the log4j files in the bin directory/distribution? If this is
the location "D:\\Manifoldcf\apache-manifoldcf-2.14\lib", that is the lib
folder only (for physical file presence).
Also if the log4j dependency issue has been resolved and the version 2.15
or higher is updated, then
Hi,
I am using ManifoldCF 2.14, the Web connector and Elastic as the output.
I have observed that after a certain period of continuous running the job freezes
and does not process anything. Simple History shows nothing after a
certain point, and it is not just one job; this has been observed for 3
different jobs.
Hi All,
How does ManifoldCF use log4j? When I checked the pom.xml of the ES connector, it
is shown as an *exclusion* of a Maven dependency.
[image: image.png]
But when I checked the project's downloaded dependencies, it shows up as being
used and downloaded.
[image: image.png]
How does ManifoldCF use log4j
Hi All,
Can we create two different username/password pairs for the crawler UI of ManifoldCF?
I tried configuring two user profiles in properties.xml, but it's not
working.
Is there a way to do that?
Thanks
Ritika
Hi All,
I would like to understand the background process of ManifoldCF Windows
shares jobs, and how it processes the paths mentioned in the job
configuration.
I am creating a dynamic job via the API using PHP which will pick up approx. 70k
documents, and a dynamic job with 70k different paths
eue.
> Karl
>
>
> On Tue, Nov 9, 2021 at 7:08 AM ritika jain
> wrote:
>
>> I have checked, there is only one hour time difference between docker
>> container and docker host
>>
>> On Tue, Nov 9, 2021 at 4:41 PM Karl Wright wrote:
>>
>>> If
F uses
> that to manage throttling etc. I don't know if that is the correct
> explanation but it's the only thing I can think of.
>
> Karl
>
>
> On Tue, Nov 9, 2021 at 4:56 AM ritika jain
> wrote:
>
>>
>> Hi All,
>>
>> I am using window share
Hi All,
I am using the Windows shares connector, ManifoldCF 2.14 and ES as the output. I
have configured a job to process 60k documents. These documents are
new and do not have corresponding values in the DB or the ES index.
So ideally it should process/index the documents as soon as the job starts.
So, can it be left as it is? It is preventing the job from completing,
and the job is stopping.
On Tue, Oct 26, 2021 at 8:40 PM Karl Wright wrote:
> That's a database bug. All of our underlying databases have some bugs of
> this kind.
>
> Karl
>
>
> On Tue, Oct 26, 2021 a
Hi All,
We are using ManifoldCF 2.14 with the Web connector and the ES connector. After
the job had been running for a certain time (the jobs ingest documents in lakhs),
we got this error on PROD.
Can anybody suggest what could be the problem?
PRODUCTION MANIFOLD ERROR:
Error: ERROR: duplicate key value
Hi,
Is there any limit on the number of paths we can define in a job using
Windows Shares as the repository and ES as the output?
Thanks
Hi,
I am getting null pointer exceptions while creating a job programmatically
via PHP.
Can anybody suggest the reason for this?
Error 500 Server Error
HTTP ERROR 500 Problem accessing /mcf-api-service/json/jobs. Reason: Server Error
Caused by: java.lang.NullPointerException at
Hi All,
I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika
dependencies in ManifoldCF version 2.14.
I am getting some issues while parsing *PDF* files. Some strange characters
appear. I also tried changing the Tika jar file versions to 1.24 and 1.27 (these
didn't even extract the files correctly).
Does anybody have a clue on this?
On Fri, Aug 20, 2021 at 12:33 PM ritika jain
wrote:
> Hi All,
>
> I am having a query , is there any way using which we can mention
> subdirectories' path in the file spec of Window shares connector.
>
> Like my requirement is to mention Mos
Hi All,
I have a query: is there any way to specify a
subdirectory path in the file spec of the Windows shares connector?
My requirement is to specify the topmost folder in the hierarchy at the top, as
shown in the screenshot.
And in the file spec the requirement is to mention the file name
Hi All,
When we delete a job in ManifoldCF, does it also delete the documents indexed
via that job from the Elastic index?
I understand that when a job is deleted from the ManifoldCF interface it will
delete all the documents referenced via that job from Postgres. But why is
it deleted from
Seems to be working now!!! Thanks a lot !!!
On Wed, Aug 11, 2021 at 6:22 PM ritika jain
wrote:
> Hi ,
>
> Yes this works only the difference is when a single file is ingested we
> are having ingested one as C:/Users/Dell/Desktop/abc.txt/.-with a UNWANTED
> slash at end
>
>
file name.
>
> Karl
>
>
> On Wed, Aug 11, 2021 at 2:14 AM ritika jain
> wrote:
>
>> *Dynamic Job *
>>
>> {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description"
ute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\/*.dotx","_value_":"","_attribute_type":"file"},{"_attribute_indexable":"yes","_attribute_filespec":"\
Hi All,
I am using the Windows shares connector in ManifoldCF version 2.14 and Elastic
as the output.
I have created a dynamic ManifoldCF job API via which a job will be created
in ManifoldCF with an inclusions list and a path; only a particular file path is
to be mentioned. Example file path:-
>
> Is the process the same when fetch/process status code returned is 302 ?
>>> When running a job with web crawler and ES output connector
>>>
>>
Does anybody have a clue about this?
L onto the document
> queue.
> When it gets to the new URL, it processes it like any other.
>
> Karl
>
>
> On Wed, May 19, 2021 at 8:32 AM ritika jain
> wrote:
>
>> Hi
>>
>> I want to understand the process of "How does manifold.cf handles
Hi
I want to understand how ManifoldCF handles URL redirection
in the case of the web crawler connector.
If there is a page redirect (through a 301) to another URL, then the next
crawl will detect the redirect and index the new (final) URL and display it
in the search results.
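To make the redirect flow concrete, here is a minimal sketch of the behaviour
described in Karl's reply in this thread: the redirect target is put onto the
document queue and then processed like any other URL. This is not ManifoldCF's
actual code; `crawl`, `fake_fetch`, the `PAGES` table and the example.com URLs
are all hypothetical, and only 301 is handled.

```python
from collections import deque
from urllib.parse import urljoin


def crawl(seeds, fetch):
    """Process a queue of URLs; fetch(url) returns (status, location_or_None).

    Returns the list of URLs that were indexed.
    """
    queue = deque(seeds)
    seen = set(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        status, location = fetch(url)
        if status == 301 and location:
            # Do not index the redirecting URL; queue its target instead.
            target = urljoin(url, location)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        elif status == 200:
            # The final URL is indexed like any other document.
            indexed.append(url)
    return indexed


# Hypothetical site: /old permanently redirects to /new.
PAGES = {
    "http://example.com/old": (301, "/new"),
    "http://example.com/new": (200, None),
}


def fake_fetch(url):
    """Stand-in for an HTTP request so the sketch runs without a network."""
    return PAGES.get(url, (404, None))


print(crawl(["http://example.com/old"], fake_fetch))
```

In this sketch only the final URL ends up in the indexed list; the redirecting
URL is visited but never indexed.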
Karl Wright wrote:
> '
>
> *JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security'Yes, it is going to retry.*
>
> *Karl*
>
> On Fri, May 14, 2021 at 1:45 AM ritika jain
> wrote:
>
>> Hi,
>> I am using Windows shares conn
Hi,
I am using the Windows shares connector in ManifoldCF 2.14, with the ElasticSearch
connector as the output connector and Tika and Metadata Adjuster as
transformation connectors.
I am trying to crawl files from an SMB server. The server has 64 GB of memory, and
the start options file of ManifoldCF is given 32GB of memory
Hi All,
I want to understand the ManifoldCF deletion process, i.e. in which
cases the deletion process (as seen in Simple History) executes.
One case, as far as I know, is when the seed URL of a particular
job is changed.
What are all the cases in which the deletion process runs?
My
loaded into memory. That is why we require you to fill in a Solr field
> on those kind of output connections that limits the number of bytes.
>
> Karl
>
>
> On Tue, Feb 16, 2021 at 8:45 AM ritika jain
> wrote:
>
>>
>>
>> Hi users
>>
>>
>> I a
Hi users
I am using the ManifoldCF 2.14 file share connector to crawl files from an SMB
server which has millions to billions of records to process and
crawl.
Total system memory is 64GB, of which 32GB is assigned to ManifoldCF in the
start options file.
We have some larger files to crawl around
Hi Users,
Can anybody tell me whether this value is to be entered in bytes or kilobytes?
The "Content Length" tab looks like this:
[image: Windows Share Job, Content Length tab]
If the value is entered as 100, will this be 100 bytes, 100 kilobytes or
100 MB?
Thanks
Ritika
Elastic search output connector with some custom changes for some fields
On Thursday, December 31, 2020, Karl Wright wrote:
> Hi,
> Can you let us know what you are using for the output connector?
> Thanks,
> Karl
>
>
> On Thu, Dec 31, 2020 at 8:24 AM ritika jain
> wr
Hi,
I am using ManifoldCF 2.14 and the JCIFS connector to ingest some billions of
records into Elasticsearch.
I am facing an issue: when the job has run for some time, indexation is
successful, but after some time ManifoldCF loops over the records and
indexation does not return OK.
[image: image.png]
Hi all,
Is there a way to connect/mention more than one server in the server
location (URL) field of the Elasticsearch output connector?
[image: image.png]
Thanks
Ritika
ion and abstract method that
> would provide a shim for most connectors.
>
> Karl
>
>
>
>
>
>
> On Mon, Jul 6, 2020 at 8:52 AM ritika jain
> wrote:
>
>> Hi All,
>>
>> I have confusion regarding WebCrawler connector code.My requirement
Hi All,
I am confused about the WebCrawler connector code. My requirement is to
abort a job whenever the site corresponding to a seed is down or returning
5xx response codes.
So I have used the jobManager errorAbort method for this
in the addSeedDocuments method of Webcrawlerconnector.java..,
Hi All,
I am using the Windows shares connector, with ES as the output connector and
Postgres as the database, in ManifoldCF 2.14.
The job is to crawl almost 20 lakh records.
When I checked the logs, I got this error:-
* Worker thread aborting and restarting due to database connection reset:
Database exception:
connecting? (a network issue)
Thanks
Ritika
On Tue, May 19, 2020 at 2:39 PM Karl Wright wrote:
> I commented in the ticket you created.
> Thanks,
> Karl
>
> On Tue, May 19, 2020 at 3:07 AM ritika jain
> wrote:
>
>> Hi All,
>>
>> I am configured Units job (Ma
Hi All,
I have configured a Units job (ManifoldCF 2.14, ES 7.6.2 and Postgres
9.6.10) on a server to access files from a Samba SMBv3 server, and used
jcifs-ng-2.1.2.jar, loaded in the lib folder of ManifoldCF.
After ingesting some records into the index, I got this error in the logs:-
Unrecognized
Hello Users,
Can anybody please reply on this? It would be highly appreciated.
On Fri, Apr 3, 2020 at 2:28 PM ritika jain wrote:
> Hi All,
> I am using Manifoldcf 2.14 to crawl data from a website using Web as Repo
> connector and Elastic Search as output connector,
> I want
those documents will be
> removed from the index.
>
> But you can override the robots behavior in the document specification or
> configuration, I believe.
>
> Karl
>
>
> On Thu, May 7, 2020 at 6:27 AM ritika jain
> wrote:
>
>> Hi All,
>>
>> Can any
Hi All,
Can anybody explain:
If a URL was indexed, and afterwards a noindex tag was added, will that
URL then be deleted from the index when it is visited again by the crawler?
Say a URL previously had a meta tag requesting indexation and was
present in the Elastic index, but then afterwards
Hi,
Can anybody please tell me whether ManifoldCF version 2.14 is compatible
with Elasticsearch version 7.6.2, as it requires Java 11?
Thanks
Ritika
ou could get that. Have you started
> manifoldcf in debug mode? If so, what’s the output just before that
> statement in the logs?
>
>
>
> --
>
> Michael Cizmar
>
>
>
> *From: *ritika jain
> *Reply-To: *"user@manifoldcf.apache.org"
>
Hi All,
I am using ManifoldCF 2.14 with the Web crawler as the repository and
Elasticsearch as the output. I have specified a seed URL which is valid, as it opens
successfully in a browser.
Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
Hi All,
I am using ManifoldCF 2.14 to crawl data from a website, using Web as the repository
connector and Elasticsearch as the output connector.
I want to understand the crawling framework/hierarchy used by
the web crawler.
As far as I understand, the crawling of the URLs works in the
Hi,
Is there any way to extract the ManifoldCF API-based JSON object of a job created
manually in the interface?
For example:-
Job_public has been created in the ManifoldCF user interface (version 2.14); how
can we extract all of its configuration as an API-based object, as suggested on (
Hi All,
I am using ManifoldCF 2.12, with the Web connector as the repository and ES as the output.
As per a new requirement, I want to save all related sub-links of a
particular document identifier (one at a time). For example, for DocumentId
www.xyz.com, I would like to extract all related sub-links, say:-