Re: PostgreSQL version to support MCF v2.10

2018-09-04 Thread Olivier Tavard
add postgresql cron lazy vacuum > cron: > name: lazy_vacuum > hour: 8 > minute: 0 > job: "su - postgres -c 'vacuumdb --all --analyze --quiet'" > - name: add postgresql cron full vacuum > cron: > name: full_vacuum
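The two fragments quoted in this thread (here and in the 2018-08-23 message below) appear to be pieces of one Ansible task list for routine PostgreSQL maintenance. Assembled, the tasks would look roughly like this; a sketch reconstructed from the quoted fragments, not a verified playbook:

  - name: add postgresql cron lazy vacuum
    cron:
      name: lazy_vacuum
      hour: 8
      minute: 0
      job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
  - name: add postgresql cron full vacuum
    cron:
      name: full_vacuum
      weekday: 0
      hour: 10
      minute: 0
      job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
  # re-index all databases once a week
  - name: add postgresql cron reindex
    cron:
      name: reindex
      weekday: 0
      hour: 12
      minute: 0
      job: "su - postgres -c 'psql ...'"   # command truncated in the quoted mail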

Re: Install Sharepoint 2013 plugin Failure

2018-09-03 Thread Cheng Zeng
Sent: 03 September 2018 16:38 To: user@manifoldcf.apache.org Subject: Re: Install Sharepoint 2013 plugin Failure I'm afraid this is not something we can fix here, since we do not have a Sharepoint 2013 server setup, and this seems particular to yours specifically. The error you are getting looks

Re: Install Sharepoint 2013 plugin Failure

2018-09-03 Thread Karl Wright
I'm afraid this is not something we can fix here, since we do not have a Sharepoint 2013 server setup, and this seems particular to yours specifically. The error you are getting looks intermittent too: >> This server farm is not available. << Karl On Mon, Sep 3, 2018 at 6:19 AM Cheng

Re: Exception in the running Custom Job

2018-08-29 Thread Karl Wright
Hi Nikita, This code loads the entire document into memory: >> String document = getAconexBody(aconexFile); try { byte[] documentBytes = document.getBytes(StandardCharsets.UTF_8); long fileLength = documentBytes.length; if

Re: Exception in the running Custom Job

2018-08-29 Thread Karl Wright
So the Allowed Document transformer is now working, and your connector is now skipping documents that are too large, correct? But you are still seeing out of memory errors? Does your connector load the entire document into memory before it calls checkLengthIndexable()? Because if it does, that

Re: Exception in the running Custom Job

2018-08-29 Thread Nikita Ahuja
Hi Karl, The result of both the length and the checkLengthIndexable() method is the same. And the Allowed Documents transformer is also working. But the main problem is that the service crashes, displaying a memory leak error every time after crawling a few documents. On Tue, Aug 28, 2018 at 6:48 PM,

Re: Exception in the running Custom Job

2018-08-28 Thread Karl Wright
Can you add logging messages to your connector to log (1) the length that it sees, and (2) the result of checkLengthIndexable()? And then, please once again add the Allowed Documents transformer and set a reasonable document length. Run the job and see why it is rejecting your documents. All of

Re: Exception in the running Custom Job

2018-08-28 Thread Nikita Ahuja
Hi Karl, Thank you for the valuable suggestion. The checkLengthIndexable() value is also used in the code and it returns the exact value for the document length. Garbage collection and thread disposal are also used. On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright wrote: > I don't see

Re: Exception in the running Custom Job

2018-08-28 Thread Karl Wright
I don't see checkLengthIndexable() in this list. You need to add that if you want your connector to avoid trying to index documents that are too big. You said before that when you added the Allowed Documents transformer to the chain it removed ALL documents, so I suspect it's there but

Re: Exception in the running Custom Job

2018-08-28 Thread Nikita Ahuja
Hi Karl, These methods are already in use in the connector code where the file needs to be read and ingested to the output. (!activities.checkURLIndexable(fileUrl)) (!activities.checkMimeTypeIndexable(contentType)) (!activities.checkDateIndexable(modifiedDate)) But this service crashes after

Re: Exception in the running Custom Job

2018-08-24 Thread Karl Wright
Hi Nikita, Until you fix your connector, nothing can be done to address your Out Of Memory problem. The problem is that you are not calling the following IProcessActivity method: /** Check whether a document of a specific length is indexable by the currently specified output connector.
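A minimal sketch of the pattern being described, assuming the single-argument signature suggested by the other check* calls quoted in this thread; the file handle and variable names are illustrative, not the poster's actual code:

  // Ask the output connector up front whether a document of this length
  // is indexable, BEFORE reading any content into memory.
  long fileLength = aconexFile.length();
  if (!activities.checkLengthIndexable(fileLength)) {
    // Too long for the current output connector: record "no document" and skip.
    activities.noDocument(documentIdentifier, versionString);
    return;
  }
  // Only now open a stream and ingest; never materialize the whole byte[] first.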

Re: Exception in the running Custom Job

2018-08-24 Thread Nikita Ahuja
Hi Karl, I have checked for a coding error; there is nothing like that, as "Allowed Documents" works fine for the same code on the other system. But the main issue now being faced is that ManifoldCF shuts down, and it shows "java.lang.OutOfMemoryError: GC overhead limit exceeded" on the

Re: PostgreSQL version to support MCF v2.10

2018-08-23 Thread Steph van Schalkwyk
add postgresql cron full vacuum cron: name: full_vacuum weekday: 0 hour: 10 minute: 0 job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'" # re-index all databases once a week - name: add postgresql cron reindex cron: name: reindex weekday: 0 hour: 12 minute: 0 job: "su - postgres -c 'psql

Re: PostgreSQL version to support MCF v2.10

2018-08-23 Thread Steph van Schalkwyk
I'll publish them in a bit.

Re: [External] Re: Documents that didn't change are reindexed

2018-08-23 Thread Gustavo Beneitez
this might even be a configuration option). You can see this in a browser's debug window when you reload the page a couple of times (Ctrl+F5 to force reloading). -Konrad From: Karl Wright Sent:

Re: [External] Re: Documents that didn't change are reindexed

2018-08-23 Thread Gustavo Beneitez
when you reload the page a couple of times (Ctrl+F5 to force reloading). -Konrad From: Karl Wright Sent: Thursday, 23 August 2018 14:18 To: user@manifoldcf.apache.org Subject: [External] Re: Documents th

AW: [External] Re: Documents that didn't change are reindexed

2018-08-23 Thread Holl, Konrad
couple of times (Ctrl+F5 to force reloading). -Konrad From: Karl Wright Sent: Thursday, 23 August 2018 14:18 To: user@manifoldcf.apache.org Subject: [External] Re: Documents that didn't change are reindexed I would suggest downloading the pages using cu

Re: Documents that didn't change are reindexed

2018-08-23 Thread Karl Wright
I would suggest downloading the pages using curl a couple of times and comparing content. Headers also matter. Here's the code: >> // Calculate version from document data, which is presumed to be present. StringBuilder sb = new StringBuilder(); // Acls

Re: Documents that didn't change are reindexed

2018-08-23 Thread Gustavo Beneitez
Thanks Karl, I've been launching the job a couple of times with a small set of documents, and what I see is that Elasticsearch indexes every document each time, even though the size of the document is always the same and I don't notice any dynamic HTML content (like the current time) that could cause

Re: PostgreSQL version to support MCF v2.10

2018-08-23 Thread Olivier Tavard
types issued by ManifoldCF? Best Regards, Guy From: Karl Wright [mailto:daddy...@gmail.com] Sent: 06 August 2018 12:16 To: user@manifoldcf.apache.org

Re: Documents that didn't change are reindexed

2018-08-22 Thread Karl Wright
Hi Gustavo, I take it from your question that you are using the Web Connector? All connectors create a version string that is used to determine whether content needs to be reindexed or not. The Web Connector's version string uses a checksum of the page contents; we found the "last modified"
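A rough illustration of the idea (not the connector's actual code): hash the fetched bytes and use the hex digest as the version string, so identical content yields an identical version and the framework skips reindexing.

  import java.security.MessageDigest;

  public class ContentVersion {
    // Any stable digest works for change detection; MD5 is just an example.
    static String versionFromContent(byte[] pageBytes) throws Exception {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(pageBytes))
        sb.append(String.format("%02x", b));
      return sb.toString(); // compared against the previously stored version
    }
  }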

Re: Exception in the running Custom Job

2018-08-20 Thread Karl Wright
Obviously your Allowed Documents filter is somehow causing all documents to be excluded. Since you have a custom repository connector I would bet there is a coding error in it that is responsible. Karl On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja wrote: > Hi Karl, > > Thanks for reply. > > I

Re: Exception in the running Custom Job

2018-08-20 Thread Nikita Ahuja
Hi Karl, Thanks for the reply. I am using them in the same sequence: the Allowed Documents transformer is added first and then the Tika transformation. But nothing runs in that scenario; the job simply ends without returning anything in the output. On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright wrote: > Hi,

Re: Exception in the running Custom Job

2018-08-20 Thread Karl Wright
Hi, You are running out of memory. Tika's memory consumption is not well defined so you will need to limit the size of documents that reach it. This is not the same as limiting the size of documents *after* Tika extracts them. The Allowed Documents transformer therefore should be placed in the
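In pipeline terms, the recommended ordering is (the stage names here are generic placeholders for whatever the job actually uses):

  repository connector -> Allowed Documents (length limit) -> Tika extractor -> output connector

This way oversized documents are rejected on their raw length, before Tika ever has to hold them in memory.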

RE: Driver class not found: net.sourceforge.jtds.jdbc.Driver

2018-08-17 Thread Farrenkopf, Sven
It works! Awesome! Thank you! A LOT! Sven From: Karl Wright [mailto:daddy...@gmail.com] Sent: Thursday, August 16, 2018 7:39 PM To: user@manifoldcf.apache.org Subject: Re: Driver class not found: net.sourceforge.jtds.jdbc.Driver Hi Sven, When MCF is built, two entirely distinct versions

Re: Driver class not found: net.sourceforge.jtds.jdbc.Driver

2018-08-16 Thread Karl Wright
Hi Sven, When MCF is built, two entirely distinct versions of the examples are created -- a standard version, and a "proprietary" version. The proprietary version does not in general include any proprietary jars and leaves connectors that depend on them disabled in the connectors.xml file. The
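For illustration only: enabling such a connector means the proprietary example's connectors.xml carries an entry along these lines (the exact class name shown is an assumption for the JDBC connector) and the proprietary jar, e.g. jtds, is placed on the connector classpath:

  <repositoryconnector name="JDBC" class="org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector"/>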

Re: Documentum indexing issue

2018-08-16 Thread Karl Wright
Hi Sharnel, (1) I cannot create a patch unless you create a ticket I can attach it to. (2) I can easily recognize this kind of corruption and allow MCF to skip the document, and I've committed that change (r1838171). However, partially indexing a document that is partially corrupted like this is

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
multiprocess-file-example-proprietary/ sudo cp /opt/manifoldcf_ok/multiprocess-file-example-proprietary/properties.xml /opt/manifoldcf/multiprocess-file-example-proprietary/ sudo service tomcat start I obtained some warnings in th

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
From: Karl Wright Sent: Tuesday, 14 August 2018 15:25 To: user@manifoldcf.apache.org Subject: Re: Different time in Simple History Report There were a number of files committed. On Tue, Aug

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
From: Karl Wright Sent: Tuesday, 14 August 2018 14:17 To: user@manifoldcf.apache.org Subject: Re: Different time in Simple History Report Ok, I committed code that ensures that all times displayed in reports are in the browser client timezone. The same

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
.dk/date.php From: Karl Wright Sent: Tuesday, 14 August 2018 12:20 To: user@manifoldcf.apache.org Subject: Re: Different time in Simple History Report Hi Mario, I did not change h

Re: Using mainfoldCF as a webcrawler with tika and solr

2018-08-14 Thread Karl Wright
Hi Sven, Please have a look at the Simple History report to see what happened to the documents you are interested in. The Web Connector will fetch binary documents no problem, but it sounds like you have something else in your configuration that is causing them to be rejected. The configuration

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
timezone, would be right if they were equal to the “Start time:” filter and to the “Start time” column From: Karl Wright Sent: Tuesday, 14 August 2018 12:04 To: user@manifoldcf.apache.org Subject: Re: Different time in

Re: Different time in Simple History Report

2018-08-14 Thread Karl Wright
the time is 2 hours less than the right time, and the report shows the wrong time relative to the actual time, as you can see in the attachment. So I rolled back to the previous version. From: Karl Wright Sent: Friday, 10 A

Re: Different time in Simple History Report

2018-08-10 Thread Karl Wright
the downloaded trunk directory is very small, whereas the previous trunk was bigger: administrator@sengvivv01:~/mcfsorce$ du -sh tr* 121M trunk 1.8G trunk_19062018 From: Karl Wright Sent: Friday, 10 August 2018 16:47

Re: Different time in Simple History Report

2018-08-10 Thread Karl Wright
Wright Sent: Friday, 10 August 2018 10:53 To: user@manifoldcf.apache.org Subject: Re: Different time in Simple History Report I've committed a change to trunk which will restore the pre-2016 behavior. Karl On Fri, Aug

Re: crawl interrupted

2018-08-10 Thread Gustavo Beneitez
I see. Starting another process, it somehow ends up at the same point: DEBUG 2018-08-10T10:31:08,336 (Thread-38685) - Cancelling request execution No one is pushing the button through the Web User Interface, so I guess I have to look at the source code. On Fri, Aug 10, 2018 at 10:55,

Re: crawl interrupted

2018-08-10 Thread Karl Wright
There is no configuration I know of that is related to this. The connection manager being shut down occurs when the agents process is being shut down or killed. This is not something that MCF will do except when told to. Karl On Fri, Aug 10, 2018 at 4:28 AM Gustavo Beneitez wrote: > Hi

Re: Different time in Simple History Report

2018-08-10 Thread Karl Wright
timezone is set as the browser timezone (Europe/Rome) as you can see, but the list is two hours behind my time zone. So, it seems that the list uses “universal time” instead of the time zone. administrator@sengvivv01:~$ timedatectl

Re: crawl interrupted

2018-08-10 Thread Gustavo Beneitez
Hi again Karl, I've been further investigating this issue, since it happened again today. I was able to capture the instant when the jobs stopped; it seems they receive an abort command. I then investigated the logs (they are centralised with Kibana) and the strangest thing I can see is this log:

Re: Different time in Simple History Report

2018-08-10 Thread Karl Wright
Universal time: Fri 2018-08-10 06:39:28 UTC RTC time: Fri 2018-08-10 06:39:28 Time zone: Europe/Rome (CEST, +0200) System clock synchronized: yes systemd-timesyncd.service active: ye

Re: crawl interrupted

2018-08-09 Thread Gustavo Beneitez
Yes, you are right; maybe I can execute OPTIMIZE queries if the job fails again. Let's just cross fingers. On Thu, Aug 9, 2018 at 12:28, Karl Wright () wrote: > There is no autovacuum for MySQL. MySQL apparently does dead tuple cleanup as it goes. > Karl > On Thu, Aug 9, 2018 at 6:13
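For reference, a manual MySQL cleanup along the lines discussed would be something like the following (the table name is an example; jobqueue is typically ManifoldCF's busiest table):

  -- Rebuild the table and reclaim space left by deleted rows;
  -- on InnoDB, OPTIMIZE TABLE maps to a table rebuild plus analyze.
  OPTIMIZE TABLE jobqueue;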

Re: crawl interrupted

2018-08-09 Thread Karl Wright
There is no autovacuum for MySQL. MySQL apparently does dead tuple cleanup as it goes. Karl On Thu, Aug 9, 2018 at 6:13 AM Gustavo Beneitez wrote: > Hi, > > looking at the manifoldCF pom I can see > > 1.0.4-SNAPSHOT > > I'm not aware of any change in database, in fact ours is MySQL, I don't >

Re: crawl interrupted

2018-08-09 Thread Gustavo Beneitez
Hi, looking at the ManifoldCF pom I can see 1.0.4-SNAPSHOT. I'm not aware of any change in the database; in fact ours is MySQL, and I don't know if the "auto_vacuum" property is present in a MySQL installation. Thanks! On Thu, Aug 9, 2018 at 11:19, msaunier () wrote: > Hi Gustavo, > What is

RE: crawl interrupted

2018-08-09 Thread msaunier
Hi Gustavo, What is your ManifoldCF version? Have you disabled auto_vacuum in your SQL configuration? Maxence, From: Gustavo Beneitez [mailto:gustavo.benei...@gmail.com] Sent: Thursday, 9 August 2018 11:17 To: user@manifoldcf.apache.org Subject: crawl interrupted Hi all,

Re: Job stuck internal http error 500

2018-08-08 Thread Karl Wright
installation and the problem was solved! Now I solved it using the Tika 1.19 nightly build. Thanks a lot. From: Karl Wright Sent: Friday, 27 July 2018 12:39 To: user@manifoldcf.apache.org Subject:

RE: Job stuck internal http error 500

2018-08-08 Thread msaunier
solved it using the Tika 1.19 nightly build. Thanks a lot. From: Karl Wright [mailto:daddy...@gmail.com] Sent: Friday, 27 July 2018 12:39 To: user@manifoldcf.apache.org Subject: Re: Job stuck internal http error 500 I a

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
the first place, or is it just to be expected with the nature of the multiple worker threads and the query types issued by ManifoldCF? Best Regards, Guy From: Karl Wright [mailto:daddy...@gmail.com] Sent: 06 August 2018 12:16 To: user@

RE: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Standen Guy
From: Karl Wright [mailto:daddy...@gmail.com] Sent: 06 August 2018 12:16 To: user@manifoldcf.apache.org Subject: Re: PostgreSQL version to support MCF v2.10 These are exactly the same kind of issue as the first "error" reported. They will be retried. If they did not get retried, they would abo

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
From: Karl Wright [mailto:daddy...@gmail.com] Sent: 06 August 2018 10:52 To: user@manifoldcf.apache.org Subject: Re: PostgreSQL version to support MCF v2.10 Ah, the following errors: 2018-08-03 15:52:25

RE: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Standen Guy
Subject: Re: PostgreSQL version to support MCF v2.10 Ah, the following errors: 2018-08-03 15:52:25.218 BST [4140] ERROR: could not serialize access due to read/write dependencies among transactions 2018-08-03 15:52:25.218 BST [4140] DETAIL: Reason code: Canceled on identif

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
Best Regards, Guy From: Steph van Schalkwyk [mailto:st...@remcam.net] Sent: 03 August 2018 23:21 To: user@manifoldcf.apache.org Subject: Re: PostgreSQL

Re: PostgreSQL version to support MCF v2.10

2018-08-06 Thread Karl Wright
Schalkwyk [mailto:st...@remcam.net] Sent: 03 August 2018 23:21 To: user@manifoldcf.apache.org Subject: Re: PostgreSQL version to support MCF v2.10 I'm using 10.4 with no issues. One or two of the recommended settings for MCF have changed between 9.

Re: PostgreSQL version to support MCF v2.10

2018-08-03 Thread Steph van Schalkwyk
I'm using 10.4 with no issues. One or two of the recommended settings for MCF have changed between 9.6 and 10. Simple to resolve though. Steph On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright wrote: > Hi Guy, > > I use Postgresql 9.6 myself and have found no issues with it. I don't > know about v

Re: Jetty crash

2018-07-31 Thread Karl Wright
How can I debug this? Any idea? Does Jetty have a log file? Regards, msaunier From: Karl Wright [mailto:daddy...@gmail.com] Sent: Tuesday, 31 July 2018 15:32 To: user@manifoldcf.

Re: Jetty crash

2018-07-31 Thread Karl Wright
There must be a reason. Karl On Tue, Jul 31, 2018 at 8:18 AM msaunier wrote: > Hello Karl, > > > > Today and yesterday, I have an error with Jetty. Jetty crash for no reason. > > > > Error: > > ./start.sh : ligne 41 : 562 Processus arrêté "$JAVA_HOME/bin/java" > $OPTIONS

Re: Scheduler not working as we expected

2018-07-31 Thread Karl Wright
Hi Vinay, Dynamic rescan is meant for web-crawling and revisits already crawled documents based on how often they have changed in the past. It is therefore wholly inappropriate for something like a file crawl, since directory contents (one of the kinds of documents there are in a file crawl)

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
Thanks Karl! I applied both patches and we're back up and running! Thanks so much for the help and the patches! Mike On Mon, Jul 30, 2018 at 3:32 PM, Karl Wright wrote: > Ok, attached a second fix. > > Karl > > > On Mon, Jul 30, 2018 at 4:09 PM Karl Wright wrote: > >> Yes, of course. I

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Ok, attached a second fix. Karl On Mon, Jul 30, 2018 at 4:09 PM Karl Wright wrote: > Yes, of course. I overlooked that. Will fix. > > Karl > > > On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo wrote: > >> That limit only applies to the list of transformations, not the list of >> job IDs. If you

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Yes, of course. I overlooked that. Will fix. Karl On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo wrote: > That limit only applies to the list of transformations, not the list of > job IDs. If you follow the code into the next method > > >> > /** Note registration for a batch of

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
That limit only applies to the list of transformations, not the list of job IDs. If you follow the code into the next method >> /** Note registration for a batch of transformation connection names. */ protected void noteTransformationConnectionRegistration(List list) throws

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
The limit is applied in the method that calls noteTransformationConnectionRegistration. Here it is: >> /** Note the registration of a transformation connector used by the specified connections. * This method will be called when a connector is registered, on which the specified *

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
Nice catch Karl! I applied that patch, but I'm still getting the same error. I think the problem is in JobManager.noteTransformationConnectionRegistration. If jobs.findJobsMatchingTransformations(list); returns a large list of ids (like it is doing in our case - 39,941 ids), the generated

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
The Postgresql driver supposedly limits this to 25 clauses at a pop: >> @Override public int getMaxOrClause() { return 25; } /* Calculate the number of values a particular clause can have, given the values for all the other clauses. * For example, if in the expression x AND y
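A hedged sketch of what honoring that limit looks like on the calling side (a hypothetical helper, not the actual MCF fix): split the ID list into chunks of at most getMaxOrClause() values and issue one query per chunk.

  import java.util.ArrayList;
  import java.util.List;

  public class BatchedIdQuery {
    // Split a large ID list so each generated "id=? OR id=? ..." clause
    // stays within the driver's per-query limit.
    static <T> List<List<T>> chunk(List<T> ids, int maxOrClause) {
      List<List<T>> batches = new ArrayList<>();
      for (int i = 0; i < ids.size(); i += maxOrClause)
        batches.add(new ArrayList<>(ids.subList(i, Math.min(i + maxOrClause, ids.size()))));
      return batches;
    }
  }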

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Hi Mike, This might be the issue indeed. I'll look into it. Karl On Mon, Jul 30, 2018 at 2:26 PM Mike Hugo wrote: > I'm not sure what the solution is yet, but I think I may have found the > culprit: > > JobManager.noteTransformationConnectionRegistration(List list) is > creating a pretty

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
I'm not sure what the solution is yet, but I think I may have found the culprit: JobManager.noteTransformationConnectionRegistration(List list) is creating a pretty big query: SELECT id,status FROM jobs WHERE (id=? OR id=? OR id=? OR id=? OR id=?) FOR UPDATE replace the ellipsis with

Re: Scheduling Problem and the IBM Domino Connector

2018-07-30 Thread Karl Wright
I have a question about the schedule-related configuration in the job. I have a continuously running job which crawls the documents in Sharepoint 2013, and the job is supposed to re-crawl about 26,000 docs every 24 hours as configured; however, it seems that

Re: Scheduling Problem and the IBM Domino Connector

2018-07-30 Thread Cheng Zeng
Karl On Mon, Jul 30, 2018 at 4:35 AM Cheng Zeng wrote: Hi Karl, I have a question about the schedule-related configuration in the job. I have a continuously running job which crawls the documents in Sharepoint 2013 and the job is supposed to re-crawl about 26,00

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
Well, I have absolutely no idea what is wrong and I've never seen anything like that before. But postgres is complaining because the communication with the JDBC client is being interrupted by something. Karl On Mon, Jul 30, 2018 at 10:39 AM Mike Hugo wrote: > No, and manifold and postgres

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
No, and manifold and postgres run on the same host. On Mon, Jul 30, 2018 at 9:35 AM, Karl Wright wrote: > ' LOG: incomplete message from client' > > This shows a network issue. Did your network configuration change > recently? > > Karl > > > On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo wrote: >

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Karl Wright
' LOG: incomplete message from client' This shows a network issue. Did your network configuration change recently? Karl On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo wrote: > Tried a postgres vacuum and also a restart, but the problem persists. > Here's the log again with some additional

Re: PSQLException: This connection has been closed.

2018-07-30 Thread Mike Hugo
Tried a postgres vacuum and also a restart, but the problem persists. Here's the log again with some additional logging details added (below) I tried running the last query from the logs against the database and it works fine - I modified it to return a count and that also works. SELECT count(*)

Re: Scheduling Problem

2018-07-30 Thread Karl Wright
AM Cheng Zeng wrote: Hi Karl, I have a question about the schedule-related configuration in the job. I have a continuously running job which crawls the documents in Sharepoint 2013 and the job is supposed to re-crawl about 26,000 docs every 24 hours as configur

Re: PSQLException: This connection has been closed.

2018-07-29 Thread Karl Wright
It looks to me like your database server is not happy. Maybe it's out of resources? Not sure but a restart may be in order. Karl On Sun, Jul 29, 2018 at 9:06 AM Mike Hugo wrote: > Recently we started seeing this error when Manifold CF starts up. We had > been running Manifold CF with many

Re: Exclude files ~$*

2018-07-27 Thread Karl Wright
Can you view the job and include a screen shot of where this is displayed? Thanks. The exclusions are not regexps -- they are file specs. The file specs have special meanings for "*" (matches everything) and "?" (matches one character). You do not need to URL encode them. If you enable
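For the case in the subject line, a file-spec exclusion of

  ~$*

would match any file whose name starts with "~$" (the Office owner/lock files, e.g. ~$Report.docx), with no URL encoding required.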

Re: Tika/POI bugs

2018-07-27 Thread Karl Wright
To solve your production problem I highly recommend limiting the size of the docs fed to Tika, for a start. But that is no guarantee, I understand. Out of memory problems are very hard to get good forensics for because they cause major disruptions to the running server. You could turn on a

RE: Tika/POI bugs

2018-07-27 Thread msaunier
Hi Karl, Okay. For the Out of Memory: this is the last day I can spend finding out where the error comes from. After that, I have to go into production to meet my deadlines. I hope to find time in the future to fix this problem on this server; otherwise I could not index

Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
set: sudo nano options.env.unix -Xms2048m -Xmx2048m But I obtain the same error. My doubt is that it could be a Solr/Tika problem. What could I do? I restricted the scan to a single file and I obtain the same error

Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
Although it is not clear which process you are talking about. If it's Solr, ask them. Karl On Fri, Jul 27, 2018, 5:36 AM Karl Wright wrote: > I am presuming you are using the examples. If so, edit the options file to grant more memory to your agents process by increasing the Xmx value. > Karl >

Re: Job stuck internal http error 500

2018-07-27 Thread Karl Wright
I am presuming you are using the examples. If so, edit the options file to grant more memory to your agents process by increasing the Xmx value. Karl On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario wrote: > Hallo. > My job is stuck indexing an xlsx file of 38MB > What could I do to

Re: Solr connection, max connections and CPU

2018-07-27 Thread Bisonti Mario
Thanks a lot Karl!!! On 2018/07/26 13:28:47, Karl Wright wrote: > Hi Mario, > There is no connection between the number of CPUs and the number of output connections. You pick the maximum number of output connections based on the number of listening threads that you can use at the same

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
On Thu, Jul 26, 2018 at 11:09 AM msaunier wrote: > On the repository connection, I have added « 20971520 » as the max document size. > Maxence From: Karl Wright [mailto:daddy...@gmail.com] Sent: Thursday, 26 July 2018 17:07 To: us

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
From: Karl Wright [mailto:daddy...@gmail.com] Sent: Thursday, 26 July 2018 16:23 To: user@manifoldcf.apache.org Subject: Re: ***UNCHECKED*** Re: Out of memory, one file bug i think I believe there's also a content length tab in the Windows Share

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
Karl Wright [mailto:daddy...@gmail.com] Sent: Wednesday, 25 July 2018 19:15 To: user@manifoldcf.apache.org Subject: ***UNCHECKED*** Re: Out of memory, one file bug i think It looks like you are still running out of memory. I

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
Maxence, From: Karl Wright [mailto:daddy...@gmail.com] Sent: Wednesday, 25 July 2018 19:15 To: user@manifoldcf.apache.org Subject: ***UNCHECKED*** Re: Out of memory, one file bug i think It looks like you are still run

Re: Solr connection, max connections and CPU

2018-07-26 Thread Karl Wright
Hi Mario, There is no connection between the number of CPUs and the number output connections. You pick the maximum number of output connections based on the number of listening threads that you can use at the same time in Solr. Karl On Thu, Jul 26, 2018 at 9:22 AM Bisonti Mario wrote: >

Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
JBIG2ImageReader not loaded. jbig2 files will be ignored. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies. TIFFImageWriter not loaded. tiff files will not be processed. See https://pdfbox.apache.org/2.0/d

Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
optional dependencies. J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies. juil. 26, 2018 11:29:01 AM org.apache.tika.config.InitializableProblemHandler$3

Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Here's the documentation from HttpClient on the various cookie policies. You're probably going to need to read some of the RFCs to see which policy you want. I will wait for you to get back to me with a recommendation before taking any action in the MCF codebase. Thanks!
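For orientation, selecting a cookie policy in HttpClient 4.x looks like this (an illustrative sketch, not MCF code; under the RFC 6265 "standard" policy, a cookie whose Domain attribute is z.com also matches sub-domains such as x.y.z.com, which is the behavior discussed in this thread):

  import org.apache.http.client.config.CookieSpecs;
  import org.apache.http.client.config.RequestConfig;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;

  public class CookiePolicyExample {
    public static CloseableHttpClient clientWithStandardCookies() {
      // STANDARD selects RFC 6265 cookie handling, including its
      // domain-matching rules for sub-domains.
      RequestConfig config = RequestConfig.custom()
          .setCookieSpec(CookieSpecs.STANDARD)
          .build();
      return HttpClients.custom().setDefaultRequestConfig(config).build();
    }
  }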

Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Ok, so the database for your site crawl contains both z.com and x.y.z.com cookies? And your site pages from domain a.y.z.com receive no cookies at all when fetched? Is that a correct description of the situation? Please verify that the a.y.z.com pages are part of the protected part of your

Re: web crawler not sharing cookies

2018-07-26 Thread Gustavo Beneitez
Hi, the database may contain Z.com and X.Y.Z.com if created automatically through a JSP, but not the intermediate one, Y.Z.com. If the crawler decides to go to A.Y.Z.com and, looking at the database, Z.com is present, it still doesn't work (it should, since A.Y.Z is a sub-domain of Z). Only doing that

Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
The web connector, though, does not filter any cookies. It takes them all -- whatever cookies HttpClient is storing at that point. So you should see all the cookies in the database table, regardless of their site affinity, unless HttpClient is refusing to accept a cookie for security reasons.

Re: web crawler not sharing cookies

2018-07-25 Thread Gustavo Beneitez
I agree, but the fact is that if my "login sequence" defines a login credential for domain "Z.com" and the crawler reaches "Y.Z.com" or "X.Y.Z.com", none of the sub-sites receives that cookie; I need to write the same cookie for every sub-domain, which solves the situation (and thankfully is a

Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
daddy...@gmail.com] Sent: Wednesday, 25 July 2018 19:18 To: user@manifoldcf.apache.org Subject: Re: Speed up cleaning up job I'm sorry, I don't understand your question? Karl On Wed, Jul 25, 2018 at 12:53 PM

RE: Speed up cleaning up job

2018-07-25 Thread msaunier
Wright [mailto:daddy...@gmail.com] Sent: Wednesday, 25 July 2018 19:18 To: user@manifoldcf.apache.org Subject: Re: Speed up cleaning up job I'm sorry, I don't understand your question? Karl On Wed, Jul 25, 2018 at 12:53 PM msaunier wrote: H

Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
I'm sorry, I don't understand your question? Karl On Wed, Jul 25, 2018 at 12:53 PM msaunier wrote: > Hi Karl, > Can I configure ManifoldCF to clean up faster? I think ManifoldCF cleans 100 at a time by default. > Maxence,

Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
You should not need to fill the database by hand. Your login sequence should include whatever redirection etc is used to set the cookies though. Karl On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez wrote: > Hi again, > > Thanks Karl, I was able of doing that after defining some "login >

***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?] at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?] at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?] at org.apache.manifoldcf.crawler.sys

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
~[?:?] at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?] Maxence, From: Karl Wright [mailto:daddy...@gmail.com] Sent: Wednesday, 25 July 2018 13:12
