- name: add postgresql cron lazy vacuum
  cron:
    name: lazy_vacuum
    hour: 8
    minute: 0
    job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
- name: add postgresql cron full vacuum
  cron:
    name: full_vacuum
Sent: 03 September 2018 16:38
To: user@manifoldcf.apache.org
Subject: Re: Install Sharepoint 2013 plugin Failure
I'm afraid this is not something we can fix here, since we do not have a
Sharepoint 2013 server setup, and this seems particular to yours
specifically.
The error you are getting looks intermittent too:
>>
This server farm is not available.
<<
Karl
On Mon, Sep 3, 2018 at 6:19 AM Cheng
Hi Nikita,
This code loads the entire document into memory:
>>
String document = getAconexBody(aconexFile);
try
{
  byte[] documentBytes = document.getBytes(StandardCharsets.UTF_8);
  long fileLength = documentBytes.length;
  if
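A memory-safer shape for that code is to hand MCF a stream plus a length rather than materializing the whole body as a String first. A minimal sketch, assuming the Aconex source can expose an InputStream; the method and parameter names are hypothetical, while RepositoryDocument.setBinary() and ingestDocumentWithException() are the standard MCF calls:
>>
import java.io.InputStream;
import org.apache.manifoldcf.agents.interfaces.RepositoryDocument;
import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;

// Hypothetical rework: the caller supplies the content as a stream and a
// length (e.g. from file metadata), so the body is never buffered whole.
void ingestStreamed(IProcessActivity activities, String documentIdentifier,
    String versionString, String documentURI, InputStream content,
    long length) throws Exception
{
  RepositoryDocument rd = new RepositoryDocument();
  rd.setBinary(content, length);  // MCF consumes the stream during ingestion
  activities.ingestDocumentWithException(documentIdentifier, versionString,
      documentURI, rd);
}
<<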
So the Allowed Document transformer is now working, and your connector is
now skipping documents that are too large, correct? But you are still
seeing out of memory errors?
Does your connector load the entire document into memory before it calls
checkLengthIndexable()? Because if it does, that
Hi Karl,
The result for both the length and the checkLengthIndexable() method is the
same, and the Allowed Documents transformer is also working. But the main
problem is that the service crashes, displaying a memory-leak error every
time after crawling a few sets of documents.
On Tue, Aug 28, 2018 at 6:48 PM,
Can you add logging messages to your connector to log (1) the length that
it sees, and (2) the result of checkLengthIndexable()? And then, please
once again add the Allowed Documents transformer and set a reasonable
document length. Run the job and see why it is rejecting your documents.
All of
Hi Karl,
Thank you for the valuable suggestion.
The checkLengthIndexable() value is also used in the code, and it returns
the exact value for the document length.
Garbage collection and thread disposal are also used.
On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright wrote:
> I don't see
I don't see checkLengthIndexable() in this list. You need to add that if
you want your connector to avoid trying to index documents that are too
big.
You said before that when you added the Allowed Documents transformer to
the chain it removed ALL documents, so I suspect it's there but
Hi Karl,
These methods are already in use in the connector code where the file needs
to be read and ingested into the output:
(!activities.checkURLIndexable(fileUrl))
(!activities.checkMimeTypeIndexable(contentType))
(!activities.checkDateIndexable(modifiedDate))
But this service crashes after
Hi Nikita,
Until you fix your connector, nothing can be done to address your Out Of
Memory problem.
The problem is that you are not calling the following IProcessActivity
method:
/** Check whether a document of a specific length is indexable by the
currently specified output connector.
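For reference, a sketch of how a connector would typically consult that check before fetching any content; checkLengthIndexable() and noDocument() are the IProcessActivity calls discussed in this thread, and the length is assumed to come from metadata (a stat or HEAD request), not from reading the body:
>>
import org.apache.manifoldcf.crawler.interfaces.IProcessActivity;

// Ask the output pipeline up front whether this length is acceptable, so
// oversized documents are skipped before any bytes are buffered.
void processOne(IProcessActivity activities, String documentIdentifier,
    String versionString, long fileLength) throws Exception
{
  if (!activities.checkLengthIndexable(fileLength))
  {
    // Record that the document is deliberately not being indexed.
    activities.noDocument(documentIdentifier, versionString);
    return;
  }
  // ...only now fetch and ingest the actual content...
}
<<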
Hi Karl,
I have checked for a coding error; there is nothing like that, as "Allowed
Document" is working fine with the same code on another system.
But the main issue now being faced is the shutting down of ManifoldCF, and
it shows "java.lang.OutOfMemoryError: GC overhead limit exceeded" on the
- name: add postgresql cron full vacuum
  cron:
    name: full_vacuum
    weekday: 0
    hour: 10
    minute: 0
    job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
# re-index all databases once a week
- name: add postgresql cron reindex
  cron:
    name: reindex
    weekday: 0
    hour: 12
    minute: 0
    job: "su - postgres -c 'psql
I'll publish them in a bit.
(this might even be a configuration option). You can see this in a
browser's debug window when you reload the page a couple of times (Ctrl+F5
to force reloading).
-Konrad
From: Karl Wright
Sent: Thursday, 23 August 2018 14:18
To: user@manifoldcf.apache.org
Subject: [External] Re: Documents that didn't change are reindexed
I would suggest downloading the pages using curl a couple of times and
comparing content.
Headers also matter. Here's the code:
>>
// Calculate version from document data, which is presumed to be present.
StringBuilder sb = new StringBuilder();
// Acls
Thanks Karl,
I've been launching the job a couple of times with a small set of documents,
and what I see is that Elasticsearch indexes each document every time, even
though the size of the document is always the same and I don't notice any
dynamic HTML content (like the current time) that could cause
query types issued by ManifoldCF?
> Best Regards,
> Guy
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: 06 August 2018 12:16
> To: user@manifoldcf.apache.org
Hi Gustavo,
I take it from your question that you are using the Web Connector?
All connectors create a version string that is used to determine whether
content needs to be reindexed or not. The Web Connector's version string
uses a checksum of the page contents; we found the "last modified"
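As an illustration of that approach, a condensed sketch of a content-checksum version string (MD5 here is purely illustrative; the actual connector code, excerpted earlier in this thread, also folds in ACLs and selected headers):
>>
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Derive a version string from a digest of the fetched page bytes: if the
// bytes change, the version changes and the page is reindexed; otherwise
// the document is skipped as unchanged.
static String versionFromContent(byte[] pageBytes)
    throws NoSuchAlgorithmException
{
  MessageDigest md = MessageDigest.getInstance("MD5");
  StringBuilder sb = new StringBuilder();
  for (byte b : md.digest(pageBytes))
    sb.append(String.format("%02x", b));
  return sb.toString();
}
<<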
Obviously your Allowed Documents filter is somehow causing all documents to
be excluded. Since you have a custom repository connector I would bet
there is a coding error in it that is responsible.
Karl
On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja wrote:
> Hi Karl,
>
> Thanks for reply.
>
> I
Hi Karl,
Thanks for the reply.
I am using them in the same sequence: the Allowed Documents transformer is
added first, and then the Tika transformation.
But nothing runs in that scenario; the job simply ends without returning
anything in the output.
On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright wrote:
> Hi,
Hi,
You are running out of memory.
Tika's memory consumption is not well defined so you will need to limit the
size of documents that reach it. This is not the same as limiting the size
of documents *after* Tika extracts them.
The Allowed Documents transformer therefore should be placed in the
It works! Awesome! Thank you! A LOT!
Sven
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Thursday, August 16, 2018 7:39 PM
To: user@manifoldcf.apache.org
Subject: Re: Driver class not found: net.sourceforge.jtds.jdbc.Driver
Hi Sven,
When MCF is built, two entirely distinct versions
Hi Sven,
When MCF is built, two entirely distinct versions of the examples are
created -- a standard version, and a "proprietary" version. The
proprietary version does not in general include any proprietary jars and
leaves connectors that depend on them disabled in the connectors.xml file.
The
Hi Sharnel,
(1) I cannot create a patch unless you create a ticket I can attach it to.
(2) I can easily recognize this kind of corruption and allow MCF to skip
the document, and I've committed that change (r1838171). However,
partially indexing a document that is partially corrupted like this is
multiprocess-file-example-proprietary/
> sudo cp
> /opt/manifoldcf_ok/multiprocess-file-example-proprietary/properties.xml
> /opt/manifoldcf/multiprocess-file-example-proprietary/
> sudo service tomcat start
> I obtained some warnings in th
> From: Karl Wright
> Sent: Tuesday, 14 August 2018 15:25
> To: user@manifoldcf.apache.org
> Subject: Re: Different time in Simple History Report
> There were a number of files committed.
> On Tue, Aug
> From: Karl Wright
> Sent: Tuesday, 14 August 2018 14:17
> To: user@manifoldcf.apache.org
> Subject: Re: Different time in Simple History Report
> Ok, I committed code that ensures that all times displayed in reports are
> in the browser client timezone. The same
.dk/date.php
> From: Karl Wright
> Sent: Tuesday, 14 August 2018 12:20
> To: user@manifoldcf.apache.org
> Subject: Re: Different time in Simple History Report
> Hi Mario,
> I did not change h
Hi Sven,
Please have a look at the Simple History report to see what happened to the
documents you are interested in.
The Web Connector will fetch binary documents no problem, but it sounds
like you have something else in your configuration that is causing them to
be rejected. The configuration
timezone, would be
> right if they were equal to the "Start time:" filter and to the "Start
> time" column.
> From: Karl Wright
> Sent: Tuesday, 14 August 2018 12:04
> To: user@manifoldcf.apache.org
> Subject: Re: Different time in
the time is two hours less than the right
> time, and the report shows the wrong time compared to the actual time, as
> you can see in the attachment.
> So I rolled back to the previous version.
> From: Karl Wright
> Sent: Friday, 10 A
downloaded trunk directory is very small, whereas the last trunk
> was bigger:
> administrator@sengvivv01:~/mcfsorce$ du -sh tr*
> 121M  trunk
> 1.8G  trunk_19062018
> From: Karl Wright
> Sent: Friday, 10 August 2018 16:47
From: Karl Wright
> Sent: Friday, 10 August 2018 10:53
> To: user@manifoldcf.apache.org
> Subject: Re: Different time in Simple History Report
> I've committed a change to trunk which will restore the pre-2016 behavior.
> Karl
> On Fri, Aug
I see.
Starting another process, it somehow ends up at the same point:
DEBUG 2018-08-10T10:31:08,336 (Thread-38685) - Cancelling request execution
No one is pushing the button through the Web User Interface, so I guess I
have to look at the source code.
On Fri, Aug 10, 2018 at 10:55,
There is no configuration I know of that is related to this.
The connection manager being shut down occurs when the agents process is
being shut down or killed. This is not something that MCF will do except
when told to.
Karl
On Fri, Aug 10, 2018 at 4:28 AM Gustavo Beneitez
wrote:
> Hi
timezone is set as the browser timezone (Europe/Rome), as you
>> can see, but the list is two hours behind my time zone.
>> So it seems that the list uses "universal time" instead of the time zone:
>> administrator@sengvivv01:~$ timedatectl
Hi again Karl,
I've been further investigating this issue, since it happened again today.
I was able to capture the instant when the jobs stopped; it seems they
receive an abort command.
I then investigated the logs (they are centralised in Kibana), and the
strangest thing I can see is this log:
> Universal time: Fri 2018-08-10 06:39:28 UTC
> RTC time: Fri 2018-08-10 06:39:28
> Time zone: Europe/Rome (CEST, +0200)
> System clock synchronized: yes
> systemd-timesyncd.service active: ye
Yes, you are right; maybe I can execute OPTIMIZE queries if the job fails
again. Let's just cross our fingers.
On Thu, Aug 9, 2018 at 12:28, Karl Wright () wrote:
> There is no autovacuum for MySQL. MySQL apparently does dead tuple
> cleanup as it goes.
>
> Karl
>
> On Thu, Aug 9, 2018 at 6:13
There is no autovacuum for MySQL. MySQL apparently does dead tuple cleanup
as it goes.
Karl
On Thu, Aug 9, 2018 at 6:13 AM Gustavo Beneitez
wrote:
> Hi,
>
> looking at the manifoldCF pom I can see
>
> 1.0.4-SNAPSHOT
>
> I'm not aware of any change in database, in fact ours is MySQL, I don't
>
Hi,
looking at the ManifoldCF pom I can see
1.0.4-SNAPSHOT
I'm not aware of any change in the database; in fact ours is MySQL, and I
don't know if the "auto_vacuum" property is present in a MySQL installation.
Thanks!
On Thu, Aug 9, 2018 at 11:19, msaunier () wrote:
> Hi Gustavo,
> What is
Hi Gustavo,
What is your ManifoldCF version?
Have you disabled auto_vacuum in your SQL configuration?
Maxence,
From: Gustavo Beneitez [mailto:gustavo.benei...@gmail.com]
Sent: Thursday, 9 August 2018 11:17
To: user@manifoldcf.apache.org
Subject: crawl interrupted
Hi all,
installation and the problem was solved!
Now I solved it using the Tika 1.19 nightly build.
Thanks a lot.
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Friday, 27 July 2018 12:39
To: user@manifoldcf.apache.org
Subject: Re: Job stuck internal http error 500
I a
the first place, or
> is it just to be expected given the nature of the multiple worker threads
> and the query types issued by ManifoldCF?
> Best Regards,
> Guy
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: 06 August 2018 12:16
> To: user@
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: 06 August 2018 12:16
To: user@manifoldcf.apache.org
Subject: Re: PostgreSQL version to support MCF v2.10
These are exactly the same kind of issue as the first "error" reported. They
will be retried. If they did not get retried, they would abo
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: 06 August 2018 10:52
> To: user@manifoldcf.apache.org
> Subject: Re: PostgreSQL version to support MCF v2.10
> Ah, the following errors:
> >>>>>>
> 2018-08-03 15:52:25
Subject: Re: PostgreSQL version to support MCF v2.10
Ah, the following errors:
>>>>>>
2018-08-03 15:52:25.218 BST [4140] ERROR: could not serialize access due to
read/write dependencies among transactions
2018-08-03 15:52:25.218 BST [4140] DETAIL: Reason code: Canceled on
identif
>> Best Regards
>> Guy
>> From: Steph van Schalkwyk [mailto:st...@remcam.net]
>> Sent: 03 August 2018 23:21
>> To: user@manifoldcf.apache.org
>> Subject: Re: PostgreSQL
van Schalkwyk [mailto:st...@remcam.net]
> Sent: 03 August 2018 23:21
> To: user@manifoldcf.apache.org
> Subject: Re: PostgreSQL version to support MCF v2.10
> I'm using 10.4 with no issues.
> One or two of the recommended settings for MCF have changed between 9.
I'm using 10.4 with no issues.
One or two of the recommended settings for MCF have changed between 9.6 and
10.
Simple to resolve though.
Steph
On Fri, Aug 3, 2018 at 1:29 PM, Karl Wright wrote:
> Hi Guy,
>
> I use Postgresql 9.6 myself and have found no issues with it. I don't
> know about v
> How can I debug this? Any idea? Does Jetty have a log file?
> Regards,
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Tuesday, 31 July 2018 15:32
> To: user@manifoldcf.
There must be a reason.
Karl
On Tue, Jul 31, 2018 at 8:18 AM msaunier wrote:
> Hello Karl,
> Today and yesterday, I have an error with Jetty. Jetty crashes for no
> reason.
> Error:
> ./start.sh : ligne 41 : 562 Processus arrêté "$JAVA_HOME/bin/java"
> $OPTIONS
Hi Vinay,
Dynamic rescan is meant for web-crawling and revisits already crawled
documents based on how often they have changed in the past. It is
therefore wholly inappropriate for something like a file crawl, since
directory contents (one of the kinds of documents there are in a file
crawl)
Thanks Karl!
I applied both patches and we're back up and running! Thanks so much for
the help and the patches!
Mike
On Mon, Jul 30, 2018 at 3:32 PM, Karl Wright wrote:
> Ok, attached a second fix.
> Karl
> On Mon, Jul 30, 2018 at 4:09 PM Karl Wright wrote:
>> Yes, of course. I
Ok, attached a second fix.
Karl
On Mon, Jul 30, 2018 at 4:09 PM Karl Wright wrote:
> Yes, of course. I overlooked that. Will fix.
> Karl
> On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo wrote:
>> That limit only applies to the list of transformations, not the list of
>> job IDs. If you
Yes, of course. I overlooked that. Will fix.
Karl
On Mon, Jul 30, 2018 at 3:54 PM Mike Hugo wrote:
> That limit only applies to the list of transformations, not the list of
> job IDs. If you follow the code into the next method
>
> >>
> /** Note registration for a batch of
That limit only applies to the list of transformations, not the list of job
IDs. If you follow the code into the next method
>>
/** Note registration for a batch of transformation connection names.
 */
protected void noteTransformationConnectionRegistration(List list)
  throws
The limit is applied in the method that calls
noteTransformationConnectionRegistration. Here it is:
>>
/** Note the registration of a transformation connector used by the
 * specified connections.
 * This method will be called when a connector is registered, on which the
 * specified
 *
Nice catch Karl!
I applied that patch, but I'm still getting the same error.
I think the problem is in
JobManager.noteTransformationConnectionRegistration.
If jobs.findJobsMatchingTransformations(list) returns a large list of ids
(like it is doing in our case - 39,941 ids), the generated
The Postgresql driver supposedly limits this to 25 clauses at a pop:
>>
@Override
public int getMaxOrClause()
{
  return 25;
}
/* Calculate the number of values a particular clause can have, given the
 * values for all the other clauses.
 * For example, if in the expression x AND y
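A sketch of how a limit like that is normally honored when querying against a large id list: chunk the list so that no single query exceeds the maximum OR-clause count. The names below are illustrative, not the actual JobManager code:
>>
import java.util.List;

// Issue one query per chunk instead of building a single enormous
// (id=? OR id=? OR ...) expression over tens of thousands of ids.
static void queryInChunks(List<Long> ids, int maxOrClause)
{
  for (int start = 0; start < ids.size(); start += maxOrClause)
  {
    List<Long> chunk =
        ids.subList(start, Math.min(start + maxOrClause, ids.size()));
    // ...build "(id=? OR id=? ...)" with chunk.size() terms and execute...
  }
}
<<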
Hi Mike,
This might be the issue indeed. I'll look into it.
Karl
On Mon, Jul 30, 2018 at 2:26 PM Mike Hugo wrote:
> I'm not sure what the solution is yet, but I think I may have found the
> culprit:
>
> JobManager.noteTransformationConnectionRegistration(List list) is
> creating a pretty
I'm not sure what the solution is yet, but I think I may have found the
culprit:
JobManager.noteTransformationConnectionRegistration(List list) is
creating a pretty big query:
SELECT id,status FROM jobs WHERE (id=? OR id=? OR id=? OR id=? OR id=?) FOR UPDATE
replace the ellipsis with
>> I have a question about the schedule-related configuration in the job. I
>> have a continuously running job which crawls the documents in Sharepoint
>> 2013 and the job is supposed to re-crawl about 26,000 docs every 24
>> hours as configured, however, it seems that
Karl
On Mon, Jul 30, 2018 at 4:35 AM Cheng Zeng wrote:
Hi Karl,
I have a question about the schedule-related configuration in the job. I have a
continuously running job which crawls the documents in Sharepoint 2013 and the
job is supposed to re-crawl about 26,00
Well, I have absolutely no idea what is wrong and I've never seen anything
like that before. But postgres is complaining because the communication
with the JDBC client is being interrupted by something.
Karl
On Mon, Jul 30, 2018 at 10:39 AM Mike Hugo wrote:
> No, and manifold and postgres
No, and manifold and postgres run on the same host.
On Mon, Jul 30, 2018 at 9:35 AM, Karl Wright wrote:
> ' LOG: incomplete message from client'
>
> This shows a network issue. Did your network configuration change
> recently?
>
> Karl
> On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo wrote:
' LOG: incomplete message from client'
This shows a network issue. Did your network configuration change recently?
Karl
On Mon, Jul 30, 2018 at 9:59 AM Mike Hugo wrote:
> Tried a postgres vacuum and also a restart, but the problem persists.
> Here's the log again with some additional
Tried a postgres vacuum and also a restart, but the problem persists.
Here's the log again with some additional logging details added (below)
I tried running the last query from the logs against the database and it
works fine - I modified it to return a count and that also works.
SELECT count(*)
AM Cheng Zeng wrote:
> Hi Karl,
> I have a question about the schedule-related configuration in the job. I
> have a continuously running job which crawls the documents in Sharepoint
> 2013 and the job is supposed to re-crawl about 26,000 docs every 24 hours
> as configur
It looks to me like your database server is not happy. Maybe it's out of
resources? Not sure but a restart may be in order.
Karl
On Sun, Jul 29, 2018 at 9:06 AM Mike Hugo wrote:
> Recently we started seeing this error when Manifold CF starts up. We had
> been running Manifold CF with many
Can you view the job and include a screen shot of where this is displayed?
Thanks.
The exclusions are not regexps -- they are file specs. The file specs have
special meanings for "*" (matches everything) and "?" (matches one
character). You do not need to URL encode them.
If you enable
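To illustrate the matching rule just described (a minimal sketch of the semantics, not the connector's actual matcher):
>>
import java.util.regex.Pattern;

// File-spec matching: '*' matches any run of characters, '?' matches
// exactly one character, everything else is literal.
static boolean matchesFileSpec(String spec, String fileName)
{
  StringBuilder re = new StringBuilder();
  for (char c : spec.toCharArray())
  {
    if (c == '*') re.append(".*");
    else if (c == '?') re.append('.');
    else re.append(Pattern.quote(String.valueOf(c)));
  }
  return fileName.matches(re.toString());
}
// e.g. matchesFileSpec("*.tmp", "scratch.tmp") returns true
<<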
To solve your production problem I highly recommend limiting the size of
the docs fed to Tika, for a start. But that is no guarantee, I understand.
Out of memory problems are very hard to get good forensics for because they
cause major disruptions to the running server. You could turn on a
Hi Karl,
Okay. Regarding the Out of Memory error:
This is the last day I can spend finding out where the error comes from.
After that, I have to go into production to meet my deadlines.
I hope to find time in the future to fix this problem on this server;
otherwise I cannot index
set:
> sudo nano options.env.unix
> -Xms2048m
> -Xmx2048m
> But I obtain the same error.
> My suspicion is that it could be a Solr/Tika problem.
> What could I do?
> I restricted the scan to a single file and I obtain the same error.
Although it is not clear which process you are talking about. If it is
Solr, ask them.
Karl
On Fri, Jul 27, 2018, 5:36 AM Karl Wright wrote:
> I am presuming you are using the examples. If so, edit the options file
> to grant more memory to you agents process by increasing the Xmx value.
>
> Karl
>
I am presuming you are using the examples. If so, edit the options file to
grant more memory to your agents process by increasing the Xmx value.
Karl
On Fri, Jul 27, 2018, 3:04 AM Bisonti Mario wrote:
> Hello.
> My job is stuck indexing an xlsx file of 38 MB.
> What could I do to
Thanks a lot Karl!!!
On 2018/07/26 13:28:47, Karl Wright wrote:
> Hi Mario,
> There is no connection between the number of CPUs and the number of output
> connections. You pick the maximum number of output connections based on
> the number of listening threads that you can use at the same
On Thu, Jul 26, 2018 at 11:09 AM msaunier wrote:
> On the repository connection, I have added « 20971520 » as the max
> document size.
> Maxence
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Thursday, 26 July 2018 17:07
> To: us
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Thursday, 26 July 2018 16:23
> To: user@manifoldcf.apache.org
> Subject: Re: ***UNCHECKED*** Re: Out of memory, one file bug i think
> I believe there's also a content length tab in the Windows Share
>> From: Karl Wright [mailto:daddy...@gmail.com]
>> Sent: Wednesday, 25 July 2018 19:15
>> To: user@manifoldcf.apache.org
>> Subject: ***UNCHECKED*** Re: Out of memory, one file bug i think
>> It looks like you are still running out of memory. I
> Maxence,
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Wednesday, 25 July 2018 19:15
> To: user@manifoldcf.apache.org
> Subject: ***UNCHECKED*** Re: Out of memory, one file bug i think
> It looks like you are still run
Hi Mario,
There is no connection between the number of CPUs and the number output
connections. You pick the maximum number of output connections based on
the number of listening threads that you can use at the same time in Solr.
Karl
On Thu, Jul 26, 2018 at 9:22 AM Bisonti Mario
wrote:
>
>> JBIG2ImageReader not loaded. jbig2 files will be ignored
>> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
>> for optional dependencies.
>> TIFFImageWriter not loaded. tiff files will not be processed
>> See https://pdfbox.apache.org/2.0/d
> optional dependencies.
> J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> juil. 26, 2018 11:29:01 AM
> org.apache.tika.config.InitializableProblemHandler$3
Here's the documentation from HttpClient on the various cookie policies.
You're probably going to need to read some of the RFCs to see which policy
you want. I will wait for you to get back to me with a recommendation
before taking any action in the MCF codebase. Thanks!
Ok, so the database for your site crawl contains both z.com and x.y.z.com
cookies? And your site pages from domain a.y.z.com receive no cookies at
all when fetched? Is that a correct description of the situation?
Please verify that the a.y.z.com pages are part of the protected part of
your
Hi,
the database may contain Z.com and X.Y.Z.com if created automatically
through a JSP, but not the intermediate one, Y.Z.com.
If the crawler decides to go to A.Y.Z.com and, looking at the database,
Z.com is present, it still doesn't work (it should, since A.Y.Z.com is a
sub-domain of Z.com).
Only doing that
The web connector, though, does not filter any cookies. It takes them all
-- whatever cookies HttpClient is storing at that point. So you should see
all the cookies in the database table, regardless of their site affinity,
unless HttpClient is refusing to accept a cookie for security reasons.
I agree, but the fact is that if my "login sequence" defines a login
credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
"X.Y.Z.com", none of the sub-sites receives that cookie. I need to write
the same cookie for every sub-domain; that solves the situation (and
thankfully is a
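For what it's worth, a small sketch of the HttpClient 4.x side of that workaround (names and values are illustrative): a cookie recorded with a parent-domain attribute such as ".z.com" is eligible for sub-domains like a.y.z.com, whereas one recorded for an exact host is not offered to sibling sub-domains:
>>
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.cookie.BasicClientCookie;

// Store a session cookie under the parent domain so HttpClient will offer
// it when the crawler visits any sub-domain of z.com.
static BasicCookieStore parentDomainCookie()
{
  BasicCookieStore store = new BasicCookieStore();
  BasicClientCookie session = new BasicClientCookie("JSESSIONID", "abc123");
  session.setDomain(".z.com");  // parent-domain cookie
  session.setPath("/");
  store.addCookie(session);
  return store;
}
<<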
From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Wednesday, 25 July 2018 19:18
> To: user@manifoldcf.apache.org
> Subject: Re: Speed up cleaning up job
> I'm sorry, I don't understand your question.
> Karl
> On Wed, Jul 25, 2018 at 12:53 PM
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, 25 July 2018 19:18
To: user@manifoldcf.apache.org
Subject: Re: Speed up cleaning up job
I'm sorry, I don't understand your question.
Karl
On Wed, Jul 25, 2018 at 12:53 PM msaunier wrote:
H
I'm sorry, I don't understand your question.
Karl
On Wed, Jul 25, 2018 at 12:53 PM msaunier wrote:
> Hi Karl,
> Can I configure ManifoldCF to clean up faster? I think ManifoldCF cleans
> 100 by 100 by default.
> Maxence,
You should not need to fill the database by hand. Your login sequence
should include whatever redirection etc is used to set the cookies though.
Karl
On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez
wrote:
> Hi again,
>
> Thanks Karl, I was able to do that after defining some "login
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
>> ~[mcf-pull-agent.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
>> ~[?:?
.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
> ~[mcf-pull-agent.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939)
> ~[?:?]
>
> at
> org.apache.manifoldcf.crawler.sys
> ~[?:?]
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
> Maxence,
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: Wednesday, 25 July 2018 13:12