Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
The web connector, though, does not filter any cookies. It takes them all -- whatever cookies HttpClient is storing at that point. So you should see all the cookies in the database table, regardless of their site affinity, unless HttpClient is refusing to accept a cookie for security reasons.

Re: web crawler not sharing cookies

2018-07-25 Thread Gustavo Beneitez
I agree, but the fact is that if my "login sequence" defines a login credential for domain "Z.com" and the crawler reaches "Y.Z.com" or "X.Y.Z.com", none of the sub-sites receives that cookie. I need to write the same cookie for every sub-domain; that solves the situation (and thankfully is a
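
For readers following the sub-domain question, here is a minimal sketch of the domain-match rule from RFC 6265 (section 5.1.3), which decides whether a cookie stored for "Z.com" should accompany a request to "Y.Z.com". It is plain Java written for illustration only; the class and method names are hypothetical, and it is not the code used by the ManifoldCF web connector or by HttpClient. One thing worth checking against it: under RFC 6265 a cookie set without a Domain attribute is host-only and goes back only to the exact host that set it, which may be what is being observed here.

```java
import java.util.Locale;

public class CookieDomainMatch {

    // Hypothetical helper: true if a cookie whose Domain attribute is
    // cookieDomain may be sent to requestHost (RFC 6265 domain-match).
    static boolean domainMatches(String requestHost, String cookieDomain) {
        String host = requestHost.toLowerCase(Locale.ROOT);
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        // A leading dot in the Domain attribute is ignored.
        if (domain.startsWith(".")) {
            domain = domain.substring(1);
        }
        // Identical host and domain always match.
        if (host.equals(domain)) {
            return true;
        }
        // Otherwise the host must end with "." + domain and must not be an IP address.
        return host.endsWith("." + domain) && !looksLikeIpAddress(host);
    }

    // Rough check, sufficient for this sketch: dotted IPv4 or anything containing ':'.
    static boolean looksLikeIpAddress(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}") || host.contains(":");
    }

    public static void main(String[] args) {
        System.out.println(domainMatches("Y.Z.com", "Z.com"));
        System.out.println(domainMatches("X.Y.Z.com", "Z.com"));
        System.out.println(domainMatches("Z.com", "Y.Z.com"));
        System.out.println(domainMatches("notZ.com", "Z.com"));
    }
}
```

If the cookie stored by the login sequence carries Domain "Z.com", a client following this rule would send it to both sub-sites; a cookie stored under the exact host name only would not match, and writing one copy per sub-domain would then be the workaround.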

Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
The "cleaning up" phase deletes the documents in the target index (where your output connectors point). That takes more time. Karl On Wed, Jul 25, 2018 at 1:43 PM msaunier wrote: > If I delete a job on ManifoldCF, jobs pass in « Cleaning Up » status. > > « Processed » document are delete

RE: Speed up cleaning up job

2018-07-25 Thread msaunier
If I delete a job on ManifoldCF, the job goes into « Cleaning Up » status. « Processed » documents are deleted very fast, and « Active » documents too. But for « Documents » on the interface, it is very slow to delete every line. ManifoldCF deletes Documents 100 by 100. Maxence, From: Karl

Re: Speed up cleaning up job

2018-07-25 Thread Karl Wright
I'm sorry, I don't understand your question. Karl On Wed, Jul 25, 2018 at 12:53 PM msaunier wrote: > Hi Karl, > > Can I configure ManifoldCF to clean up faster? I think ManifoldCF > cleans 100 by 100 by default. > > Maxence,

Re: web crawler not sharing cookies

2018-07-25 Thread Karl Wright
You should not need to fill the database by hand. Your login sequence should include whatever redirection etc. is used to set the cookies, though. Karl On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez wrote: > Hi again, > > Thanks Karl, I was able to do that after defining some "login >

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
It looks like you are still running out of memory. I would love to know what document it was that is doing that. I suspect it is very large already, and for some reason it cannot be streamed. Karl On Wed, Jul 25, 2018 at 1:13 PM Karl Wright wrote: > Hi Maxence, > > The second exception is

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
Hi Maxence, The second exception is occurring because processing is still occurring while the JVM is shutting down; it can be ignored. Karl On Wed, Jul 25, 2018 at 1:01 PM msaunier wrote: > Hi Karl, > > I have added the snapshot and I'm being spammed with this error: > > FATAL

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
That's what I was afraid of. The new poi jars have dependencies we haven't accounted for yet. Can you download the apache-commons-compress jar (latest version should be OK) and also put that in connector-common-lib? Thanks!! Karl On Wed, Jul 25, 2018 at 1:01 PM msaunier wrote: > Hi Karl, >

Re: web crawler not sharing cookies

2018-07-25 Thread Gustavo Beneitez
Hi again, Thanks Karl, I was able to do that after defining some "login sequence", but also after filling the database (cookiedata table) with certain values due to "domain constrictions". Before every web call, I suspect Manifold only takes cookies from the URL's exact subdomain (i.e. x.y.z.com), so if

Speed up cleaning up job

2018-07-25 Thread msaunier
Hi Karl, Can I configure ManifoldCF to clean up faster? I think ManifoldCF cleans 100 by 100 by default. Maxence,

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
Out of memory errors are fatal, I'm afraid, because they corrupt not only the document in question but all others being processed at the same time. So those cannot be ignored. Tika should ignore documents that it cannot process, however, and that is a great enhancement request for them. Karl

Re: Out of memory, one file bug i think

2018-07-25 Thread Karl Wright
Hi Maxence, Tomorrow (7/26) the POI project will be delivering a nightly build which should repair the Class Not Found exceptions. You will need to download it here: https://builds.apache.org/view/P/view/POI/job/POI-DSL-1.8/lastSuccessfulBuild/artifact/build/dist/ ... and replace all poi jars