Re: web crawler not sharing cookies

2018-07-26 Thread Gustavo Beneitez
Hi, the database may contain Z.com and X.Y.Z.com if they were created automatically through a JSP, but not the intermediate one, Y.Z.com. If the crawler decides to go to A.Y.Z.com and, looking in the database, Z.com is present, it still doesn't work (it should, since A.Y.Z.com is a sub-domain of Z.com). Only doing that

Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Ok, so the database for your site crawl contains both z.com and x.y.z.com cookies? And your site pages from domain a.y.z.com receive no cookies at all when fetched? Is that a correct description of the situation? Please verify that the a.y.z.com pages are part of the protected part of your

Re: web crawler not sharing cookies

2018-07-26 Thread Karl Wright
Here's the documentation from HttpClient on the various cookie policies. You're probably going to need to read some of the RFCs to see which policy you want. I will wait for you to get back to me with a recommendation before taking any action in the MCF codebase. Thanks!
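
For readers following the thread: the sub-domain question above comes down to the "domain-match" rule in RFC 6265, section 5.1.3. Below is a minimal sketch of that rule in plain Java. It is not ManifoldCF or HttpClient code; the class name `CookieDomainMatch` and its methods are hypothetical and exist only for illustration, and the IP-address check is deliberately simplified.

```java
import java.util.Locale;

// Hypothetical illustration of RFC 6265 section 5.1.3 domain matching.
// Not taken from ManifoldCF or HttpClient.
public class CookieDomainMatch {

    // Returns true if a cookie stored for cookieDomain would be sent to host.
    static boolean domainMatches(String host, String cookieDomain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = cookieDomain.toLowerCase(Locale.ROOT);
        // A leading dot in the Domain attribute is ignored.
        if (d.startsWith(".")) {
            d = d.substring(1);
        }
        if (d.isEmpty()) {
            return false;
        }
        // Identical strings always match.
        if (h.equals(d)) {
            return true;
        }
        // Otherwise the cookie domain must be a suffix of the host,
        // preceded by a dot, and the host must not be an IP address.
        return h.endsWith("." + d) && !looksLikeIpAddress(h);
    }

    // Simplified check (assumption): dotted IPv4 or anything containing ':'.
    static boolean looksLikeIpAddress(String host) {
        return host.matches("\\d+(\\.\\d+){3}") || host.contains(":");
    }

    public static void main(String[] args) {
        // A cookie for z.com applies to a.y.z.com.
        System.out.println(domainMatches("a.y.z.com", "z.com"));     // true
        // A cookie for x.y.z.com does not apply to a.y.z.com.
        System.out.println(domainMatches("a.y.z.com", "x.y.z.com")); // false
    }
}
```

Under this rule a cookie stored for z.com is eligible for a.y.z.com, while one stored for x.y.z.com is not. Whether a given HttpClient cookie policy applies exactly this rule is something to confirm against the HttpClient documentation.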

Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
Hi Maxence, The following error: >> FATAL 2018-07-26T11:30:32,220 (Worker thread '28') - Error tossed: org/apache/poi/POIXMLTextExtractor java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTextExtractor at

Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
Hi Maxence, I am wondering whether you moved any jars from dist/connector-common-lib to dist/lib? If you did this, you will mess up the ability of any of the Tika jars to find their dependencies. This also explains why commons-compress cannot be found; it's in connector-common-lib. It sounds

Solr connection, max connections and CPU

2018-07-26 Thread Bisonti Mario
Hello, I set up a Solr connection in the "Output connections" of ManifoldCF. I don't understand whether there is a relation between "Max Connections" and the number of CPUs in the host. Could you help me to understand it? Thanks a lot, Mario

Re: Solr connection, max connections and CPU

2018-07-26 Thread Karl Wright
Hi Mario, There is no connection between the number of CPUs and the number of output connections. You pick the maximum number of output connections based on the number of listening threads that you can use at the same time in Solr. Karl On Thu, Jul 26, 2018 at 9:22 AM Bisonti Mario wrote: >

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
The ContentLimiter truncates documents. That's not what you want. Use the Allowed Documents transformer. Karl On Thu, Jul 26, 2018 at 10:06 AM msaunier wrote: > I have added a Content Limiter transformation before the Tika extractor. It's > very, very slow now. Is that normal? > > > > Maxence, > > >

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
I believe there's also a content length tab in the Windows Share connector, if you're using that. Karl On Thu, Jul 26, 2018 at 10:19 AM Karl Wright wrote: > The ContentLimiter truncates documents. That's not what you want. > > Use the Allowed Documents transformer. > > Karl > > > On Thu, Jul

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
How are you limiting content size? Is this in the repository connection, or in an Allowed Documents transformation connection? Karl On Thu, Jul 26, 2018 at 10:58 AM msaunier wrote: > I have set a limit of 20 MB per document and I get a Java out-of-memory error again. > > > > > > > > *From:* Karl Wright

Re: ***UNCHECKED*** Re: Out of memory, one file bug i think

2018-07-26 Thread Karl Wright
The way it works in the JCIFS connector is that files that aren't within the specification are removed from the list of files being processed. If a file is already being processed, however, it is just retried. So changing this property to make an out-of-memory condition go away is not going to