Here's the documentation from HttpClient on the various cookie policies. You're probably going to need to read some of the RFCs to see which policy you want. I will wait for you to get back to me with a recommendation before taking any action in the MCF codebase. Thanks!
https://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html

Karl

On Thu, Jul 26, 2018 at 3:19 AM Karl Wright <daddy...@gmail.com> wrote:

> Ok, so the database for your site crawl contains both z.com and x.y.z.com
> cookies? And your site pages from domain a.y.z.com receive no cookies at
> all when fetched? Is that a correct description of the situation?
>
> Please verify that the a.y.z.com pages are part of the protected part of
> your "site". The regular expression that describes site membership for the
> login sequence you are trying to set up must include them or they will not
> receive any cookies no matter what we do.
>
> If this is set up correctly, then the only explanation is the HttpClient
> cookie policy in effect for site fetches. It does not look like we
> override the cookie policy anywhere when setting up the client:
>
>       PoolingHttpClientConnectionManager poolingConnManager = new PoolingHttpClientConnectionManager(
>         RegistryBuilder.<ConnectionSocketFactory>create()
>           .register("http", PlainConnectionSocketFactory.getSocketFactory())
>           .register("https", myFactory)
>           .build());
>       poolingConnManager.setDefaultMaxPerRoute(1);
>       poolingConnManager.setValidateAfterInactivity(2000);
>       poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
>         .setTcpNoDelay(true)
>         .setSoTimeout(socketTimeoutMilliseconds)
>         .build());
>       connManager = poolingConnManager;
>     }
>
> HttpClient tends to default to "strict" when stuff is not specified. I'll
> see if I can find out what the behavior is.
>
> Karl
>
> On Thu, Jul 26, 2018 at 2:29 AM Gustavo Beneitez <gustavo.benei...@gmail.com> wrote:
>
>> Hi,
>>
>> The database may contain Z.com and X.Y.Z.com if they are created
>> automatically through a JSP, but not the intermediate one, Y.Z.com.
>>
>> If the crawler goes to A.Y.Z.com and the database contains Z.com, it
>> still doesn't work (it should, since A.Y.Z.com is a sub-domain of Z.com).
>>
>> Only after making that change by hand (replacing the domain with the
>> sub-domain in the database) and restarting ManifoldCF does it begin to
>> work.
>>
>> There might be security constraints involved; I will analyze this
>> further.
>>
>> Regards.
>>
>> On Thu, Jul 26, 2018 at 0:06, Karl Wright (<daddy...@gmail.com>) wrote:
>>
>>> The web connector, though, does not filter any cookies. It takes them
>>> all -- whatever cookies HttpClient is storing at that point. So you
>>> should see all the cookies in the database table, regardless of their
>>> site affinity, unless HttpClient is refusing to accept a cookie for
>>> security reasons.
>>>
>>> It's also possible that HttpClient is selective about which cookies to
>>> transmit on a page fetch.
>>>
>>> Can you look in the database and tell me whether your cookie gets
>>> stored, or not? If not, then HttpClient's cookie acceptance policy is
>>> not lenient enough. If it is in the database, then it's the
>>> transmission policy that is too strict.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <gustavo.benei...@gmail.com> wrote:
>>>
>>>> I agree, but the fact is that if my "login sequence" defines a login
>>>> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
>>>> "X.Y.Z.com", none of the sub-sites receives that cookie. I need to
>>>> write the same cookie for every sub-domain, which solves the problem
>>>> (and thankfully it is a language cookie and not a dynamic one).
>>>>
>>>> Regards.
>>>>
>>>> On Wed, Jul 25, 2018 at 19:17, Karl Wright (<daddy...@gmail.com>) wrote:
>>>>
>>>>> You should not need to fill the database by hand. Your login
>>>>> sequence should include whatever redirection etc is used to set the
>>>>> cookies though.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <gustavo.benei...@gmail.com> wrote:
>>>>>
>>>>>> Hi again,
>>>>>>
>>>>>> Thanks Karl, I was able to do that after defining a "login
>>>>>> sequence", but only after also filling the database (cookiedata
>>>>>> table) with certain values because of "domain restrictions".
>>>>>> I suspect that before every web call ManifoldCF only takes cookies
>>>>>> for the URL's exact subdomain (e.g. x.y.z.com), so if you define
>>>>>> your cookie for "z.com" it won't be sent. I added every subdomain
>>>>>> by hand and it started to work.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 8:12, Gustavo Beneitez (<gustavo.benei...@gmail.com>) wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks a lot. Let me check the documentation for an example of
>>>>>>> that.
>>>>>>>
>>>>>>> Regards!
>>>>>>>
>>>>>>> On Thu, Jul 19, 2018 at 21:54, Karl Wright (<daddy...@gmail.com>) wrote:
>>>>>>>
>>>>>>>> You are correct that cookies are not shared among threads. That
>>>>>>>> is by design.
>>>>>>>>
>>>>>>>> The only way to set cookies for the WebConnector is to have
>>>>>>>> there be a "login sequence". The login sequence sets cookies
>>>>>>>> that are then used by all subsequent fetches.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <gustavo.benei...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I tried to look for an answer before writing this email, with
>>>>>>>>> no luck. Sorry for the inconvenience if it has already been
>>>>>>>>> answered.
>>>>>>>>>
>>>>>>>>> I need to set a cookie at the beginning of the web crawl. The
>>>>>>>>> cookie determines the language in which you get the content,
>>>>>>>>> and while there are several choices, if no cookie is found
>>>>>>>>> there will be a "default language".
>>>>>>>>>
>>>>>>>>> I made a JSP which sets the cookie and contains several links
>>>>>>>>> (href), and pointed ManifoldCF to this page as the repository
>>>>>>>>> seed. I expected the crawling engine to capture the links in
>>>>>>>>> the language indicated by the cookie, but what I really got is
>>>>>>>>> a lot of content shown in the default language.
>>>>>>>>>
>>>>>>>>> My guess is that cookies are not shared between spider
>>>>>>>>> threads, so cookies cannot persist from one link to the next.
>>>>>>>>> The cookie domain is correct, and so is the cookie expiration.
>>>>>>>>>
>>>>>>>>> I would very much appreciate any help with this.
>>>>>>>>>
>>>>>>>>> Thanks in advance!
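
For reference, the expectation discussed in this thread (a cookie stored for z.com should also be sent to a.y.z.com) corresponds to the domain-matching rule in RFC 6265, section 5.1.3. Below is a minimal standalone sketch of that rule in plain Java. It is not HttpClient's or ManifoldCF's actual code; the class name CookieDomainMatch and both helper methods are hypothetical, and the IP-address check is a deliberately crude assumption.

```java
import java.util.Locale;

// Hypothetical illustration of RFC 6265 section 5.1.3 domain matching.
// Not taken from HttpClient or ManifoldCF.
final class CookieDomainMatch {

    // Returns true when a cookie stored for cookieDomain may be sent to host.
    static boolean domainMatches(String cookieDomain, String host) {
        if (cookieDomain == null || host == null) {
            return false;
        }
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        // RFC 6265 ignores a leading dot in the Domain attribute.
        if (domain.startsWith(".")) {
            domain = domain.substring(1);
        }
        if (domain.isEmpty()) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        // Identical strings always match.
        if (h.equals(domain)) {
            return true;
        }
        // Suffix matching only applies to host names, never to IP addresses.
        if (looksLikeIpAddress(h)) {
            return false;
        }
        // The domain must be a suffix of the host, preceded by a dot.
        return h.endsWith("." + domain);
    }

    // Crude assumption: anything with a colon, or made only of digits and dots,
    // is treated as an IP address.
    static boolean looksLikeIpAddress(String host) {
        return host.indexOf(':') >= 0 || host.matches("[0-9.]+");
    }

    public static void main(String[] args) {
        System.out.println(domainMatches("z.com", "a.y.z.com"));     // true
        System.out.println(domainMatches("x.y.z.com", "a.y.z.com")); // false
        System.out.println(domainMatches("z.com", "notz.com"));      // false
    }
}
```

Under this rule a cookie for z.com would be sent to a.y.z.com, while one stored only for x.y.z.com would not. If the observed behavior differs, that would be consistent with a stricter cookie policy being in effect, as suggested above.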