Here's the documentation from HttpClient on the various cookie policies.
You're probably going to need to read some of the RFCs to see which policy
you want.  I will wait for you to get back to me with a recommendation
before taking any action in the MCF codebase.  Thanks!

https://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html
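
For reference while comparing the policies, here is a minimal Java sketch of
the domain-matching rule from RFC 6265, section 5.1.3. Under that rule a cookie
set for z.com is sent to a.y.z.com. The class and method names are made up for
illustration; this is not HttpClient or ManifoldCF code, and it leaves out the
public-suffix check that real implementations add.

```java
import java.util.Locale;

public class CookieDomainMatch {

    // Hypothetical helper: returns true if a cookie whose Domain attribute is
    // cookieDomain should be sent to the given request host, following the
    // RFC 6265 section 5.1.3 domain-match rule (no public-suffix check).
    static boolean domainMatches(String host, String cookieDomain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = cookieDomain.toLowerCase(Locale.ROOT);
        // A leading dot in the Domain attribute is ignored (section 5.2.3).
        if (d.startsWith(".")) {
            d = d.substring(1);
        }
        if (d.isEmpty()) {
            return false;
        }
        // Identical strings always match.
        if (h.equals(d)) {
            return true;
        }
        // Suffix matching applies only to host names, never to IP addresses.
        if (isIpLiteral(h)) {
            return false;
        }
        // The host must end with the domain, preceded by a dot.
        return h.endsWith("." + d);
    }

    // Rough check for IPv4 dotted-quad or IPv6 literals.
    static boolean isIpLiteral(String host) {
        return host.indexOf(':') >= 0
            || host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    public static void main(String[] args) {
        System.out.println(domainMatches("a.y.z.com", "z.com"));   // true
        System.out.println(domainMatches("z.com", "x.y.z.com"));   // false
        System.out.println(domainMatches("notz.com", "z.com"));    // false
    }
}
```

If the policy in effect behaves differently from this for a z.com cookie and
an a.y.z.com fetch, that would be consistent with what you are seeing.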

Karl


On Thu, Jul 26, 2018 at 3:19 AM Karl Wright <daddy...@gmail.com> wrote:

> Ok, so the database for your site crawl contains both z.com and x.y.z.com
> cookies?  And your site pages from domain a.y.z.com receive no cookies at
> all when fetched?  Is that a correct description of the situation?
>
> Please verify that the a.y.z.com pages are part of the protected part of
> your "site".  The regular expression that describes site membership for the
> login sequence you are trying to set up must include them or they will not
> receive any cookies no matter what we do.
>
> If this is set up correctly, then the only explanation is the HttpClient
> cookie policy in effect for site fetches.  It does not look like we
> override the cookie policy anywhere when setting up the client:
>
>         PoolingHttpClientConnectionManager poolingConnManager =
>           new PoolingHttpClientConnectionManager(
>             RegistryBuilder.<ConnectionSocketFactory>create()
>               .register("http", PlainConnectionSocketFactory.getSocketFactory())
>               .register("https", myFactory)
>               .build());
>         poolingConnManager.setDefaultMaxPerRoute(1);
>         poolingConnManager.setValidateAfterInactivity(2000);
>         poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
>           .setTcpNoDelay(true)
>           .setSoTimeout(socketTimeoutMilliseconds)
>           .build());
>         connManager = poolingConnManager;
>       }
>
>
> HttpClient tends to default to "strict" when a policy is not specified.
> I'll see if I can find out what the behavior is.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 2:29 AM Gustavo Beneitez <
> gustavo.benei...@gmail.com> wrote:
>
>> Hi,
>>
>> The database may contain Z.com and X.Y.Z.com cookies if they are created
>> automatically through a JSP, but not the intermediate one, Y.Z.com.
>>
>> If the crawler goes to A.Y.Z.com and a Z.com cookie is present in the
>> database, it still doesn't work (it should, since A.Y.Z.com is a
>> sub-domain of Z.com).
>>
>> It only begins to work after making the change by hand (replacing the
>> domain with the sub-domain in the database) and restarting ManifoldCF.
>>
>> There might be security constraints involved; I will analyze this
>> further.
>>
>> Regards.
>>
>>
>> On Thu, Jul 26, 2018 at 0:06, Karl Wright (<daddy...@gmail.com>)
>> wrote:
>>
>>> The web connector, though, does not filter any cookies.  It takes them
>>> all -- whatever cookies HttpClient is storing at that point.  So you should
>>> see all the cookies in the database table, regardless of their site
>>> affinity, unless HttpClient is refusing to accept a cookie for security
>>> reasons.
>>>
>>> It's also possible that HttpClient is selective about which cookies to
>>> transmit on a page fetch.
>>>
>>> Can you look in the database and tell me whether your cookie gets
>>> stored, or not?  If not, then HttpClient's cookie acceptance policy is not
>>> lenient enough.  If it is in the database, then it's the transmission
>>> policy that is too strict.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <
>>> gustavo.benei...@gmail.com> wrote:
>>>
>>>> I agree, but the fact is that if my "login sequence" defines a login
>>>> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
>>>> "X.Y.Z.com", none of the sub-sites receives that cookie. I need to
>>>> write the same cookie for every sub-domain, which solves the problem
>>>> (thankfully it is a language cookie and not a dynamic one).
>>>>
>>>> Regards.
>>>>
>>>> On Wed, Jul 25, 2018 at 19:17, Karl Wright (<daddy...@gmail.com>)
>>>> wrote:
>>>>
>>>>> You should not need to fill the database by hand.  Your login sequence
>>>>> should include whatever redirection etc is used to set the cookies though.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>>>>> gustavo.benei...@gmail.com> wrote:
>>>>>
>>>>>> Hi again,
>>>>>>
>>>>>> Thanks Karl, I was able to do that after defining a "login
>>>>>> sequence", but only after also filling the database (cookiedata
>>>>>> table) with certain values because of "domain restrictions".
>>>>>> I suspect that before every web call ManifoldCF only takes cookies
>>>>>> for the exact subdomain of the URL (e.g. x.y.z.com), so a cookie
>>>>>> defined for "z.com" won't be sent. I added every subdomain by hand
>>>>>> and it started to work.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 8:12, Gustavo Beneitez (<
>>>>>> gustavo.benei...@gmail.com>) wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks a lot. Let me check the documentation for an example of
>>>>>>> that.
>>>>>>>
>>>>>>> Regards!
>>>>>>>
>>>>>>> On Thu, Jul 19, 2018 at 21:54, Karl Wright (<daddy...@gmail.com>)
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You are correct that cookies are not shared among threads.  That is
>>>>>>>> by design.
>>>>>>>>
>>>>>>>> The only way to set cookies for the WebConnector is to define a
>>>>>>>> "login sequence".  The login sequence sets cookies that are then
>>>>>>>> used by all subsequent fetches.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>>>>>>>> gustavo.benei...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> I tried to look for an answer before writing this email, with no
>>>>>>>>> luck. Sorry for the inconvenience if this has already been answered.
>>>>>>>>>
>>>>>>>>> I need to set a cookie at the beginning of the web crawl. The cookie
>>>>>>>>> determines the language in which content is returned; there are
>>>>>>>>> several choices, and if no cookie is found a "default language" is
>>>>>>>>> used.
>>>>>>>>>
>>>>>>>>> I made a JSP which sets the cookie and contains several links
>>>>>>>>> (href), and pointed ManifoldCF to this page as the repository seed. I
>>>>>>>>> expected the crawling engine to fetch the links in the language
>>>>>>>>> indicated by the cookie, but what I actually got is a lot of content
>>>>>>>>> in the default language.
>>>>>>>>>
>>>>>>>>> My guess is that cookies are not shared between spider threads, so
>>>>>>>>> cookies do not persist from one link to the next. The cookie domain
>>>>>>>>> is correct, and so is the cookie expiration.
>>>>>>>>>
>>>>>>>>> I would very much appreciate your help with this.
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
