Re: Crawling behind an ISA proxy (iis 7.5)

Karl Wright Thu, 28 Jun 2012 02:26:34 -0700

I was wondering if you'd picked up and tried the patch for
CONNECTORS-483.  This patch adds official proxy support for the Web
Connector.  Alternatively, you could try to build and run with trunk
code.


Karl

On Wed, May 16, 2012 at 12:12 PM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Rene,
>
> The URL that is causing the RFC2617 challenge/response is being
> authenticated with basic auth, not NTLM.  This could yield a 401.  You
> may want to check the URL in a browser other than IE (Firefox, for
> instance) to see if basic auth is being used for this URL rather than
> NTLM.
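
One way to check this without a browser is to look at the
WWW-Authenticate headers on the 401 response, since each one names a
scheme the server offers for that URL.  The following is a minimal
Python sketch, not ManifoldCF code.  The function names are
hypothetical, and it assumes one challenge per header line.

```python
def offered_schemes(www_authenticate_values):
    """Return the lower-cased auth scheme named by each WWW-Authenticate value.

    Assumes one challenge per header value, e.g. 'NTLM' or 'Basic realm="x"'.
    """
    schemes = []
    for value in www_authenticate_values:
        tokens = value.split(None, 1)  # the scheme is the first token
        if tokens:
            schemes.append(tokens[0].lower())
    return schemes


def offers_only_basic(www_authenticate_values):
    """True when Basic is offered and neither NTLM nor Negotiate is."""
    schemes = set(offered_schemes(www_authenticate_values))
    return "basic" in schemes and not schemes & {"ntlm", "negotiate"}
```

The header values can be copied from any HTTP trace of the 401
response and passed in as a list of strings.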
>
> The redirection you describe to GetLogon is pretty standard practice.
> You can easily tell the web connector that that is part of the logon
> sequence by following the steps I laid out in the earlier email.
>
> Once you have set up what you think is the right set of logon pages,
> it's very helpful to attempt a crawl and then see what the simple
> history shows.  There are specific activities logged when logon begins
> and ends, so this is enormously helpful as a diagnostic aid.  If you
> see a continuous loop (entering logon sequence, doing stuff, exiting
> logon sequence, and repeating) then it is clear that the cookie has
> not been set.
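
The loop check described above can be sketched in Python as follows.
This is an illustration only: the activity names "begin logon" and
"end logon" are placeholders for whatever the simple history report
actually shows, and the threshold of three cycles is an arbitrary
assumption.

```python
def count_logon_cycles(activities, begin="begin logon", end="end logon"):
    """Count completed begin/end logon pairs in a chronological activity list.

    The default activity names are placeholders, not confirmed report values.
    """
    cycles = 0
    in_logon = False
    for activity in activities:
        if activity == begin:
            in_logon = True
        elif activity == end and in_logon:
            cycles += 1
            in_logon = False
    return cycles


def looks_like_logon_loop(activities, threshold=3):
    """True when the logon sequence completed at least `threshold` times."""
    return count_logon_cycles(activities) >= threshold
```

Feeding it the activities recorded for a single URL, in time order,
gives a quick signal of whether the logon sequence keeps repeating.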
>
> I won't be able to look at your packet log for a while, probably at
> least a week.
>
> Karl
>
>
>
> On Wed, May 16, 2012 at 10:23 AM, Rene Nederhand <r...@nederhand.net> wrote:
>> Hi Karl,
>>
>> Thank you so much for putting so much time into educating a newbie. I
>> appreciate your help enormously.
>>
>> I've tried to follow each of the steps below. So far, it doesn't work but I
>> will continue this evening to see if I can get this thing going.
>>
>> In the meantime, I have switched the log level of the crawling process to
>> "INFO" and found something interesting in the logs. Perhaps this could shed
>> some light on my issues:
>>
>> ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
>> org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid challenge: Basic
>>   at org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown Source)
>>   at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown Source)
>>   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
>>   at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
>>   at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
>>
>> Please note that I have set NTLM (not BASIC) authentication on
>> "bb.helo.hanze.nl" and nothing else. The error does not occur when I try to
>> crawl our intranet (also with NTLM). Does this mean something? At least, I
>> think it is the source of the 401 I get when looking at the simple report,
>> isn't it?
>>
>> In addition, I've used Charles proxy to monitor all interaction between my
>> browser and the server. I have found that it doesn't matter which URL I use
>> to enter Blackboard; they all get redirected to
>> https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page-based
>> authentication handle this?
>>
>> To make the information complete, I've added the HAR file with the
>> CharlesProxy output. It can be displayed
>> at http://www.softwareishard.com/har/viewer/ for example. You'll be able to
>> see all requests/responses when I start with a clean browser (cookies
>> removed) entering https://bb.helo.hanze.nl. Maybe this helps.
>>
>> Again, thanks a lot for your help!
>>
>> René
>>
>>
>>
>>
>>
>> On Tue, May 15, 2012 at 5:59 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>> Hi Rene,
>>>
>>> You will need both NTLM auth (page auth, which you have already set
>>> up), and Session auth (which you haven't yet set up).
>>>
>>> In order to set up session-based auth, you should first identify the
>>> set of pages that you want access to that are protected by a cookie
>>> requirement.  You will need to write a regular expression that matches
>>> these pages and ONLY these pages.  This expression gets entered as the "URL
>>> regular expression" on the Access Credentials tab in the Session-based
>>> Access Credentials part of the tab.  Then, click the Add button.
>>>
>>> The next thing you will need is to specify how the connector
>>> recognizes pages that belong to the logon sequence.  The actual
>>> sequence you need to understand is what happens in the browser when
>>> you try to access a specific protected URL and you don't have the
>>> right cookie.  You did not actually specify that; I think you are
>>> presuming that you'd be entering directly through the logon page, but
>>> that is not how it works.  The crawler will have a URL in mind and
>>> will need access to the content of that URL.  It will fetch the URL,
>>> and if the actual content is NOT fetched, we need to detect that
>>> situation and consider it part of the logon sequence.
>>>
>>> So let's pretend that what happens when the cookie is not present is
>>> that you get a redirection to the logon page, instead of the actual
>>> page content.  In that case, you would create a login sequence page
>>> description consisting of the same URL regular expression that
>>> describes the protected content pages, plus the "redirection" radio
>>> button, plus a target URL regular expression that would match
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  You then click the Add
>>> button for login pages to add that description to the set of login
>>> pages.
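
The two regular expressions in such a login page description can be
sketched in Python as below.  This is a hedged illustration rather than
connector code: the content pattern under /webapps/ is an assumption
about the site layout, the names PROTECTED, LOGON_TARGET and
is_logon_redirect are hypothetical, and re.search is used here without
any claim about how the connector itself applies the expressions.

```python
import re

# Assumed pattern for the protected content; adjust to the real site layout.
PROTECTED = re.compile(r"^https://bb\.helo\.hanze\.nl/webapps/")

# Redirect target. The dots and the question mark are escaped so that
# they match literally instead of acting as regex operators.
LOGON_TARGET = re.compile(r"bb\.helo\.hanze\.nl/CookieAuth\.dll\?GetLogon")


def is_logon_redirect(url, redirect_target):
    """True when a protected URL answered with a redirect to the logon page."""
    if redirect_target is None:
        return False
    return bool(PROTECTED.search(url)) and bool(LOGON_TARGET.search(redirect_target))
```

Note that the unescaped text "CookieAuth.dll?GetLogon" used as a
pattern does not match the URL at all, because "l?" makes the second
"l" optional and the literal question mark in the URL is never
consumed.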
>>>
>>> Next, the GetLogon page itself needs to be added as a login sequence
>>> page.  The regular expression should match only
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon".  The type of the page is
>>> "form" because you said this was a form where you could fill in your
>>> login credentials.  If there is only one form on the page you can
>>> leave the regexp that matches the form name blank since that will
>>> match everything.  Once you click "Add" for this page, you will have
>>> the opportunity to fill in form names and values to post when the form
>>> gets posted.
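
For reference, the form names and values from the browser trace quoted
further down would produce a POST body like the one built in this
Python sketch.  It only illustrates the encoding; in ManifoldCF the
names and values are entered in the UI, and the function name
logon_form_body is hypothetical.

```python
from urllib.parse import urlencode


def logon_form_body(username, password):
    """Build the urlencoded body for the logon form seen in the browser trace."""
    fields = [
        ("curl", "Z2F"),
        ("flags", "0"),
        ("forcedownlevel", "0"),
        ("formdir", "3"),
        ("trusted", "0"),
        ("username", username),
        ("password", password),
        ("SubmitCreds", "Log On"),
    ]
    # urlencode escapes each value; the space in "Log On" becomes "+".
    return urlencode(fields)
```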
>>>
>>> It was not clear from your description, once again, what happens after
>>> the Logon page is posted.  If there is a special target page, you need
>>> to include that also in the login sequence so that its content is not
>>> taken.  If there is a redirection back to the original content page,
>>> you'd include that redirection.
>>>
>>> Hopefully this is beginning to make a bit of sense to you; but this is
>>> the general picture, not related to your actual site that closely.
>>> For example, the Javascript redirection you mentioned will not be
>>> processed by ManifoldCF, but that is unnecessary because at the end of
>>> the whole login sequence ManifoldCF automatically goes back to the
>>> original URL when the login sequence is chased to its end.  So all you
>>> need to do is make sure that all pages that are part of that sequence
>>> are specified.
>>>
>>> On the other hand, it's not clear that the code you have "protecting"
>>> the site sets cookies any other way than through Javascript.  The
>>> cookie that this Javascript actually sets is a really stupid
>>> non-specific cookie, but unless it is set by the standard response
>>> header method, I don't think it's going to wind up being set at all.
>>> Can you confirm that this is the only way the cookie gets set?
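
The distinction matters because a client that does not run JavaScript
only ever sees cookies delivered in Set-Cookie response headers.  The
Python sketch below shows that view of the world; the function name
cookies_from_headers is hypothetical, and any cookie value passed to it
in an example would be a placeholder.

```python
from http.cookies import SimpleCookie


def cookies_from_headers(set_cookie_values):
    """Collect name/value pairs from Set-Cookie header values.

    Cookies assigned via document.cookie in page script never appear
    here, because the script is not executed by a plain HTTP client.
    """
    jar = {}
    for value in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(value)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar
```

A response that sets its cookie only from script would leave the
returned dictionary empty.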
>>>
>>> Karl
>>>
>>> On Tue, May 15, 2012 at 10:57 AM, Rene Nederhand <r...@nederhand.net>
>>> wrote:
>>> > Hi Karl,
>>> >
>>> > Thank you so much for your detailed explanation. I am trying each
>>> > step you've pointed out. Unfortunately, I cannot get this thing going.
>>> > Hopefully, you can help me if I give you more detailed information.
>>> >
>>> > The sequence of steps is (when accessing https://bb.helo.hanze.nl):
>>> >
>>> > 1.
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
>>> > This gives me indeed NTLM authentication. When I create a crawler that
>>> > only crawls the above page I get a 200 response. So this works, no
>>> > 401.
>>> >
>>> > 2. When I submit my username and password, this request is sent to the
>>> > server. This is also the only form I'll ever see:
>>> >
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302)
>>> > Request:
>>> > curl    Z2F
>>> > flags   0
>>> > forcedownlevel  0
>>> > formdir 3
>>> > trusted 0
>>> > username        loginname
>>> > password        mypassword
>>> > SubmitCreds     Log On
>>> >
>>> > 3. The response is a cookie being set with a redirect to the first url
>>> > (but now with the cookie set)
>>> >
>>> > Response:
>>> >        HTTP/1.1 302 Moved Temporarily
>>> > Location        https://bb.helo.hanze.nl/
>>> > Set-Cookie
>>> >  noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9";
>>> > HttpOnly; Domain=.hanze.nl; secure; path=/
>>> > Content-Length  0
>>> > Connection      close
>>> >
>>> > Request:
>>> >        GET / HTTP/1.1
>>> > Host    bb.helo.hanze.nl
>>> > User-Agent      Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0)
>>> > Gecko/20100101 Firefox/12.0
>>> > Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>> > Accept-Language en-us,en;q=0.5
>>> > Accept-Encoding gzip, deflate
>>> > Connection      keep-alive
>>> > Referer
>>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3
>>> > Cookie
>>> >  noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"
>>> >
>>> > 4. Lastly, a redirect is made to the Blackboard site (javascript check
>>> > for cookie and redirect)
>>> >
>>> > Response:
>>> > <HTML dir='ltr'><HEAD>
>>> > <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META
>>> > HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
>>> > <script language="Javascript">
>>> >  cookie_name = "cookies_enabled";
>>> >  document.cookie=cookie_name+"=yes";
>>> >  if (!document.cookie) {
>>> >    document.location.href="/nocookies.html";
>>> >  }
>>> >  document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01
>>> > GMT";
>>> > </script>
>>> > <SCRIPT language="Javascript"><!--
>>> >
>>> > document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp');
>>> > //--></SCRIPT></HEAD>
>>> > <BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'>
>>> > <br><br><br><br><div style="text-align: center;"><hr width='350'
>>> > height='5'><br>
>>> > <strong>You are being redirected to another page</strong>
>>> > <p><strong>Please Wait...</strong><br><br><hr width='350' height='5'>
>>> > <br><A
>>> > HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click
>>> > here to access the page to which you are being
>>> > forwarded.</strong></A></div>
>>> > </BODY></HTML>
>>> >
>>> > Although the first form used NTLM authentication, this doesn't work
>>> > out. Therefore, I would think that session based auth would work
>>> > better as I can create each step myself. I still haven't a clue how to
>>> > approach this. What do I fill in those boxes?
>>> >
>>> > Thanks for helping me.
>>> >
>>> > Cheers,
>>> > René
>>> >
>>> >
>>> >
>>> >
>>> > On Fri, May 11, 2012 at 4:26 PM, Karl Wright <daddy...@gmail.com> wrote:
>>> >> Hi Rene,
>>> >>
>>> >> Crawling through a proxy is usually easy, but crawling a session-based
>>> >> site is always a challenge.
>>> >>
>>> >> ISA proxies usually authenticate with NTLM.  So you will want to set
>>> >> up your web connection with NTLM authentication in order to even be
>>> >> able to reach the pages.  It's not clear that you've got that right
>>> >> yet, because if you don't have it right you will get 401 errors back.
>>> >> Getting this right is a prerequisite; you won't be able to proceed
>>> >> until it is correct.  To see that you do, try a very limited crawl
>>> >> that fetches ONLY the login page (or some other un-session-protected
>>> >> content).  If you get a 401 you'll need to figure out what's not right
>>> >> before proceeding.
>>> >>
>>> >> It sounds like the site may also be secured using session-based
>>> >> authentication.  If a cookie is involved then you need to configure
>>> >> session auth in order to get to any session-protected pages.  The
>>> >> trick is that, for session-based auth, you need to fully understand
>>> >> the sequence of pages and forms that happen when a user visits the
>>> >> site and is granted the cookie(s) - the login process, what content
>>> >> URLs are protected, what URLs are part of the login sequence, etc.
>>> >> The end-user documentation describes this in some detail.  It can be a
>>> >> challenge to get it all set up right.
>>> >>
>>> >> Finally, for SharePoint sites, if you are intending to index
>>> >> documents, you might well find the SharePoint Connector a better
>>> >> choice than trying to crawl the site with the web connector.
>>> >>
>>> >> Thanks,
>>> >> Karl
>>> >>
>>> >> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <r...@nederhand.net>
>>> >> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I am trying to get ManifoldCF to crawl our electronic learning
>>> >>> environment (Blackboard). To enable single sign-on, our institution
>>> >>> has placed an ISA server as proxy before Blackboard.
>>> >>> This is giving me a lot of problems.
>>> >>>
>>> >>> I've managed to get past the ISA server using session based
>>> >>> authentication, but then I am stuck at a 401 error message. According
>>> >>> to our architect, ISA is responsible for the communication with
>>> >>> Blackboard and will set a cookie so Blackboard will know that a
>>> >>> legitimate user is accessing its service. I think ManifoldCF is not
>>> >>> able to handle this cookie and hence is not able to access Blackboard.
>>> >>> Am I right? If so, is there a possibility to get Blackboard indexed?
>>> >>>
>>> >>> By the way, the same authentication is used for our SharePoint. I
>>> >>> would like to index this as well....
>>> >>>
>>> >>> Any help on solving this problem is appreciated.
>>> >>>
>>> >>> Cheers,
>>> >>>
>>> >>> René
>>
>>
