I was wondering if you'd picked up and tried the patch for CONNECTORS-483. This patch adds official proxy support for the Web Connector. Alternatively, you could try to build and run with trunk code.
Karl

On Wed, May 16, 2012 at 12:12 PM, Karl Wright <daddy...@gmail.com> wrote:
> Hi Rene,
>
> The URL that is causing the RFC2617 challenge/response is being
> authenticated with basic auth, not NTLM. This could yield a 401. You
> may want to check the URL in a browser other than IE (Firefox, for
> instance) to see if basic auth is being used for this URL rather than
> NTLM.
>
> The redirection you describe to GetLogon is pretty standard practice.
> You can easily tell the web connector that that is part of the logon
> sequence by following the steps I laid out in the earlier email.
>
> Once you have set up what you think is the right set of logon pages,
> it's very helpful to attempt a crawl and then see what the simple
> history shows. There are specific activities logged when logon begins
> and ends, so this is enormously helpful as a diagnostic aid. If you
> see a continuous loop (entering logon sequence, doing stuff, exiting
> logon sequence, and repeating) then it is clear that the cookie has
> not been set.
>
> I won't be able to look at your packet log for a while, probably at
> least a week.
>
> Karl
>
>
> On Wed, May 16, 2012 at 10:23 AM, Rene Nederhand <r...@nederhand.net> wrote:
>> Hi Karl,
>>
>> Thank you so much for putting so much time into educating a newbie. I
>> appreciate your help enormously.
>>
>> I've tried to follow each of the steps below. So far, it doesn't work but I
>> will continue this evening to see if I can get this thing going.
>>
>> In the meantime, I have switched the log level of the crawling process to "INFO"
>> and found something interesting in the logs.
>> Perhaps this could shed some
>> light on my issues:
>>
>> ERROR 2012-05-16 16:04:13,581 (Thread-1019) - Invalid challenge: Basic
>> org.apache.commons.httpclient.auth.MalformedChallengeException: Invalid challenge: Basic
>> at org.apache.commons.httpclient.auth.AuthChallengeParser.extractParams(Unknown Source)
>> at org.apache.commons.httpclient.auth.RFC2617Scheme.processChallenge(Unknown Source)
>> at org.apache.commons.httpclient.auth.BasicScheme.processChallenge(Unknown Source)
>> at org.apache.commons.httpclient.auth.AuthChallengeProcessor.processChallenge(Unknown Source)
>> at org.apache.commons.httpclient.HttpMethodDirector.processWWWAuthChallenge(Unknown Source)
>> at org.apache.commons.httpclient.HttpMethodDirector.processAuthenticationResponse(Unknown Source)
>> at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
>> at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
>> at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
>>
>> Please note that I have set NTLM (not BASIC) authentication on
>> "bb.helo.hanze.nl" and nothing else. The error does not occur when I try to
>> crawl our intranet (also with NTLM). Does this mean something? At least, I
>> think it is the source of the 401 I get when looking at the simple report,
>> isn't it?
>>
>> In addition, I've used Charles proxy to monitor all interaction between my
>> browser and the server. I have found that it doesn't matter which URL I use
>> to enter Blackboard; they all get redirected to
>> https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon. Shouldn't page based
>> authentication handle this?
>>
>> To make the information complete, I've added the HAR file with the
>> CharlesProxy output. It can be displayed
>> at http://www.softwareishard.com/har/viewer/ for example.
>> You'll be able to
>> see all requests/responses when I start with a clean browser (cookies
>> removed) entering https://bb.helo.hanze.nl. Maybe this helps.
>>
>> Again, thanks a lot for your help!
>>
>> René
>>
>>
>> On Tue, May 15, 2012 at 5:59 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>> Hi Rene,
>>>
>>> You will need both NTLM auth (page auth, which you have already set
>>> up), and Session auth (which you haven't yet set up).
>>>
>>> In order to set up session-based auth, you should first identify the
>>> set of pages that you want access to that are protected by a cookie
>>> requirement. You will need to write a regular expression that matches
>>> these pages and ONLY these pages. This regular expression gets entered
>>> as the "URL regular expression" on the Access Credentials tab in the
>>> Session-based Access Credentials part of the tab. Then, click the Add
>>> button.
>>>
>>> The next thing you will need is to specify how the connector
>>> recognizes pages that belong to the logon sequence. The actual
>>> sequence you need to understand is what happens in the browser when
>>> you try to access a specific protected URL and you don't have the
>>> right cookie. You did not actually specify that; I think you are
>>> presuming that you'd be entering directly through the logon page, but
>>> that is not how it works. The crawler will have a URL in mind and
>>> will need access to the content of that URL. It will fetch the URL,
>>> and if the actual content is NOT fetched, we need to detect that
>>> situation and consider it part of the logon sequence.
>>>
>>> So let's pretend that what happens when the cookie is not present is
>>> that you get a redirection to the logon page, instead of the actual
>>> page content.
>>> In that case, you would create a login sequence page
>>> description consisting of the same URL regular expression that
>>> describes the protected content pages, plus the "redirection" radio
>>> button, plus a target URL regular expression that would match
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon". You then click the Add
>>> button for login pages to add that description to the set of login
>>> pages.
>>>
>>> Next, the GetLogon page itself needs to be added as a login sequence
>>> page. The regular expression should match only
>>> "bb.helo.hanze.nl/CookieAuth.dll?GetLogon". The type of the page is
>>> "form" because you said this was a form where you could fill in your
>>> login credentials. If there is only one form on the page you can
>>> leave the regexp that matches the form name blank since that will
>>> match everything. Once you click "Add" for this page, you will have
>>> the opportunity to fill in form names and values to post when the form
>>> gets posted.
>>>
>>> It was not clear from your description, once again, what happens after
>>> the Logon page is posted. If there is a special target page, you need
>>> to include that also in the login sequence so that its content is not
>>> taken. If there is a redirection back to the original content page,
>>> you'd include that redirection.
>>>
>>> Hopefully this is beginning to make a bit of sense to you; but this is
>>> the general picture, not related to your actual site that closely.
>>> For example, the Javascript redirection you mentioned will not be
>>> processed by ManifoldCF, but that is unnecessary because at the end of
>>> the whole login sequence ManifoldCF automatically goes back to the
>>> original URL when the login sequence is chased to its end. So all you
>>> need to do is make sure that all pages that are part of that sequence
>>> are specified.
>>>
>>> On the other hand, it's not clear that the code you have "protecting"
>>> the site sets cookies any other way than through Javascript. The
The >>> cookie that this Javascript actually sets is a really stupid >>> non-specific cookie, but unless it is set by the standard response >>> header method, I don't think it's going to wind up being set at all. >>> Can you confirm that this is the only way the cookie gets set? >>> >>> Karl >>> >>> On Tue, May 15, 2012 at 10:57 AM, Rene Nederhand <r...@nederhand.net> >>> wrote: >>> > Hi Karl, >>> > >>> > Thank you so much for your detailed explanation. I am trying each >>> > step you've pointed out. Unfortunately, I cannot get this thing going. >>> > Hopefully, you can help me if I give you more detailed information. >>> > >>> > The sequence of steps is (when accessing https://bb.helo.hanze.nl): >>> > >>> > 1. >>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3 >>> > This gives me indeed NTLM authentication. When I create a crawler that >>> > only crawls the above page I get a 200 response. So this works, no >>> > 401. >>> > >>> > 2. If I submit my username and password. This request is sent to the >>> > server. This is also the only form I'll ever see.: >>> > >>> > https://bb.helo.hanze.nl/CookieAuth.dll?Logon (302) >>> > Request: >>> > curl Z2F >>> > flags 0 >>> > forcedownlevel 0 >>> > formdir 3 >>> > trusted 0 >>> > username loginname >>> > password mypassword >>> > SubmitCreds Log On >>> > >>> > 3. 
The response is a cookie being set with a redirect to the first url >>> > (but now with the cookie set) >>> > >>> > Response: >>> > HTTP/1.1 302 Moved Temporarily >>> > Location https://bb.helo.hanze.nl/ >>> > Set-Cookie >>> > noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9"; >>> > HttpOnly; Domain=.hanze.nl; secure; path=/ >>> > Content-Length 0 >>> > Connection close >>> > >>> > Request: >>> > GET / HTTP/1.1 >>> > Host bb.helo.hanze.nl >>> > User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0) >>> > Gecko/20100101 Firefox/12.0 >>> > Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 >>> > Accept-Language en-us,en;q=0.5 >>> > Accept-Encoding gzip, deflate >>> > Connection keep-alive >>> > Referer >>> > https://bb.helo.hanze.nl/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3 >>> > Cookie >>> > noname="2991b0bdb-4057-47e3-b5e9-5e6111bd2974Jcaev8jiltKQd6/PUz1iDNkUTWaUznKpRyu3I9AzzLVKWElBoFRTZAWRZik+qp3wntGyNI2L5GQzjdzyaWogpvMYv93dKChgpwYenrI+uxJgTxiCprPhcRsNs3SYX1p9" >>> > >>> > 4. 
Lastly, a redirect is made to the Blackboard site (javascript check >>> > for cookie and redirect) >>> > >>> > Response: >>> > <HTML dir='ltr'><HEAD> >>> > <META HTTP-EQUIV="Pragma" CONTENT="no-cache"><META >>> > HTTP-EQUIV="Cache-Control" CONTENT="no-cache"> >>> > <script language="Javascript"> >>> > cookie_name = "cookies_enabled"; >>> > document.cookie=cookie_name+"=yes"; >>> > if (!document.cookie) { >>> > document.location.href="/nocookies.html"; >>> > } >>> > document.cookie=cookie_name+"yes;expires=Thu, 01-Jan-1970 00:00:01 >>> > GMT"; >>> > </script> >>> > <SCRIPT language="Javascript"><!-- >>> > >>> > document.location.replace('https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'); >>> > //--></SCRIPT></HEAD> >>> > <BODY BGCOLOR='#FFFFFF' LINK='#000000' ALINK='#000000'> >>> > <br><br><br><br><div style="text-align: center;"><hr width='350' >>> > height='5'><br> >>> > <strong>You are being redirected to another page</strong> >>> > <p><strong>Please Wait...</strong><br><br><hr width='350' height='5'> >>> > <br><A >>> > HREF='https://bb.helo.hanze.nl/webapps/portal/frameset.jsp'><strong>Click >>> > here to access the page to which you are being >>> > forwarded.</strong></A></div> >>> > </BODY></HTML> >>> > >>> > Although the first form used NTLM authentication, this doesn't work >>> > out. Therefore, I would think that session based auth would work >>> > better as I can create each step myself. I still haven't a clue how to >>> > approach this. What do I fill in those boxes? >>> > >>> > Thanks for helping me. >>> > >>> > Cheers, >>> > René >>> > >>> > >>> > >>> > >>> > On Fri, May 11, 2012 at 4:26 PM, Karl Wright <daddy...@gmail.com> wrote: >>> >> Hi Rene, >>> >> >>> >> Crawling through a proxy is usually easy, but crawling a session-based >>> >> site is always a challenge. >>> >> >>> >> ISA proxies usually authenticate with NTLM. 
>>> >> So you will want to set
>>> >> up your web connection with NTLM authentication in order to even be
>>> >> able to reach the pages. It's not clear that you've got that right
>>> >> yet, because if you don't have it right you will get 401 errors back.
>>> >> Getting this right is a prerequisite; you won't be able to proceed
>>> >> until it is correct. To verify that you do, try a very limited crawl
>>> >> that fetches ONLY the login page (or some other un-session-protected
>>> >> content). If you get a 401 you'll need to figure out what's not right
>>> >> before proceeding.
>>> >>
>>> >> It sounds like the site may also be secured using session-based
>>> >> authentication. If a cookie is involved then you need to configure
>>> >> session auth in order to get to any session-protected pages. The
>>> >> trick is that, for session-based auth, you need to fully understand
>>> >> the sequence of pages and forms that happen when a user visits the
>>> >> site and is granted the cookie(s) - the login process, what content
>>> >> URLs are protected, what URLs are part of the login sequence, etc.
>>> >> The end-user documentation describes this in some detail. It can be a
>>> >> challenge to get it all set up right.
>>> >>
>>> >> Finally, for SharePoint sites, if you are intending to index
>>> >> documents, you might well find the SharePoint Connector a better
>>> >> choice than trying to crawl the site with the web connector.
>>> >>
>>> >> Thanks,
>>> >> Karl
>>> >>
>>> >> On Fri, May 11, 2012 at 10:13 AM, Rene Nederhand <r...@nederhand.net>
>>> >> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I am trying to get ManifoldCF to crawl our electronic learning
>>> >>> environment (Blackboard). To enable single sign-on, our institution
>>> >>> has placed an ISA server as proxy before Blackboard.
>>> >>> This is giving me a lot of problems.
>>> >>>
>>> >>> I've managed to get past the ISA server using session based
>>> >>> authentication, but then I am stuck at a 401 error message. According
According >>> >>> to our architect, ISA is responsible for the communication with >>> >>> Blackboard and will set a cookie so Blackboard will know it a >>> >>> legitimate user is accessing its service. I think, ManifoldCF is not >>> >>> able to handle this cookie and hence is not able to access Blackboard. >>> >>> Am I right? If so, is there a possibility to get Blackboard indexed? >>> >>> >>> >>> By the way, the same authentication is used for our Sharepoint. I >>> >>> would like to index this as well.... >>> >>> >>> >>> Any help on solving this problem is appreciated. >>> >>> >>> >>> Cheers, >>> >>> >>> >>> René >> >>