I created CONNECTORS-604 to track this problem.

Karl

On Wed, Jan 9, 2013 at 10:02 AM, Karl Wright <[email protected]> wrote:
> There seems to be only two differences. The Host header value is
> different, and there is an Accept header in the one that works.
> (Accept: */*)
>
> I will experiment with curl this evening to see which of these is
> causing the problem. Or, if you don't want to wait, you can use curl
> and explicitly set these headers to see which one causes it to fail.
>
> Thanks,
> Karl
>
>
> On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
> <[email protected]> wrote:
>> Thank you for your navigation.
>> I got a log from MCF 1.0.1.
>>
>> A) a log from curl
>>
>> curl -vvv "http://lucene.jugem.jp/?eid=39"
>> * About to connect() to lucene.jugem.jp port 80 (#0)
>> * Trying 210.172.160.170... connected
>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>> GET /?eid=39 HTTP/1.1
>>> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7
>>> OpenSSL/0.9.8r zlib/1.2.3
>>> Host: lucene.jugem.jp
>>> Accept: */*
>>>
>> < HTTP/1.1 200 OK
>> < Date: Wed, 09 Jan 2013 13:23:15 GMT
>> < Server: Apache/2.0.59 (Unix)
>> < Vary: User-Agent,Host,Accept-Encoding
>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>> < Accept-Ranges: bytes
>> < Content-Length: 22594
>> < Cache-Control: private
>> < Pragma: no-cache
>> < Connection: close
>> < Content-Type: text/html
>>
>>
>> B) a log from MCF 1.0.1
>>
>> DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to
>> 210.172.160.170:80
>> DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39
>> HTTP/1.1[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name:
>> lucene.jugem.jp
>> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0
>> (ApacheManifoldCFWebCrawler; [email protected])[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From:
>> [email protected][\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host:
>> lucene.jugem.jp[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << "HTTP/1.1 200 OK[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Date: Wed, 09 Jan 2013
>> 14:39:24 GMT[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Server: Apache/2.0.59
>> (Unix)[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Vary:
>> User-Agent,Host,Accept-Encoding[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Last-Modified: Tue, 08 Jan
>> 2013 07:58:33 GMT[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Accept-Ranges:
>> bytes[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Length:
>> 22594[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Cache-Control:
>> private[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Pragma: no-cache[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Connection: close[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Type:
>> text/html[\r][\n]"
>> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "[\r][\n]"
>> DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection
>> in response to directive: close
>>
>> Is it enough to diagnose?
>>
>> Thank you very much,
>> Shinichiro
>>
>>
>>
>>
>> On 2013/01/09, at 23:12, Karl Wright wrote:
>>
>>> Wire debugging with MCF 1.0.1 requires different logging.ini
>>> parameters, because it uses commons-httpclient instead. That's
>>> described here:
>>>
>>> http://hc.apache.org/httpclient-3.x/logging.html
>>>
>>> I will need a working comparison to diagnose what is happening, so
>>> please either get a log from curl, or better yet from MCF 1.0.1.
>>>
>>> Thanks!
>>> Karl
>>>
>>>
>>> On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I did wire debugging:
>>>> curl yielded a 200 while ManifoldCF trunk got a 302, ManifoldCF 1.0.1 got
>>>> a 200.
>>>>
>>>> The manifoldcf.log of trunk showed logs[1] but one of 1.0.1 showed no logs.
>>>>
>>>> [1]
>>>> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39
>>>> HTTP/1.1
>>>> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39
>>>> HTTP/1.1[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0
>>>> (ApacheManifoldCFWebCrawler; [email protected])[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From:
>>>> [email protected][\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host:
>>>> lucene.jugem.jp:80[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection:
>>>> Keep-Alive[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0
>>>> (ApacheManifoldCFWebCrawler; [email protected])
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From:
>>>> [email protected]
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
>>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
>>>> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302
>>>> Found[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013
>>>> 13:06:39 GMT[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59
>>>> (Unix)[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location:
>>>> http://error.jugem.jp/[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length:
>>>> 285[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html;
>>>> charset=iso-8859-1[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1
>>>> 302 Found
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013
>>>> 13:06:39 GMT
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59
>>>> (Unix)
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location:
>>>> http://error.jugem.jp/
>>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
>>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
>>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html;
>>>> charset=iso-8859-1
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC
>>>> "-//IETF//DTD HTML 2.0//EN">[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302
>>>> Found</title>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved
>>>> <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59
>>>> (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
>>>> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection
>>>> 0.0.0.0:56784<->210.172.160.170:80 closed
>>>>
>>>>
>>>>
>>>> Hmm.. It looks like moving to the error location anyway.
>>>>
>>>> Thanks,
>>>> Shinichiro Abe
>>>>
>>>>
>>>> On 2013/01/09, at 21:08, Karl Wright wrote:
>>>>
>>>>> Odd that curl would yield a 200 while ManifoldCF gets a 302. Maybe
>>>>> Koji's blog site does not like one of the headers, crawler-agent
>>>>> perhaps?
>>>>>
>>>>> I am behind a firewall now but I will explore this later today. In
>>>>> the meantime, if you want to research the problem, could you turn on
>>>>> wire debugging? You do this in the logging.ini file following these
>>>>> instructions:
>>>>>
>>>>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>>>>
>>>>> You should see everything happening in the log then, and you can then
>>>>> compare against curl using -vvv. Please let me know what you find.
>>>>>
>>>>> Thanks!
>>>>> Karl
>>>>>
>>>>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>>>>> <[email protected]> wrote:
>>>>>> I'm using web connector.
>>>>>>
>>>>>>> Are you trying to crawl through a proxy?
>>>>>> No. I just set seeds that url without a proxy.
>>>>>> (Also I didn't obey robots.txt)
>>>>>>
>>>>>> Using curl, it is the same as your result.
>>>>>>
>>>>>> Could you reproduce that?
>>>>>>
>>>>>> Shinichiro
>>>>>>
>>>>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>>>>
>>>>>>> When I try the URL you gave using curl and no special arguments, I get
>>>>>>> this:
>>>>>>>
>>>>>>>
>>>>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>>>>> * Trying 210.172.160.170... connected
>>>>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>>>>> GET /?eid=39 HTTP/1.1
>>>>>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c
>>>>>>>> zlib/1.2.5 librtmp/2.3
>>>>>>>> Host: lucene.jugem.jp
>>>>>>>> Accept: */*
>>>>>>>>
>>>>>>> < HTTP/1.1 200 OK
>>>>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>>>>> < Server: Apache/2.0.59 (Unix)
>>>>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>>>>> < Accept-Ranges: bytes
>>>>>>> < Content-Length: 22594
>>>>>>> < Cache-Control: private
>>>>>>> < Pragma: no-cache
>>>>>>> < Connection: close
>>>>>>> < Content-Type: text/html
>>>>>>>
>>>>>>> There's no 302 from here.
>>>>>>>
>>>>>>> Are you trying to crawl through a proxy? If so, that might be where
>>>>>>> the problem lies.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <[email protected]> wrote:
>>>>>>>> It sounds like the httpclient upgrade definitely broke something. We
>>>>>>>> should open a ticket.
>>>>>>>>
>>>>>>>> But first, can you confirm what connector this is? Is it the web
>>>>>>>> connector? If so, I am puzzled because the web connector has always
>>>>>>>> logged any 302 return, but then queued a second document which it
>>>>>>>> subsequently fetches.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm using trunk code and crawling web site with seeds which have
>>>>>>>>> http://lucene.jugem.jp/?eid=39 (koji's blog --I don't obey
>>>>>>>>> robots.txt).
>>>>>>>>> As I'm look at Simple History, it shows 302 result code at fetch
>>>>>>>>> activity and doesn't ingest document.
>>>>>>>>>
>>>>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed
>>>>>>>>> 200 result code and MCF could ingest documents.
>>>>>>>>>
>>>>>>>>> Why does the trunk shows 302 status? Is it relevant to upgrading
>>>>>>>>> httpclient?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Shinichiro Abe
>>>>>>
>>>>
>>
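
The suggestion in the quoted thread, to set the request headers explicitly and see which one causes the failure, can also be scripted. What follows is only a rough sketch in Python rather than curl. The header values are copied from the trunk and MCF 1.0.1 wire logs quoted above (the addresses appear redacted in this archive and are kept as-is). The function names `diff_headers`, `single_change_variants`, `fetch_status`, and `probe` are made up for illustration and are not part of ManifoldCF or httpclient. In the quoted logs neither ManifoldCF request carries an Accept header, so the sketch compares trunk against 1.0.1 rather than against curl.

```python
import http.client

# Request headers as they appear in the quoted wire logs.
# The e-mail addresses are redacted in the archive and kept as-is.
USER_AGENT = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])"
MCF_1_0_1 = {  # returned 200 in the quoted log
    "User-Agent": USER_AGENT,
    "From": "[email protected]",
    "Host": "lucene.jugem.jp",
}
MCF_TRUNK = {  # returned 302 in the quoted log
    "User-Agent": USER_AGENT,
    "From": "[email protected]",
    "Host": "lucene.jugem.jp:80",
    "Connection": "Keep-Alive",
}


def diff_headers(a, b):
    """Map lowercased header name -> (value in a, value in b) for headers
    that differ; None means the header is absent on that side."""
    la = {k.lower(): v for k, v in a.items()}
    lb = {k.lower(): v for k, v in b.items()}
    names = sorted(set(la) | set(lb))
    return {n: (la.get(n), lb.get(n)) for n in names if la.get(n) != lb.get(n)}


def single_change_variants(failing, working):
    """Yield (description, headers): the failing header set with exactly one
    differing header switched to its value in the working set."""
    for name, (bad, good) in diff_headers(failing, working).items():
        variant = {k: v for k, v in failing.items() if k.lower() != name}
        if good is not None:
            variant[name] = good  # header names are case-insensitive
        yield f"{name}: {bad!r} -> {good!r}", variant


def fetch_status(host, path, headers, port=80):
    """Send a GET carrying only the given headers and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # skip_host / skip_accept_encoding stop http.client from adding
        # its own Host and Accept-Encoding headers.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        return conn.getresponse().status
    finally:
        conn.close()


def probe(host, path):
    """Print the status for the failing set and for each one-header variant."""
    print("trunk headers:", fetch_status(host, path, MCF_TRUNK))
    for description, headers in single_change_variants(MCF_TRUNK, MCF_1_0_1):
        print(description, "=>", fetch_status(host, path, headers))
```

Calling `probe("lucene.jugem.jp", "/?eid=39")` would send the trunk header set and then one variant per differing header. Whichever single change turns the 302 into a 200 points at the header responsible. This assumes the site still responds as it did when the logs were captured, which may no longer be the case.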
