Hi,

I turned on wire debugging. curl and ManifoldCF 1.0.1 both got a 200, while ManifoldCF trunk got a 302.
The manifoldcf.log from trunk showed the log below [1], but the one from 1.0.1 showed nothing.

[1]
DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 HTTP/1.1[\r][\n]"
DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: [email protected][\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: lucene.jugem.jp:80[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: Keep-Alive[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: [email protected]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]"
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 (Unix)[\r][\n]"
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: http://error.jugem.jp/[\r][\n]"
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 285[\r][\n]"
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; charset=iso-8859-1[\r][\n]"
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302 Found</title>[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed

Hmm, it looks like the server redirects to the error location anyway.

Thanks,
Shinichiro Abe

On 2013/01/09, at 21:08, Karl Wright wrote:

> Odd that curl would yield a 200 while ManifoldCF gets a 302. Maybe
> Koji's blog site does not like one of the headers, crawler-agent
> perhaps?
>
> I am behind a firewall now but I will explore this later today. In
> the meantime, if you want to research the problem, could you turn on
> wire debugging? You do this in the logging.ini file following these
> instructions:
>
> http://hc.apache.org/httpcomponents-client-ga/logging.html
>
> You should see everything happening in the log then, and you can then
> compare against curl using -vvv. Please let me know what you find.
>
> Thanks!
> Karl
>
> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
> <[email protected]> wrote:
>> I'm using web connector.
>>
>>> Are you trying to crawl through a proxy?
>> No. I just set seeds that url without a proxy.
>> (Also I didn't obey robots.txt)
>>
>> Using curl, it is the same as your result.
>>
>> Could you reproduce that?
>>
>> Shinichiro
>>
>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>
>>> When I try the URL you gave using curl and no special arguments, I get this:
>>>
>>>
>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>> * Trying 210.172.160.170... connected
>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>> GET /?eid=39 HTTP/1.1
>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>> Host: lucene.jugem.jp
>>>> Accept: */*
>>>>
>>> < HTTP/1.1 200 OK
>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>> < Server: Apache/2.0.59 (Unix)
>>> < Vary: User-Agent,Host,Accept-Encoding
>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>> < Accept-Ranges: bytes
>>> < Content-Length: 22594
>>> < Cache-Control: private
>>> < Pragma: no-cache
>>> < Connection: close
>>> < Content-Type: text/html
>>>
>>> There's no 302 from here.
>>>
>>> Are you trying to crawl through a proxy? If so, that might be where
>>> the problem lies.
>>>
>>> Karl
>>>
>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <[email protected]> wrote:
>>>> It sounds like the httpclient upgrade definitely broke something. We
>>>> should open a ticket.
>>>>
>>>> But first, can you confirm what connector this is? Is it the web
>>>> connector? If so, I am puzzled because the web connector has always
>>>> logged any 302 return, but then queued a second document which it
>>>> subsequently fetches.
>>>>
>>>> Karl
>>>>
>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>> <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I'm using trunk code and crawling web site with seeds which have
>>>>> http://lucene.jugem.jp/?eid=39 (koji's blog --I don't obey robots.txt).
>>>>> As I'm look at Simple History, it shows 302 result code at fetch activity
>>>>> and doesn't ingest document.
>>>>>
>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed 200
>>>>> result code and MCF could ingest documents.
>>>>>
>>>>> Why does the trunk shows 302 status? Is it relevant to upgrading
>>>>> httpclient?
>>>>>
>>>>> Thanks in advance,
>>>>> Shinichiro Abe
>>
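
One way to follow up on the header theory raised in this thread is to replay the request while changing one header at a time, starting from the curl-like request that returned 200. The sketch below is not ManifoldCF or HttpClient code. It assumes Python's standard http.client, which does not follow redirects, so a 302 stays visible. The two header sets are modelled on the traces in this thread (curl's User-Agent is shortened); crawler@example.com is a placeholder because the real address is redacted in the archive, and the names variants, fetch_status and main are made up for this illustration.

```python
import http.client

HOST = "lucene.jugem.jp"
PATH = "/?eid=39"

# Modelled on the curl trace in this thread (the request that got a 200).
CURL_LIKE = {
    "Host": HOST,
    "User-Agent": "curl/7.21.7",
    "Accept": "*/*",
}

# Modelled on the trunk wire log (the request that got a 302).
# crawler@example.com is a placeholder, not the address actually used.
CRAWLER_LIKE = {
    "Host": HOST + ":80",
    "User-Agent": "Mozilla/5.0 (ApacheManifoldCFWebCrawler; crawler@example.com)",
    "From": "crawler@example.com",
    "Connection": "Keep-Alive",
}


def variants(base, other):
    """Yield (label, headers): base first, then base with exactly one
    header changed to match other (or removed if other lacks it)."""
    yield "baseline", dict(base)
    for name in sorted(set(base) | set(other)):
        if base.get(name) == other.get(name):
            continue
        changed = dict(base)
        if name in other:
            changed[name] = other[name]
        else:
            del changed[name]
        yield f"{name} -> {other.get(name)!r}", changed


def fetch_status(headers, host=HOST, path=PATH, timeout=10):
    """Send one GET with exactly these headers; return (status, Location)."""
    conn = http.client.HTTPConnection(host, 80, timeout=timeout)
    try:
        # skip_host / skip_accept_encoding stop http.client from adding
        # its own Host and Accept-Encoding headers to the request.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def main():
    # Needs network access; prints one line per single-header change.
    for label, headers in variants(CURL_LIKE, CRAWLER_LIKE):
        print(label, fetch_status(headers))
```

Calling main() prints the status and any Location header for each variant. If only one variant flips from 200 to 302, that header is the likely trigger; if none does, the cause is probably a combination of headers or something outside them.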
