There seem to be only two differences.  The Host header value is
different, and there is an Accept header in the one that works.
(Accept: */*)
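
To make the comparison concrete, here is a small sketch that diffs two
sets of request headers.  The header values are copied from the curl and
trunk logs quoted below; only headers whose values are fully visible in
those logs are included, and the function name diff_headers is just
illustrative.  Note that it also reports the Connection header, which
appears only in the trunk request.

```python
def diff_headers(working, failing):
    """Return (only_in_working, only_in_failing, changed) for two header dicts."""
    # HTTP header names are case-insensitive, so compare them lowercased.
    w = {k.lower(): v for k, v in working.items()}
    f = {k.lower(): v for k, v in failing.items()}
    only_working = sorted(set(w) - set(f))
    only_failing = sorted(set(f) - set(w))
    changed = {k: (w[k], f[k]) for k in sorted(set(w) & set(f)) if w[k] != f[k]}
    return only_working, only_failing, changed


# Values taken from the quoted logs: curl got a 200, trunk got a 302.
curl_request = {"Host": "lucene.jugem.jp", "Accept": "*/*"}
trunk_request = {"Host": "lucene.jugem.jp:80", "Connection": "Keep-Alive"}

only_curl, only_trunk, changed = diff_headers(curl_request, trunk_request)
print("only in curl: ", only_curl)
print("only in trunk:", only_trunk)
print("changed:      ", changed)
```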

I will experiment with curl this evening to see which of these is
causing the problem.  Or, if you don't want to wait, you can use curl
and explicitly set these headers to see which one causes it to fail.
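
As a rough sketch of that experiment, the script below prints one curl
command per header variant, each changing a single thing relative to
curl's default request.  The variant names and the Python wrapper are
only illustrative.  The curl options themselves are standard: -H sets a
header, a header given as "Name:" with no value is dropped from the
request, and -w "%{http_code}" prints the response status.

```python
import os
import shlex

URL = "http://lucene.jugem.jp/?eid=39"

# Each variant changes one header relative to curl's default request.
VARIANTS = {
    "baseline": [],
    "host-with-port": ["-H", "Host: lucene.jugem.jp:80"],
    "no-accept": ["-H", "Accept:"],  # empty value: curl omits the header
    "both": ["-H", "Host: lucene.jugem.jp:80", "-H", "Accept:"],
}


def curl_command(extra_args, url=URL):
    """Build a curl argument list that prints only the HTTP status code."""
    # The raw string keeps the backslash so curl itself expands \n.
    return ["curl", "-s", "-o", os.devnull, "-w", r"%{http_code}\n",
            *extra_args, url]


for name, extra in VARIANTS.items():
    # shlex.join quotes for a POSIX shell; adjust quoting for cmd.exe.
    print(f"{name}: {shlex.join(curl_command(extra))}")
```

Each argument list can also be passed straight to subprocess.run() to
execute it.  Whichever variant returns 302 instead of 200 points at the
header responsible.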

Thanks,
Karl


On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
<[email protected]> wrote:
> Thank you for your guidance.
> I got a log from MCF 1.0.1.
>
> A) a log from curl
>
> curl -vvv "http://lucene.jugem.jp/?eid=39"
> * About to connect() to lucene.jugem.jp port 80 (#0)
> *   Trying 210.172.160.170... connected
> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>> GET /?eid=39 HTTP/1.1
>> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 
>> OpenSSL/0.9.8r zlib/1.2.3
>> Host: lucene.jugem.jp
>> Accept: */*
>>
> < HTTP/1.1 200 OK
> < Date: Wed, 09 Jan 2013 13:23:15 GMT
> < Server: Apache/2.0.59 (Unix)
> < Vary: User-Agent,Host,Accept-Encoding
> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
> < Accept-Ranges: bytes
> < Content-Length: 22594
> < Cache-Control: private
> < Pragma: no-cache
> < Connection: close
> < Content-Type: text/html
>
>
> B) a log from MCF 1.0.1
>
> DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 
> 210.172.160.170:80
> DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> "GET /?eid=39 
> HTTP/1.1[\r][\n]"
> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: 
> lucene.jugem.jp
> DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "User-Agent: Mozilla/5.0 
> (ApacheManifoldCFWebCrawler; [email protected])[\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "From: 
> [email protected][\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "Host: 
> lucene.jugem.jp[\r][\n]"
> DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> "[\r][\n]"
> DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << "HTTP/1.1 200 OK[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Date: Wed, 09 Jan 2013 
> 14:39:24 GMT[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Server: Apache/2.0.59 
> (Unix)[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Vary: 
> User-Agent,Host,Accept-Encoding[\r][\n]"
> DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << "Last-Modified: Tue, 08 Jan 
> 2013 07:58:33 GMT[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Accept-Ranges: bytes[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Length: 
> 22594[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Cache-Control: 
> private[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Pragma: no-cache[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Connection: close[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "Content-Type: 
> text/html[\r][\n]"
> DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << "[\r][\n]"
> DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection 
> in response to directive: close
>
> Is this enough to diagnose the problem?
>
> Thank you very much,
> Shinichiro
>
>
>
>
> On 2013/01/09, at 23:12, Karl Wright wrote:
>
>> Wire debugging with MCF 1.0.1 requires different logging.ini
>> parameters, because it uses commons-httpclient instead.  That's
>> described here:
>>
>> http://hc.apache.org/httpclient-3.x/logging.html
>>
>> I will need a working comparison to diagnose what is happening, so
>> please either get a log from curl, or better yet from MCF 1.0.1.
>>
>> Thanks!
>> Karl
>>
>>
>> On Wed, Jan 9, 2013 at 9:04 AM, Shinichiro Abe
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I did wire debugging:
>>> curl yielded a 200 and ManifoldCF 1.0.1 got a 200, while ManifoldCF
>>> trunk got a 302.
>>>
>>> The trunk manifoldcf.log showed the log below [1], but the 1.0.1 log showed nothing.
>>>
>>> [1]
>>> DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 
>>> HTTP/1.1
>>> DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> "GET /?eid=39 
>>> HTTP/1.1[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> "User-Agent: Mozilla/5.0 
>>> (ApacheManifoldCFWebCrawler; [email protected])[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "From: 
>>> [email protected][\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Host: 
>>> lucene.jugem.jp:80[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "Connection: 
>>> Keep-Alive[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> "[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 
>>> (ApacheManifoldCFWebCrawler; [email protected])
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: 
>>> [email protected]
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
>>> DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
>>> DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << "HTTP/1.1 302 Found[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Date: Wed, 09 Jan 2013 
>>> 13:06:39 GMT[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << "Server: Apache/2.0.59 
>>> (Unix)[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Location: 
>>> http://error.jugem.jp/[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Length: 
>>> 285[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Connection: close[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "Content-Type: text/html; 
>>> charset=iso-8859-1[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << "[\r][\n]"
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 
>>> 302 Found
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 
>>> 13:06:39 GMT
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: 
>>> http://error.jugem.jp/
>>> DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
>>> DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; 
>>> charset=iso-8859-1
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<!DOCTYPE HTML PUBLIC 
>>> "-//IETF//DTD HTML 2.0//EN">[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<html><head>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<title>302 
>>> Found</title>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "</head><body>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<h1>Found</h1>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<p>The document has moved 
>>> <a href="http://error.jugem.jp/">here</a>.</p>[\n]"
>>> DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << "<hr>[\n]"
>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "<address>Apache/2.0.59 
>>> (Unix) Server at lucene.jugem.jp Port 80</address>[\n]"
>>> DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << "</body></html>[\n]"
>>> DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 
>>> 0.0.0.0:56784<->210.172.160.170:80 closed
>>>
>>>
>>>
>>> Hmm. It looks like it is being redirected to the error location anyway.
>>>
>>> Thanks,
>>> Shinichiro Abe
>>>
>>>
>>> On 2013/01/09, at 21:08, Karl Wright wrote:
>>>
>>>> Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
>>>> Koji's blog site does not like one of the headers, crawler-agent
>>>> perhaps?
>>>>
>>>> I am behind a firewall now but I will explore this later today.  In
>>>> the meantime, if you want to research the problem, could you turn on
>>>> wire debugging?  You do this in the logging.ini file following these
>>>> instructions:
>>>>
>>>> http://hc.apache.org/httpcomponents-client-ga/logging.html
>>>>
>>>> You should see everything happening in the log then, and you can then
>>>> compare against curl using -vvv.  Please let me know what you find.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
>>>> <[email protected]> wrote:
>>>>> I'm using web connector.
>>>>>
>>>>>> Are you trying to crawl through a proxy?
>>>>> No. I just set that URL as a seed, without a proxy.
>>>>> (Also, I am not obeying robots.txt.)
>>>>>
>>>>> With curl, I get the same result as you.
>>>>>
>>>>> Could you reproduce that?
>>>>>
>>>>> Shinichiro
>>>>>
>>>>> On 2013/01/09, at 17:49, Karl Wright wrote:
>>>>>
>>>>>> When I try the URL you gave using curl and no special arguments, I get 
>>>>>> this:
>>>>>>
>>>>>>
>>>>>> C:\Users\Karl>curl -vvv "http://lucene.jugem.jp/?eid=39"
>>>>>> * About to connect() to lucene.jugem.jp port 80 (#0)
>>>>>> *   Trying 210.172.160.170... connected
>>>>>> * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
>>>>>>> GET /?eid=39 HTTP/1.1
>>>>>>> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
>>>>>>> Host: lucene.jugem.jp
>>>>>>> Accept: */*
>>>>>>>
>>>>>> < HTTP/1.1 200 OK
>>>>>> < Date: Wed, 09 Jan 2013 08:47:52 GMT
>>>>>> < Server: Apache/2.0.59 (Unix)
>>>>>> < Vary: User-Agent,Host,Accept-Encoding
>>>>>> < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
>>>>>> < Accept-Ranges: bytes
>>>>>> < Content-Length: 22594
>>>>>> < Cache-Control: private
>>>>>> < Pragma: no-cache
>>>>>> < Connection: close
>>>>>> < Content-Type: text/html
>>>>>>
>>>>>> There's no 302 from here.
>>>>>>
>>>>>> Are you trying to crawl through a proxy?  If so, that might be where
>>>>>> the problem lies.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright <[email protected]> wrote:
>>>>>>> It sounds like the httpclient upgrade definitely broke something.  We
>>>>>>> should open a ticket.
>>>>>>>
>>>>>>> But first, can you confirm what connector this is?  Is it the web
>>>>>>> connector?  If so, I am puzzled because the web connector has always
>>>>>>> logged any 302 return, but then queued a second document which it
>>>>>>> subsequently fetches.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 9, 2013 at 2:10 AM, Shinichiro Abe
>>>>>>> <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using trunk code and crawling a web site with seeds that include
>>>>>>>> http://lucene.jugem.jp/?eid=39 (Koji's blog; I am not obeying robots.txt).
>>>>>>>> When I look at Simple History, it shows a 302 result code for the fetch
>>>>>>>> activity, and the document is not ingested.
>>>>>>>>
>>>>>>>> When I used MCF 1.0.1 in the same situation, Simple History showed a 200
>>>>>>>> result code and MCF ingested the documents.
>>>>>>>>
>>>>>>>> Why does trunk show a 302 status? Is it related to the httpclient
>>>>>>>> upgrade?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Shinichiro Abe
>>>>>
>>>
>
