Re: Http status code 302

2013-01-09 Thread Karl Wright
When I try the URL you gave using curl and no special arguments, I get this:


C:\Users\Karl>curl -vvv http://lucene.jugem.jp/?eid=39
* About to connect() to lucene.jugem.jp port 80 (#0)
*   Trying 210.172.160.170... connected
* Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
> GET /?eid=39 HTTP/1.1
> User-Agent: curl/7.21.7 (i386-pc-win32) libcurl/7.21.7 OpenSSL/1.0.0c zlib/1.2.5 librtmp/2.3
> Host: lucene.jugem.jp
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 09 Jan 2013 08:47:52 GMT
< Server: Apache/2.0.59 (Unix)
< Vary: User-Agent,Host,Accept-Encoding
< Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
< Accept-Ranges: bytes
< Content-Length: 22594
< Cache-Control: private
< Pragma: no-cache
< Connection: close
< Content-Type: text/html

There's no 302 from here.

Are you trying to crawl through a proxy?  If so, that might be where
the problem lies.

Karl

On Wed, Jan 9, 2013 at 3:40 AM, Karl Wright daddy...@gmail.com wrote:
 It sounds like the httpclient upgrade definitely broke something.  We
 should open a ticket.

 But first, can you confirm what connector this is?  Is it the web
 connector?  If so, I am puzzled because the web connector has always
 logged any 302 return, but then queued a second document which it
 subsequently fetches.

 Karl



Re: Http status code 302

2013-01-09 Thread Karl Wright
Odd that curl would yield a 200 while ManifoldCF gets a 302.  Maybe
Koji's blog site does not like one of the headers, crawler-agent
perhaps?

I am behind a firewall now but I will explore this later today.  In
the meantime, if you want to research the problem, could you turn on
wire debugging?  You do this in the logging.ini file following these
instructions:

http://hc.apache.org/httpcomponents-client-ga/logging.html

You should see everything happening in the log then, and you can then
compare against curl using -vvv.  Please let me know what you find.

Thanks!
Karl
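
One way to test the "does not like one of the headers" idea is to send the same GET twice, changing a single header, and compare the raw status codes. The following is a minimal sketch using only Python's standard library, not ManifoldCF's own code. The names fetch_status, probe_variants and VARIANTS are hypothetical, the curl User-Agent is abbreviated, and the crawler contact address is a placeholder.

```python
import http.client


def fetch_status(host, path, headers, port=80, timeout=10):
    """Send one GET carrying exactly `headers`; return (status, Location)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # skip_host/skip_accept_encoding keep http.client from adding its own
        # Host and Accept-Encoding headers, so only our headers go on the wire.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        response.read()  # drain the body; http.client does not follow redirects
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def probe_variants(host, path, variants, port=80):
    """Run fetch_status once per labelled header set."""
    return {label: fetch_status(host, path, headers, port=port)
            for label, headers in variants.items()}


# Assumed header sets: one curl-like, one with a crawler-style User-Agent.
# The contact address is a placeholder, not the one used in the thread.
VARIANTS = {
    "curl_like": {
        "Host": "lucene.jugem.jp",
        "User-Agent": "curl/7.21.7",
        "Accept": "*/*",
    },
    "crawler_agent": {
        "Host": "lucene.jugem.jp",
        "User-Agent": "Mozilla/5.0 (ApacheManifoldCFWebCrawler; crawler@example.com)",
        "Accept": "*/*",
    },
}
```

Calling probe_variants("lucene.jugem.jp", "/?eid=39", VARIANTS) returns one (status, Location) pair per label. If the two labels produce different status codes, the header that changed between them is the likely trigger.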

On Wed, Jan 9, 2013 at 4:29 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 I'm using the web connector.

 > Are you trying to crawl through a proxy?
 No. I just set that URL as a seed, without a proxy.
 (Also, I didn't obey robots.txt.)

 With curl, I get the same result as you.

 Could you reproduce that?

 Shinichiro



Re: Http status code 302

2013-01-09 Thread Shinichiro Abe
Hi,

I did wire debugging:
curl yielded a 200 and ManifoldCF 1.0.1 got a 200, while ManifoldCF trunk got a 302.

The manifoldcf.log for trunk showed the log below [1], but the one for 1.0.1 showed no wire log output.

[1]
DEBUG 2013-01-09 22:07:26,494 (Thread-474) - Sending request: GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,495 (Thread-474) - >> GET /?eid=39 HTTP/1.1[\r][\n]
DEBUG 2013-01-09 22:07:26,496 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.ab...@gmail.com[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive[\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> [\r][\n]
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> GET /?eid=39 HTTP/1.1
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> From: shinichiro.ab...@gmail.com
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Host: lucene.jugem.jp:80
DEBUG 2013-01-09 22:07:26,497 (Thread-474) - >> Connection: Keep-Alive
DEBUG 2013-01-09 22:07:26,556 (Thread-474) - << HTTP/1.1 302 Found[\r][\n]
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT[\r][\n]
DEBUG 2013-01-09 22:07:26,561 (Thread-474) - << Server: Apache/2.0.59 (Unix)[\r][\n]
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << Location: http://error.jugem.jp/[\r][\n]
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << Content-Length: 285[\r][\n]
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << Connection: close[\r][\n]
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1[\r][\n]
DEBUG 2013-01-09 22:07:26,562 (Thread-474) - << [\r][\n]
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - Receiving response: HTTP/1.1 302 Found
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << HTTP/1.1 302 Found
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Date: Wed, 09 Jan 2013 13:06:39 GMT
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Server: Apache/2.0.59 (Unix)
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Location: http://error.jugem.jp/
DEBUG 2013-01-09 22:07:26,563 (Thread-474) - << Content-Length: 285
DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Connection: close
DEBUG 2013-01-09 22:07:26,564 (Thread-474) - << Content-Type: text/html; charset=iso-8859-1
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <html><head>[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <title>302 Found</title>[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << </head><body>[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <h1>Found</h1>[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <p>The document has moved <a href="http://error.jugem.jp/">here</a>.</p>[\n]
DEBUG 2013-01-09 22:07:26,575 (Thread-474) - << <hr>[\n]
DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << <address>Apache/2.0.59 (Unix) Server at lucene.jugem.jp Port 80</address>[\n]
DEBUG 2013-01-09 22:07:26,576 (Thread-474) - << </body></html>[\n]
DEBUG 2013-01-09 22:07:26,618 (Thread-474) - Connection 0.0.0.0:56784<->210.172.160.170:80 closed



Hmm. It looks like the server is redirecting to the error location anyway.

Thanks,
Shinichiro Abe
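
To compare a log like this against another capture without reading it line by line, the request and response headers can be pulled out mechanically. The sketch below assumes each header line has the shape "... - >> Name: value" for sent headers and "... - << Name: value" for received ones, and that raw wire lines end in [\n]. Both are assumptions about the log layout, and parse_headers is a hypothetical helper, not part of ManifoldCF or HttpClient.

```python
import re

# Matches the direction marker and payload at the end of a DEBUG line.
LINE = re.compile(r" - (>>|<<) (.*)$")


def parse_headers(log_text):
    """Return (sent_headers, received_headers) as dicts of name -> value."""
    sent, received = {}, {}
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue  # not a header/wire line, e.g. "Sending request: ..."
        direction, payload = match.groups()
        # Skip raw wire lines; they repeat the headers with [\r][\n] markers.
        if payload.rstrip('"').endswith("[\\n]"):
            continue
        name, sep, value = payload.partition(": ")
        if not sep:
            continue  # request line or status line, not "Name: value"
        target = sent if direction == ">>" else received
        target[name] = value
    return sent, received
```

Feeding it the trunk log and a second capture gives two pairs of dictionaries that can be compared key by key.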



Re: Http status code 302

2013-01-09 Thread Karl Wright
There seem to be only two differences.  The Host header value is
different, and there is an Accept header in the one that works.
(Accept: */*)

I will experiment with curl this evening to see which of these is
causing the problem.  Or, if you don't want to wait, you can use curl
and explicitly set these headers to see which one causes it to fail.

Thanks,
Karl
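
The header comparison can also be done mechanically. The sketch below diffs the request headers from the curl capture and the trunk wire log in this thread. diff_headers is a hypothetical helper, and the header values are copied from the logs as posted, including the archive's elided address. Note that a mechanical diff also reports headers such as User-Agent, From and Connection, which the summary above does not count.

```python
def diff_headers(a, b):
    """Compare two header dicts, matching header names case-insensitively."""
    lower_a = {name.lower(): value for name, value in a.items()}
    lower_b = {name.lower(): value for name, value in b.items()}
    shared = lower_a.keys() & lower_b.keys()
    return {
        "only_in_a": sorted(lower_a.keys() - lower_b.keys()),
        "only_in_b": sorted(lower_b.keys() - lower_a.keys()),
        "changed": {name: (lower_a[name], lower_b[name])
                    for name in sorted(shared)
                    if lower_a[name] != lower_b[name]},
    }


# Request headers as they appear in the captures quoted in this thread.
curl_request = {
    "User-Agent": "curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 "
                  "OpenSSL/0.9.8r zlib/1.2.3",
    "Host": "lucene.jugem.jp",
    "Accept": "*/*",
}
trunk_request = {
    "User-Agent": "Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)",
    "From": "shinichiro.ab...@gmail.com",
    "Host": "lucene.jugem.jp:80",
    "Connection": "Keep-Alive",
}

differences = diff_headers(curl_request, trunk_request)
```

Here differences["changed"]["host"] holds the two Host values, and differences["only_in_a"] lists the headers that only curl sent.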


On Wed, Jan 9, 2013 at 9:56 AM, Shinichiro Abe
shinichiro.ab...@gmail.com wrote:
 Thank you for your guidance.
 I got a log from MCF 1.0.1.

 A) a log from curl

 curl -vvv http://lucene.jugem.jp/?eid=39
 * About to connect() to lucene.jugem.jp port 80 (#0)
 *   Trying 210.172.160.170... connected
 * Connected to lucene.jugem.jp (210.172.160.170) port 80 (#0)
 > GET /?eid=39 HTTP/1.1
 > User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
 > Host: lucene.jugem.jp
 > Accept: */*
 >
 < HTTP/1.1 200 OK
 < Date: Wed, 09 Jan 2013 13:23:15 GMT
 < Server: Apache/2.0.59 (Unix)
 < Vary: User-Agent,Host,Accept-Encoding
 < Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT
 < Accept-Ranges: bytes
 < Content-Length: 22594
 < Cache-Control: private
 < Pragma: no-cache
 < Connection: close
 < Content-Type: text/html


 B) a log from MCF 1.0.1

 DEBUG 2013-01-09 23:40:11,313 (Thread-472) - Open connection to 210.172.160.170:80
 DEBUG 2013-01-09 23:40:11,436 (Thread-472) - >> GET /?eid=39 HTTP/1.1[\r][\n]
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Using virtual host name: lucene.jugem.jp
 DEBUG 2013-01-09 23:40:11,437 (Thread-472) - Adding Host request header
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; shinichiro.ab...@gmail.com)[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> From: shinichiro.ab...@gmail.com[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> Host: lucene.jugem.jp[\r][\n]
 DEBUG 2013-01-09 23:40:11,447 (Thread-472) - >> [\r][\n]
 DEBUG 2013-01-09 23:40:11,629 (Thread-472) - << HTTP/1.1 200 OK[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << Date: Wed, 09 Jan 2013 14:39:24 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << Server: Apache/2.0.59 (Unix)[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << Vary: User-Agent,Host,Accept-Encoding[\r][\n]
 DEBUG 2013-01-09 23:40:11,632 (Thread-472) - << Last-Modified: Tue, 08 Jan 2013 07:58:33 GMT[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Accept-Ranges: bytes[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Content-Length: 22594[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Cache-Control: private[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Pragma: no-cache[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Connection: close[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << Content-Type: text/html[\r][\n]
 DEBUG 2013-01-09 23:40:11,633 (Thread-472) - << [\r][\n]
 DEBUG 2013-01-09 23:40:12,054 (Worker thread '0') - Should close connection in response to directive: close

 Is this enough to diagnose the problem?

 Thank you very much,
 Shinichiro




 On 2013/01/09, at 23:12, Karl Wright wrote:

 Wire debugging with MCF 1.0.1 requires different logging.ini
 parameters, because it uses commons-httpclient instead.  That's
 described here:

 http://hc.apache.org/httpclient-3.x/logging.html

 I will need a working comparison to diagnose what is happening, so
 please either get a log from curl, or better yet from MCF 1.0.1.

 Thanks!
 Karl



Re: Http status code 302

2013-01-09 Thread Karl Wright
I created CONNECTORS-604 to track this problem.

Karl


Http status code 302

2013-01-08 Thread Shinichiro Abe
Hi,

I'm using trunk code and crawling a web site with a seed of
http://lucene.jugem.jp/?eid=39 (Koji's blog; I don't obey robots.txt).
When I look at Simple History, it shows a 302 result code for the fetch
activity, and the document is not ingested.

When I used MCF 1.0.1 in the same situation, Simple History showed a 200
result code and MCF could ingest the document.

Why does trunk show a 302 status? Is it related to the httpclient upgrade?

Thanks in advance,
Shinichiro Abe