Re: Crawling and redirects to the same URL

Elisabeth Adler Mon, 26 Sep 2011 03:47:35 -0700

Hi,

Can anyone help me with the following problem: when a page redirects to itself, Nutch does not index the page served after the redirect, only the one served before it?

Any pointers appreciated!
Elisabeth


On 21.09.2011 17:27, Elisabeth Adler wrote:

Hi,
I have narrowed the problem down a bit and have a simple example to replicate the issue: I have a PHP page which checks whether a session variable is set. If not, it sets it and redirects to itself; if the session variable is set, some text is displayed. The code is below [1]. I would expect the text displayed after the redirect to be indexed.
When Nutch crawls the page, I can see that the URL is parsed and has a status of db_redir_temp (see the output of stats [2]), but it is not indexed. I can also see in the segments that it fetches the original page, but does not index it [3].
Am I missing any configuration here or how can I get this working?
Any pointers really appreciated!
Thanks,
Elisabeth


[1] **** PHP code to replicate the issue:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Original Page</title>
</head>
<body>
<?php
session_start();
define('SITE_ROOT', $_SERVER['HTTP_HOST']);
function userHasAccess(){
    $hasAccess = false;
    $sessionAccess = $_SESSION['hasAccess'];
    if(!empty($sessionAccess)){
        $hasAccess = true;
        session_destroy();
    }
    return $hasAccess;
}

if (userHasAccess()){
    echo "Long text to index......";
}
else {
    echo "No access - redirect happening here";
    $_SESSION['hasAccess'] = "true";
    header("Location: http://" . SITE_ROOT . "/redirect-test/");
}
?>
</body>
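
The server-side logic of the page in [1] can be modelled outside PHP to see what a client has to do to reach the indexable text. The following is a minimal sketch under stated assumptions, not Nutch code: `SelfRedirectingPage`, `fetch`, the `keep_cookies` flag and the `sess1`-style session ids are all hypothetical names introduced for illustration. It assumes the session is tracked only through the session cookie, as the Set-Cookie=PHPSESSID header in [3] suggests.

```python
from typing import Dict, Optional, Tuple


class SelfRedirectingPage:
    """Toy model of the PHP page: no session flag -> set it and 302 to itself."""

    def __init__(self) -> None:
        self.sessions: Dict[str, bool] = {}  # session id -> hasAccess flag
        self._next_id = 0

    def handle(self, cookie: Optional[str]) -> Tuple[int, Optional[str], str]:
        """Return (status, session cookie to send back or None, body)."""
        if cookie is not None and self.sessions.get(cookie):
            # userHasAccess() is true: show the text and destroy the session.
            del self.sessions[cookie]
            return 200, None, "Long text to index......"
        # No usable session: start one if needed, set the flag, redirect.
        session_id = cookie
        if session_id is None:
            self._next_id += 1
            session_id = f"sess{self._next_id}"  # hypothetical id format
        self.sessions[session_id] = True
        return 302, session_id, "No access - redirect happening here"


def fetch(page: SelfRedirectingPage, keep_cookies: bool,
          max_redirects: int = 10) -> Tuple[int, str, int]:
    """Follow self-redirects; return (final status, body, requests made)."""
    cookie: Optional[str] = None
    status, body, requests_made = 0, "", 0
    for _ in range(max_redirects + 1):
        status, set_cookie, body = page.handle(cookie)
        requests_made += 1
        if status != 302:
            break
        if keep_cookies:
            cookie = set_cookie  # send the session cookie on the next request
    return status, body, requests_made


print(fetch(SelfRedirectingPage(), keep_cookies=True))
print(fetch(SelfRedirectingPage(), keep_cookies=False))
```

In this model a client that returns the session cookie gets the text on its second request, while a client that drops the cookie receives a 302 on every request until it gives up. Whether the Nutch setup in this thread sends cookies back is not established here, so treat this only as a way to reason about the test page.
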

[2]  **** output of "nutch readdb crawl/crawldb/ -stats"
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     1
retry 0:        1
min score:      1.0
avg score:      1.0
max score:      1.0
status 4 (db_redir_temp):       1
CrawlDb statistics: done

[3]  **** output of "nutch readseg ..."
Recno:: 0
URL:: http://10.64.58.83:8585/redirect-test/

CrawlDatum::
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Wed Sep 21 17:21:02 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 1 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1316618464674_pst_: temp_moved(13), lastModified=0:http://10.64.58.83:8585/redirect-test/
Content::
Version: -1
url: http://10.64.58.83:8585/redirect-test/
base: http://10.64.58.83:8585/redirect-test/
contentType: application/xhtml+xml
metadata: Content-Length=404 Expires=Thu, 19 Nov 1981 08:52:00 GMT Location=http://10.64.58.83:8585/redirect-test/ _fst_=35 Set-Cookie=PHPSESSID=08c7jg219fm9kve6d7sn0p1ku0; path=/ nutch.segment.name=20110921172106 Connection=close Server=Apache/2.2.19 (Win32) PHP/5.3.6 X-Powered-By=PHP/5.3.6 Cache-Control=no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma=no-cache Date=Wed, 21 Sep 2011 15:21:04 GMT nutch.crawl.score=1.0 Content-Type=text/html
Content:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Original Page</title>
</head>
<body>
Notice: Undefined index: hasAccess in C:\Program Files (x86)\Apache2.2\htdocs\redirect-test\index.php on line 13
No access - redirect happening here
</body>
CrawlDatum::
Version: 7
Status: 35 (fetch_redir_temp)
Fetch time: Wed Sep 21 17:21:08 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 1 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1316618464674_pst_: temp_moved(13), lastModified=0:http://10.64.58.83:8585/redirect-test/

On 21.09.2011 10:36, Elisabeth Adler wrote:
Hi,
Thanks for all the links. I have had a quick look through them already,
but was stuck with work, so I will get around to testing it this
afternoon. I will let you know how it goes.
Best,
Elisabeth

On 21.09.2011 09:32, lewis john mcgibbney wrote:
Hi Elisabeth,

Did you sort your redirect problem?

On Sun, Sep 18, 2011 at 3:46 PM, Nutch User - 1 <[email protected]> wrote:
On 15.09.2011 22:25, Elisabeth Adler wrote:
Hi,

I am having issues crawling an intranet site with an (imho) odd redirect
mechanism. One part of the intranet website requires authentication,
which Nutch can bypass by sending a special http.agent.name. This works
fine.

The issue I am facing is that, after successful authentication, the
server sends a redirect (302) to the same URL. Nutch is not following
the redirect. My guess is that Nutch omits the page because it has been
visited before...

Any pointers on how to overcome this and index the page after the
redirect has happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth
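
The guess above (that the redirect target is dropped because the URL has already been visited) can be illustrated with a toy crawl loop. This is a sketch of the hypothesis only and is not taken from Nutch's source: `crawl`, `redirects` and the single-letter URLs are hypothetical, and the loop assumes a frontier that tracks each URL exactly once.

```python
from typing import Dict, List


def crawl(start: str, redirects: Dict[str, str]) -> List[str]:
    """Fetch each URL once; a redirect target that was already seen is skipped."""
    seen = {start}
    queue = [start]
    fetched: List[str] = []
    while queue:
        url = queue.pop(0)
        fetched.append(url)
        target = redirects.get(url)
        # A self-redirect points at a URL that is already in `seen`,
        # so under this policy it is never queued a second time.
        if target is not None and target not in seen:
            seen.add(target)
            queue.append(target)
    return fetched


print(crawl("a", {"a": "b"}))  # redirect to a different URL
print(crawl("a", {"a": "a"}))  # redirect to the same URL
```

With a redirect from `a` to `b`, both URLs are fetched; with a redirect from `a` to itself, `a` is fetched once and the post-redirect response is never requested. If a crawler deduplicates this way, a page that only serves its content after a self-redirect would stay unindexed, which matches the symptom described.
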


I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

These could give some explanation:

http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
https://issues.apache.org/jira/browse/NUTCH-1044
