[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2009-02-03 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669947#action_12669947
 ] 

Andrzej Bialecki  commented on NUTCH-353:
-

Actually, the problem in the issue description is solved now. I'm closing this 
one, and the remaining functionality should be tracked as an enhancement in a 
separate issue.

 pages that serverside forwards will be refetched every time
 ---

 Key: NUTCH-353
 URL: https://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: doNotRefecthForwarderPagesV1.patch


 Pages that do a server-side forward are not written back into the crawlDb
 with a status change, and their nextFetchTime is not updated.
 As a result, the same page is refetched again and again: Nutch is not polite,
 and it refetches both the forwarding page and the target page in each segment
 iteration. This also affects scoring, since the forwarding page contributes
 its score to all outlinks.
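
The behaviour described above can be illustrated with a minimal Python sketch. It is not Nutch code: the CrawlEntry class, the status names and both functions are hypothetical, and it assumes that a fetch round only selects entries whose next fetch time has passed.

```python
from dataclasses import dataclass

# Hypothetical status names for this sketch; they are not Nutch constants.
DB_UNFETCHED = "unfetched"
DB_FETCHED = "fetched"
DB_REDIRECTED = "redirected"


@dataclass
class CrawlEntry:
    url: str
    status: str = DB_UNFETCHED
    next_fetch_time: float = 0.0  # seconds since the epoch


def update_after_fetch(entry, http_status, now, fetch_interval):
    """Record the fetch outcome so the entry is not due again immediately."""
    if 300 <= http_status < 400:
        # Without this branch a forwarding page keeps its old status and
        # fetch time, so it is selected again in every round.
        entry.status = DB_REDIRECTED
    elif http_status == 200:
        entry.status = DB_FETCHED
    else:
        return entry  # Other outcomes are outside the scope of this sketch.
    entry.next_fetch_time = now + fetch_interval
    return entry


def is_due(entry, now):
    """Select an entry for fetching only once its next fetch time is reached."""
    return entry.next_fetch_time <= now
```

With the redirect branch in place, a page that answers 302 is marked and pushed one fetch interval into the future instead of being selected in every segment.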

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260
]

Ken Krugler commented on NUTCH-353:
---

Another small note about this (see NUTCH-411 for a related but different
problem) ...

If a page (e.g. http://boutell.com) returns a meta refresh tag (e.g. <meta
http-equiv="refresh" content="0;url=http://www.boutell.com/">), and you also
wind up fetching the target page independently, then it looks like you can wind
up with both pages in the crawl results. One entry has a title like
boutell.com, while the other has the real page title. Or at least I've seen
this a few times in our crawl results.
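
To make the meta refresh case concrete, here is a small Python sketch that extracts the refresh target from a page using only the standard library. The class and function names are made up for illustration and this is not how Nutch's parser is implemented. It assumes the URL itself contains no semicolon.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Collects the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0;url=http://www.boutell.com/".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "url" and value:
                self.target = value.strip().strip("'\"")


def meta_refresh_target(html):
    """Return the meta refresh target URL, or None if the page has none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target
```

A crawler that knows the target can record the source page as a redirect rather than indexing it as a separate document.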

pages that serverside forwards will be refetched every time
---

Key: NUTCH-353
URL: https://issues.apache.org/jira/browse/NUTCH-353
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Blocker
Fix For: 0.9.0

Attachments: doNotRefecthForwarderPagesV1.patch

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Ken Krugler (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261
 ] 

Ken Krugler commented on NUTCH-353:
---

Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I 
just described in my previous comment. I don't think our latest public crawl 
was done with this patch.

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Andrzej Bialecki (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466280
]

Andrzej Bialecki commented on NUTCH-353:
-

I believe the patch in NUTCH-273 fixes a large part of the problem that you
describe - we record the fact that there was a redirect, and Indexer indexes
only the final page.

The other parts though (correct treatment of inlink information, and selection
of representative pages for chains of redirects) are not addressed yet.
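
As a rough illustration of what handling chains of redirects involves, the Python sketch below walks a recorded redirect map and guards against loops. The function name and the dictionary representation are assumptions for the example, not part of the NUTCH-273 patch.

```python
def resolve_redirect_chain(url, redirects, max_hops=10):
    """Follow recorded redirects from url; return the list of URLs visited.

    redirects maps a source URL to its recorded target. The walk stops at
    a URL with no recorded redirect, at a loop, or after max_hops hops.
    """
    chain = [url]
    seen = {url}
    while len(chain) <= max_hops:
        target = redirects.get(chain[-1])
        if target is None or target in seen:
            break
        chain.append(target)
        seen.add(target)
    return chain
```

Choosing the representative page is then a policy decision over the returned chain, for example its first or its last element.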

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Doug Cook (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
]

Doug Cook commented on NUTCH-353:
-

I have a local fix for this problem (partly Paul Gauthier's work, partly mine)
that I have been testing for some time. It's a little bit of a hack, but it's
much better than just indexing the redirect target (which is the wrong behavior
in many instances; see comments earlier).

The fix is to index both instances of the page, both the source and the target,
making sure that the outlinks from the target page are only assigned to the
target page. This way, in the (frequent) case that the redirect *source* is the
canonical version of the page, with more anchor text, it will show up for
searches. The fix seems to work pretty well, and solves a significant
percentage of Nutch's missing home pages problem without using much extra
space in the index. If it sounds useful to anyone, I'm happy to contribute it
back.

Doug

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466285
 ] 

Chris A. Mattmann commented on NUTCH-353:
-

Doug,

  Let's see what you got. I'd be happy to take a look at it. 

Cheers,
  Chris


[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-05 Thread Uros Gruber (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12440221 ] 

Uros Gruber commented on NUTCH-353:
---

I don't think there is a 100% solution, mostly because not all sites respect 
the standards. For example, www.ibm.com uses the 302 status code, which the RFC 
defines as follows: "The requested resource resides temporarily under a 
different URI. Since the redirection might be altered on occasion, the client 
SHOULD continue to use the Request-URI for future requests. This response is 
only cacheable if indicated by a Cache-Control or Expires header field." This 
case is clear: we should use the original URL.

But then there is also the permanent redirect, where the new URL SHOULD replace 
the old one, and all links pointing to the old URL should be updated to the new 
one.

I have also seen some examples of wrong redirections. One of them was my fault 
too. I use an Alias definition in the Apache server to accept connections 
without the www subdomain, and on the page I left a link to the main page 
pointing to index.php instead of just /. After a while, my domain.si/index.php 
became more important than www.domain.si (both point to the same site).

So as I see it, this job is not simple at all. Maybe we need a schema or some 
sort of flow diagram to indicate what to do in each situation.

I hope my notes help a bit, because at the moment we really have a lot of 
unwanted URLs in our index.
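
The status-code rule described above can be written down as a small Python sketch. It follows the RFC definitions literally and therefore inherits the weakness discussed in this thread, namely that sites often send the wrong code. The function name is hypothetical.

```python
def url_to_keep(source_url, target_url, http_status):
    """Pick the URL to keep for a redirect, based only on the status code."""
    if http_status == 301:
        # Moved Permanently: the target replaces the source.
        return target_url
    if http_status in (302, 307):
        # Found / Temporary Redirect: keep using the originally requested URL.
        return source_url
    raise ValueError(f"not a redirect status handled here: {http_status}")
```

For the www.ibm.com example, a 302 keeps the www.ibm.com URL, which matches what the RFC text asks of clients.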


Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Uroš Gruber

Ken Krugler (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]

Ken Krugler commented on NUTCH-353:

---

+1 that the redirect target is not always the real URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html =>
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most
(all?) developerWorks pages; they redirect to www-128.ibm.com/whatever, but IBM would
love for the URL everybody sees to still be www.ibm.com/whatever.

If you check the status code of the original URL, you get 302 Found. By
definition:

10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since
the redirection might be altered on occasion, the client SHOULD continue
to use the Request-URI for future requests. This response is only
cacheable if indicated by a Cache-Control or Expires header field.

In this case there is no need to replace the original URL with the
redirect target.

I know that a lot of sites use permanent redirects in such cases. But I
don't see any proper solution that handles both.

regards

Uros

Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cook

In this case, the site uses the right kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of the
page). And then there's the problem of what to do with meta refresh tags,
which don't have a permanent vs. temporary indication.

An alternative is to use the link structure - the page with the most
external links is likely the canonical version of the page. (Although with
permanent redirects, there is a time lag as sites linking to the page stop
using the old name and start using the new name). This won't work well in
small crawls, though, given the relative paucity of links.

In any case, if we have an inexpensive way of aliasing the two to be the
same, we won't lose any anchor text, and we're effectively not throwing
out either URL, so it matters less which one we choose.
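
A minimal Python sketch of the link-structure alternative, assuming inlinks are available as a plain dictionary from URL to the list of linking URLs. The function name is made up, and "external" is taken here to mean "from a different host", which is only one possible reading.

```python
from urllib.parse import urlsplit


def pick_canonical(candidates, inlinks):
    """Return the candidate URL with the most inlinks from other hosts.

    inlinks maps a URL to the list of URLs linking to it. Ties go to the
    candidate listed first.
    """
    def external_count(url):
        host = urlsplit(url).hostname
        sources = inlinks.get(url, [])
        return sum(1 for source in sources if urlsplit(source).hostname != host)

    return max(candidates, key=external_count)
```

In a small crawl most candidates will have zero external inlinks, so the result falls back to the order of the candidate list, which is the limitation mentioned above.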

-Doug

--
View this message in context:
http://www.nabble.com/-jira--Created%3A-%28NUTCH-353%29-pages-that-serverside-forwards-will-be-refetched-every-time-tf2125422.html#a6622168
Sent from the Nutch - Dev mailing list archive at Nabble.com.

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cutting (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] 

Doug Cutting commented on NUTCH-353:


It's worth noting that Google, Yahoo! and Microsoft's searches all return lots 
of links to www-XXX.ibm.com.  Just some evidence that this may not be an easy 
problem to solve.

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Doug Cook (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] 

Doug Cook commented on NUTCH-353:
-

This is definitely a complex issue. It is also high priority -- issues with 
redirects and duplicates, which URL is chosen, and what happens to the anchor 
text for the pages involved are causing significant relevance issues.
A few observations:

(1) A redirect *target* is not always the canonical version of a URL. For 
example, it is very common for root-level pages to redirect to an internal home 
page (some 30% of the root pages in my index do so). However, the root pages 
have all the anchor text and are truly the canonical, permanent version of the 
page; the internal redirect target is just the temporary homepage, and could 
change at any time depending on the site implementation. Here are some examples:
http://www.landwirtschaft-bw.info/
http://www.dlr-rnh.rlp.de/
http://www.niederoesterreich.at/
Because of the current policy of discarding the redirect source, I lose 30% 
of the home pages in my index, which makes my relevance very poor for 
navigational queries.

In this case, we would likely want to mark the internal redirect target as an 
alias as Andrzej suggests, and automatically transfer any link information to 
the root page.

(2) There may be other cases where we want to alias two pages, either to avoid 
recrawling them, or to merge anchor text. Suppose we crawl both 
 http://www.x.com/
and
 http://www.x.com/index.html
and these are the same document.

Right now we will always crawl both of these, and the dedup algorithm will pick 
one (sadly often the /index.html version due to strange score anomalies), and 
throw out the anchor text for the other. While we can't safely normalize these 
two URLs to be the same in advance of seeing the content, once we see that the 
signatures are the same, we can, and should, merge them so that the index.html 
version is marked as an alias of the / version, and future crawls simply skip 
crawling the /index.html version and transfer its link information to the / 
page.
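
The signature-based aliasing in (2) could look roughly like the Python sketch below. It is only an illustration: using MD5 over the raw content as the signature and preferring the shortest path as the representative are both assumptions, and the function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def signature(content):
    """Content signature; MD5 of the raw bytes is an assumption here."""
    return hashlib.md5(content).hexdigest()


def find_aliases(pages):
    """Map each duplicate URL to the representative URL with the same content.

    pages maps URL -> content bytes. The representative is the URL with the
    shortest path, so "/" wins over "/index.html".
    """
    by_signature = {}
    for url, content in pages.items():
        by_signature.setdefault(signature(content), []).append(url)

    aliases = {}
    for urls in by_signature.values():
        representative = min(urls, key=lambda u: (len(urlsplit(u).path), u))
        for url in urls:
            if url != representative:
                aliases[url] = representative
    return aliases
```

Once /index.html is recorded as an alias of /, later crawls can skip it and credit its link information to the / page.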

This problem, like the first one, is causing me to lose root-level URLs along 
with their anchor text, further affecting relevance for navigational queries.

In short, I agree with Andrzej that we need a way to mark a URL as an alias of 
another, to avoid recrawl, and to merge link information. We need to be 
careful, however, of *which* URL we pick. It is not always the redirect target 
that should win. And some of our current concept of duplicates should also be 
subsumed under the new notion of alias.

I'm happy to help out in any way with a fix. I'm just looking at hacking 
together something in my own environment because the problems are affecting me 
so severely, but as I'm new-ish to Nutch, what I come up with might not be as 
elegant or flexible as what others might envision...

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Ken Krugler (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] 

Ken Krugler commented on NUTCH-353:
---

+1 that the redirect target is not always the real URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => 
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds 
true for most (all?) developerWorks pages; they redirect to 
www-128.ibm.com/whatever, but IBM would love for the URL everybody sees to 
still be www.ibm.com/whatever.

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-09-23 Thread Andrzej Bialecki (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12437131 ] 

Andrzej Bialecki  commented on NUTCH-353:
-

I think this issue requires more discussion, especially how it affects the 
linkdb.

Let's say that page A links to B, but B redirects to C. Issues to discuss:

* should we mark B as gone? we could do so, to prevent refetching. We should 
also store the redirect url in CrawlDatum.metaData. This redirect url may 
change in the future to some other value, but since no page is ever truly gone 
(we should retry it at some point in the future) we should be able to adjust 
the redirect info.

* for all practical purposes, C now becomes a replacement for B. Should we 
transfer all inlink information (anchor text, incoming urls, and score 
contributions) to C? From the implementation point of view this would require 
changes to linkdb format, to be able to create aliases that automatically 
transfer all inlink information to C even though it's inserted under B ..
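
The alias idea in the second point can be sketched in Python as follows. This is not the linkdb format: the dictionary layout, the (source, anchor) pairs and the function name are assumptions made for the example.

```python
def merge_inlinks(linkdb, aliases):
    """Return a new link map with inlinks of aliased URLs moved to their target.

    linkdb maps URL -> list of (source_url, anchor_text) pairs.
    aliases maps an aliased URL (B) to its replacement (C).
    """
    merged = {}
    for url, inlinks in linkdb.items():
        owner = url
        seen = {owner}
        # Follow alias chains, stopping if a loop is detected.
        while owner in aliases and aliases[owner] not in seen:
            owner = aliases[owner]
            seen.add(owner)
        merged.setdefault(owner, []).extend(inlinks)
    return merged
```

With A linking to B and B aliased to C, the anchor text from A ends up under C even though it was inserted under B. If the redirect changes later, only the alias map needs to be updated.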

 pages that serverside forwards will be refetched every time
 ---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0, 0.8.1
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1

 Attachments: doNotRefecthForwarderPagesV1.patch


