Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-07-24 Thread Andrzej Bialecki

Sami Siren (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12422929 ] 

Sami Siren commented on NUTCH-266:

--

I finally found the time to set up an environment with Cygwin and try this out. I can confirm that the hadoop.jar version provided with nutch gives these errors. 


I then tested nutch with the hadoop nightly jar and everything worked just 
fine.

Can someone try the hadoop nightly jar with nutch and see if it works for you? 
Nightly builds for hadoop are available from
http://people.apache.org/dist/lucene/hadoop/nightly/

  



Are you planning to update Hadoop to trunk? I'd rather be careful with 
that - I'm not sure if it's still compatible with Java 1.4, besides 
being unreleased/unstable ...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-07-24 Thread Sami Siren




Are you planning to update Hadoop to trunk? I'd rather be careful 
with that - I'm not sure if it's still compatible with Java 1.4, 
besides being unreleased/unstable ...


Not planning an upgrade, just want to know if it resolves the issues. 
We can then decide what's the best thing to do.


--
Sami Siren




[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-07-24 Thread Enrico Triolo (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422996 ] 

Enrico Triolo commented on NUTCH-322:
-

OK, I can see your point; nevertheless I think we should consider some 
potential problems that could arise from such modifications:

  1. When a redirect occurs, both the redirecting and the redirected pages 
should be indexed, independently of crawling depth - but I think this is what 
you meant from the beginning...
  2. How should the linkdb be updated? Or better, should the linkdb be updated 
at all? I mean, if page A has a link to page B, and page B redirects to C, 
should we add an incoming link to C from A?

 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8-dev
Reporter: Andrzej Bialecki 
 Fix For: 0.8-dev


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
 indicates a transient change, so we probably shouldn't mark the initial URL 
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
 permanent change, so the initial URL is no longer valid, i.e. it will always 
 result in redirects.
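
For illustration only, a minimal sketch of what the proposal above could look like. The helper class, the "_pst_" metadata key and the exact accessors are assumptions made for this example (STATUS_DB_RETRY is the status code proposed above), not the eventual patch:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.ProtocolStatus;

// Hypothetical helper sketching the proposal -- not actual Fetcher code.
// Assumes CrawlDatum.getMetaData() exposes a writable map and that
// ProtocolStatus is itself Writable.
public class ProtocolStatusSketch {

  public static void attach(CrawlDatum datum, ProtocolStatus pstatus) {
    if (pstatus == null) return;
    // keep the whole protocol-level status so it ends up in crawl_fetch
    datum.getMetaData().put(new Text("_pst_"), pstatus);
    // propagate a valid Last-Modified time into the CrawlDatum
    if (pstatus.getLastModified() > 0) {
      datum.setModifiedTime(pstatus.getLastModified());
    }
    // translate redirect codes as proposed above
    switch (pstatus.getCode()) {
      case ProtocolStatus.TEMP_MOVED:
        datum.setStatus(CrawlDatum.STATUS_DB_RETRY);  // transient move
        break;
      case ProtocolStatus.MOVED:
        datum.setStatus(CrawlDatum.STATUS_DB_GONE);   // permanent move
        break;
      default:
        break;
    }
  }
}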





[jira] Updated: (NUTCH-167) Observation of META NAME=ROBOTS CONTENT=NOARCHIVE directive

2006-07-24 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-167?page=all ]

Andrzej Bialecki  updated NUTCH-167:


Attachment: patch.txt

This patch implements support for Pragma: no-cache and Robots: noarchive.

Three cache policies are supported in this patch:

* CACHE_FORBIDDEN_CONTENT: for pages that specify noarchive, only summaries 
will be shown; cached content won't be displayed.

* CACHE_FORBIDDEN_ALL: for pages that specify noarchive, neither summaries 
nor cached content will be shown, although the pages will still appear in the 
list of matching results.

* CACHE_FORBIDDEN_NONE: even for sites that specify noarchive, Nutch will 
still ignore the directive and show both summaries and cached content. This is 
the current (broken?) behavior.

Since this patch is important for legal reasons, I'd like to commit it soon, 
before the 0.8 release.
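
For illustration, a rough sketch of how a front end might honor these policies. Only the three constant names come from the description above; everything else (class name, methods, where the policy value is stored) is assumed and is not part of the attached patch:

// Hypothetical illustration only -- not the attached patch.
public class CachePolicySketch {

  // the three policies described above
  public static final int CACHE_FORBIDDEN_NONE = 0;    // ignore noarchive (current behavior)
  public static final int CACHE_FORBIDDEN_CONTENT = 1; // summaries ok, no cached copy
  public static final int CACHE_FORBIDDEN_ALL = 2;     // neither summaries nor cached copy

  /** Whether the cached copy of a page may be displayed under the given policy. */
  public static boolean mayShowCachedContent(int policy) {
    return policy == CACHE_FORBIDDEN_NONE;
  }

  /** Whether a summary of a page may be displayed under the given policy. */
  public static boolean mayShowSummary(int policy) {
    return policy != CACHE_FORBIDDEN_ALL;
  }
}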

 Observation of META NAME=ROBOTS CONTENT=NOARCHIVE directive
 -

 Key: NUTCH-167
 URL: http://issues.apache.org/jira/browse/NUTCH-167
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, web gui
Affects Versions: 0.7.1
Reporter: Ed Whittaker
Priority: Critical
 Attachments: patch.txt


 Though not strictly a bug, this issue is potentially serious for users of 
 Nutch who deploy live systems and might be threatened with legal action for 
 caching copies of copyrighted material. The major search engines all observe 
 this directive (even though apparently it's not standard), so there's every 
 reason why Nutch should too.  





[jira] Closed: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-24 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]

Andrzej Bialecki  closed NUTCH-324.
---

Fix Version/s: 0.8-dev
   Resolution: Fixed

Patch applied, with minor whitespace diffs and doc. clarifications. Thank you!

 db.score.link.internal and db.score.link.external are ignored
 -

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev

 Attachments: InternalAndExternalLinkScoreFactor.patch


 Configuration properties db.score.link.external and db.score.link.internal  
 are ignored.
 In the case of e.g. message-board web pages, or pages that have large 
 navigation menus on each page, giving internal links a lower impact makes a 
 lot of sense for scoring.
 This is also a serious problem for web spam, since spammers can set up 
 just one domain with dynamically generated pages and thus heavily manipulate 
 the nutch scores. 
 So I also suggest that we give db.score.link.internal a default value of 
 something like 0.25. 
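
As an illustration of where the two properties might enter the scoring, here is a hedged sketch; the helper class is hypothetical, and only the property names come from this issue:

import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch -- not the attached patch.
public class LinkScoreFactorSketch {

  /** Score contribution of one outlink, scaled by the internal or external
   *  link factor depending on whether the link stays on the same host. */
  public static float linkContribution(Configuration conf, float score,
                                       boolean sameHost) {
    float internal = conf.getFloat("db.score.link.internal", 1.0f);
    float external = conf.getFloat("db.score.link.external", 1.0f);
    return score * (sameHost ? internal : external);
  }
}

With the suggested default of 0.25 for db.score.link.internal, a link from the same host would contribute a quarter of what an external link contributes.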





[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-07-24 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ] 

Andrzej Bialecki  commented on NUTCH-322:
-

Good questions ... ;)

ad 1: Google shows only the final page, and you can access it through both the 
original (starting) url and the final redirected url. You can't view the 
intermediate pages.

To be Google-compatible we should index only the final page, but put it under 
both URLs. This is relatively easy to implement in Fetcher and index-basic, by 
appropriately marking the starting and intermediate pages, skipping any 
non-final pages during indexing, and then adding the original url to the final 
url when indexing the final page.

Also, I think that if the redirect refresh time is large (e.g. larger than 20 
seconds) we should consider the pages to be separate and treat them independently.
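
A very rough sketch of the marking scheme described in ad 1; the metadata keys and helper methods below are invented purely for illustration and assume CrawlDatum.getMetaData() exposes a map-like writable:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical sketch -- nothing here is committed code.
public class RedirectMarkSketch {

  private static final Text NON_FINAL = new Text("_redirect_source_");
  private static final Text ORIGINAL_URL = new Text("_original_url_");

  /** Fetcher side: mark a starting or intermediate page of a redirect chain. */
  public static void markNonFinal(CrawlDatum datum) {
    datum.getMetaData().put(NON_FINAL, new Text(""));
  }

  /** Fetcher side: remember on the final page's datum where the chain started. */
  public static void setOriginalUrl(CrawlDatum datum, String originalUrl) {
    datum.getMetaData().put(ORIGINAL_URL, new Text(originalUrl));
  }

  /** Indexing side: skip any page marked as non-final. */
  public static boolean shouldIndex(CrawlDatum datum) {
    return !datum.getMetaData().containsKey(NON_FINAL);
  }

  /** Indexing side: the original url to index alongside the final url, if any. */
  public static String getOriginalUrl(CrawlDatum datum) {
    Text orig = (Text) datum.getMetaData().get(ORIGINAL_URL);
    return orig == null ? null : orig.toString();
  }
}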

ad 2: Google shows only inlinks going to the final url. However, the same 
inlinks can be obtained by using either the starting or the final url. OTOH MSN 
has separate inlinks in each case. I'm not sure yet how we should implement 
this...

 Fetcher discards ProtocolStatus, doesn't store redirected pages
 ---

 Key: NUTCH-322
 URL: http://issues.apache.org/jira/browse/NUTCH-322
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8-dev
Reporter: Andrzej Bialecki 
 Fix For: 0.8-dev


 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
 contains important information, such as protocol-level response code, 
 lastModified time, and possibly other messages.
 I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
 which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
 addition, if ProtocolStatus contains a valid lastModified time, that 
 CrawlDatum's modified time should also be set to this value.
 Additionally, Fetcher doesn't store redirected pages. Content of such pages 
 is silently discarded. When Fetcher translates from protocol-level status to 
 crawldb-level status it should probably store such pages with the following 
 translation of status codes:
 * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
 indicates a transient change, so we probably shouldn't mark the initial URL 
 as bad.
 * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
 permanent change, so the initial URL is no longer valid, i.e. it will always 
 result in redirects.





Re: segread vs. readseg

2006-07-24 Thread Stefan Groschupf

I like it!

On 24.07.2006, at 16:10, Andrzej Bialecki wrote:


Stefan Neufeind wrote:

Andrzej Bialecki wrote:

Stefan Groschupf wrote:

Hi developers,

we have commands like readdb and readlinkdb, but also segread. Wouldn't it 
be more consistent to name the command readseg instead of segread?

... just a thought.


Yes, it seems more consistent. However, if we change it then  
scripts people wrote would break. We could support both aliases  
in 0.8, and give a deprecation message.


What do others think?


Same feeling here. Agreed.


What about the following?

Index: bin/nutch
===================================================================
--- bin/nutch   (revision 424960)
+++ bin/nutch   (working copy)
@@ -40,7 +40,7 @@
   echo "  generate          generate new segments to fetch"
   echo "  fetch             fetch a segment's pages"
   echo "  parse             parse a segment's pages"
-  echo "  segread           read / dump segment data"
+  echo "  readseg           read / dump segment data"
   echo "  mergesegs         merge several segments, with optional filtering and slicing"
   echo "  updatedb          update crawl db from segments after fetching"
   echo "  invertlinks       create a linkdb from parsed segments"
@@ -158,7 +158,10 @@
   CLASS=org.apache.nutch.crawl.CrawlDbMerger
 elif [ "$COMMAND" = "readlinkdb" ] ; then
   CLASS=org.apache.nutch.crawl.LinkDbReader
+elif [ "$COMMAND" = "readseg" ] ; then
+  CLASS=org.apache.nutch.segment.SegmentReader
 elif [ "$COMMAND" = "segread" ] ; then
+  echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead."
   CLASS=org.apache.nutch.segment.SegmentReader
 elif [ "$COMMAND" = "mergesegs" ] ; then
   CLASS=org.apache.nutch.segment.SegmentMerger


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







Why was prune removed in 0.8?

2006-07-24 Thread Stefan Neufeind

Hi,

I might be bringing up old discussions (sorry if so) - but while discussing 
segread/readseg I wondered why prune is missing from bin/nutch. 
It still works when you give the full classname by hand, but could 
it be (re)added to bin/nutch as well?



Regards,
 Stefan


Re: Why was prune removed in 0.8?

2006-07-24 Thread Andrzej Bialecki

Stefan Neufeind wrote:

Hi,

I might be bringing up old discussions (sorry if so) - but while discussing 
segread/readseg I wondered why prune is missing from bin/nutch. 
It still works when you give the full classname by hand, but could 
it be (re)added to bin/nutch as well?


I think PruneIndexTool's command-line usage is not fully compatible with the 
current layout of indexes. I mean that in 0.8 indexes are not created inside 
each segment directory, and also a single output index consists of as many 
parts as there were reduce tasks ... so, some fiddling with paths and 
arguments will be necessary to fix it.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com