Re: [Nutch-dev] Implementing geography-by-IP filtering?

2005-01-17 Thread Matt Kangas
On Mon, 17 Jan 2005 16:17:46 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Or we could provide a separate hook to call some other type of filter, > let's say ExtendedContentFilter, after the Content has been parsed: > > Content filter(Content content, Parse parse); > > This ap

[Nutch-dev] Adding title and site to scoring

2005-01-17 Thread Andrzej Bialecki
Hi, After analyzing some of the search results from my ~10mln pages index, I noticed a few strange results. It seems to me that: * the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?). * title is not indexed nor tokenized

Re: [Nutch-dev] refetching all pages to update anchor text?

2005-01-17 Thread Matt Kangas
Doug, I knew there had to be a bug on my end. ;-) Your suggestion was exactly right. So after that fix and slight fiddling with the sample htdocs files, I see the following (via "readdb -dumplinks"): index.html: 4 inlinks eggs1.html: 3 inlinks eggs(2-4).html: 1 inlink This results in the followi

Re: [Nutch-dev] ndfs multiple clients writing

2005-01-17 Thread Stefan Groschupf
Doug, the crawler was just an example; however, you are right, and I agree with the KISS development concept. I don't think it makes much sense to have a 'grid' with 100 boxes where all boxes crawl or all boxes have one segment. Maybe the idea in your blog with 'Dynamization and Lucene' can play an inte

Re: [Nutch-dev] Implementing geography-by-IP filtering?

2005-01-17 Thread Matt Kangas
Andrzej, comments are inline... On Mon, 17 Jan 2005 13:33:37 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > While the idea of ContentFilter is very useful, I have some doubts > regarding the use of URLFilter during fetching. If you don't want to > fetch some urls, then you should not put them

Re: [Nutch-dev] refetching all pages to update anchor text?

2005-01-17 Thread Doug Cutting
Matt Kangas wrote: Here is the output from "nutch readdb -dumplinks". This is clearly a truncated link topology for these pages. Is this the result of a bug in my script? Or is this something the tool should clean up? It looks like db.ignore.internal.links is true, so that all but the first inte

Re: [Nutch-dev] ndfs multiple clients writing

2005-01-17 Thread Doug Cutting
Stefan Groschupf wrote: The Google File System supports multiple clients writing to one file (or maybe chunk). If we port Nutch functionality to map and reduce, this would be very useful as well, for example a set of crawlers writing to one 'segment file'. Does the actual implementation o

[Nutch-dev] [ nutch-Bugs-1104040 ] Missing spaces in summary (easy to fix)

2005-01-17 Thread SourceForge.net
Bugs item #1104040, was opened at 2005-01-17 17:31 Message generated for change (Comment added) made by msashnikov You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1104040&group_id=59548 Category: searcher Group: None Status: Open Resolution: None Priority

[Nutch-dev] [ nutch-Bugs-1104040 ] Missing spaces in summary (easy to fix)

2005-01-17 Thread SourceForge.net
Bugs item #1104040, was opened at 2005-01-17 17:31 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1104040&group_id=59548 Category: searcher Group: None Status: Open Resolution:

RE: [Nutch-dev] searching problem illustrated

2005-01-17 Thread Chirag Chaman
Christophe: I think the issue here is as follows: The score of these 2 pages is higher. Let's look at it more closely - www.cetic.com -- I bet it does not have a lot of inlinks to it and definitely not as many outlinks as the forum pages. Also if the word cetic appears more on the forum page

Re: [Nutch-dev] searching problem illustrated

2005-01-17 Thread Andrzej Bialecki
Christophe Noel wrote: I would just like to show you a little problem with Nutch and get your comments about it: After crawling a set of domains (cetic.be, ...) I submit: Search: cetic Nutch gives back some results: Could you please show us what you get when you view the "explain" page of th

Re: [Nutch-dev] Fetchlist generation taking hours...

2005-01-17 Thread Xin-Yi Liu
what kind of hardware do you have? how much disk space does your webdb take up? the webdb is reconstructed every time you make updates to it (usually during fetchlist generation and updatedb), so it's not unusual for it to take a while. it's more i/o intensive than memory or cpu. --- Andre

[Nutch-dev] searching problem illustrated

2005-01-17 Thread Christophe Noel
I would just like to show you a little problem with Nutch and get your comments about it: After crawling a set of domains (cetic.be, ...) I submit: Search: cetic Nutch gives back some results: === Hits 1-3 (out of about 1,342 total matching pages): WWW.CETIC.BE :: Index ..

Re: [Nutch-dev] Implementing geography-by-IP filtering?

2005-01-17 Thread Andrzej Bialecki
Chirag Chaman wrote: Andrzej: On the same note, let me list examples of certain analysis that should be helpful and I'd appreciate it if you can point where is an appropriate place to add the code. Right now these sit external for us, but it would be nice to integrate them to Nutch. A general note:

RE: [Nutch-dev] Implementing geography-by-IP filtering?

2005-01-17 Thread Chirag Chaman
Andrzej: On the same note, let me list examples of certain analysis that should be helpful and I'd appreciate it if you can point where is an appropriate place to add the code. Right now these sit external for us, but it would be nice to integrate them to Nutch. 1. Content - total size < X bytes

Re: [Nutch-dev] Implementing geography-by-IP filtering?

2005-01-17 Thread Andrzej Bialecki
Matt Kangas wrote: Stefan and/or Doug, Here's a followup to my Jan 3 diff. This time I added two hooks to the Fetcher, for URLFilter and also for a new interface, ContentFilter. These allow one to: - filter out URLs prior to fetching, and - filter out fetched content prior to writing to a segment W

[Nutch-dev] Stemming or PrefixQuery

2005-01-17 Thread Steve Follmer
I believe the stock Nutch does not employ the stemming or the prefixquery from Lucene. Is this because such queries are too expensive? Or is it that they are just not useful, that 99% of Nutch users just don't need them? Lucene has some stemming modules for English and German I see. Does English

[Nutch-dev] Re: How to change the default Tokenizer?

2005-01-17 Thread ansi
After doing some more research, I found that Nutch uses its own Analyzer, NutchDocumentAnalyzer. Can I change its tokenizing behavior by developing a plugin? Ansi On Fri, 14 Jan 2005 16:35:04 +0800, ansi <[EMAIL PROTECTED]> wrote: > hi, all > > I found Nutch uses StandardAnalyzer to index Chinese. > I'd like t

Re: [Nutch-dev] Excel file plugin

2005-01-17 Thread Stephan Lagraulet
Hi, I worked on Werner Ramekers' Excel plugin, and Stephan Strittmatter did some work as well, but I'm not sure whether it was on Excel or Powerpoint. I've corrected Werner's code but also adapted it to my own framework for the project I'm currently working on. There might be a few changes to make before in