On Mon, 17 Jan 2005 16:17:46 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Or we could provide a separate hook to call some other type of filter,
> let's say ExtendedContentFilter, after the Content has been parsed:
>
> Content filter(Content content, Parse parse);
>
> This ap
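For what it's worth, the proposed signature could be sketched like this. Everything below is hypothetical: the Content and Parse shapes are simplified stand-ins for Nutch's real classes, and the example filter's policy is made up for illustration.

```java
// Simplified stand-ins for Nutch's Content and Parse types.
class Content {
    final String url;
    final byte[] data;
    Content(String url, byte[] data) { this.url = url; this.data = data; }
}

class Parse {
    final String text;
    Parse(String text) { this.text = text; }
}

// The proposed post-parse hook: Content filter(Content content, Parse parse);
interface ExtendedContentFilter {
    // Return the (possibly modified) content, or null to drop it entirely.
    Content filter(Content content, Parse parse);
}

// Example filter: drop pages whose parsed text is suspiciously short.
class MinTextFilter implements ExtendedContentFilter {
    private final int minChars;
    MinTextFilter(int minChars) { this.minChars = minChars; }

    public Content filter(Content content, Parse parse) {
        return parse.text.length() < minChars ? null : content;
    }
}
```

Because the hook runs after parsing, a filter like this can make decisions on the extracted text rather than on raw bytes.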
Hi,
After analyzing some of the search results from my ~10M-page index, I
noticed a few strange results. It seems to me that:
* DefaultSimilarity seems to excessively favor short "content" (high tf)
and anchor texts (too high a boost value?).
* the title is neither indexed nor tokenized
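On the length-bias point: Lucene's DefaultSimilarity computes a field norm of 1/sqrt(numTerms), so very short fields receive a large multiplier. A tiny plain-Java illustration (not the actual Lucene class, just the same formula):

```java
public class LengthNormDemo {
    // Mirrors the lengthNorm formula used by Lucene's DefaultSimilarity:
    // 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A 4-term anchor text gets 10x the norm of a 400-term body field,
        // which is one way tiny fields can dominate the final score.
        System.out.println(lengthNorm(4));    // 0.5
        System.out.println(lengthNorm(400));  // 0.05
    }
}
```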
Doug, I knew there had to be a bug on my end. ;-) Your suggestion was
exactly right. So after that fix and slight fiddling with the sample
htdocs files, I see the following (via "readdb -dumplinks"):
index.html: 4 inlinks
eggs1.html: 3 inlinks
eggs(2-4).html: 1 inlink
This results in the followi
Doug,
the crawler was just an example; however, you are right and I agree with
the KISS development concept.
I don't think it makes much sense to have a 'grid' with 100 boxes where
all boxes crawl or all boxes have one segment.
Maybe the idea in your blog about 'Dynamization and Lucene' can play an
inte
Andrzej, comments are inline...
On Mon, 17 Jan 2005 13:33:37 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> While the idea of ContentFilter is very useful, I have some doubts
> regarding the use of URLFilter during fetching. If you don't want to
> fetch some urls, then you should not put them
Matt Kangas wrote:
Here is the output from "nutch readdb -dumplinks". This is clearly a
truncated link topology for these pages. Is this the result of a bug
in my script? Or is this something the tool should clean up?
It looks like db.ignore.internal.links is true, so that all but the
first inte
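If internal links are wanted after all, the property named above can be overridden in the crawl configuration. A sketch, assuming it goes in a nutch-site.xml override (the default value and exact file may vary by Nutch version):

```xml
<!-- nutch-site.xml override: keep links between pages on the same site -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```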
Stefan Groschupf wrote:
The Google File System supports multiple clients writing to one file (or
maybe chunk).
If we port Nutch functionality to map and reduce, this would be
very useful as well.
For example, a set of crawlers writing to one 'segment file'.
Does the actual implementation o
Bugs item #1104040, was opened at 2005-01-17 17:31
Message generated for change (Comment added) made by msashnikov
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1104040&group_id=59548
Category: searcher
Group: None
Status: Open
Resolution: None
Priority
Christophe:
I think the issue here is as follows:
The score of these 2 pages is higher. Let's look at it more
closely:
- www.cetic.com -- I bet it
does not have a lot of inlinks and definitely not as many outlinks as the
forum pages. Also, if the word cetic appears more on the forum page
Christophe Noel wrote:
I would just like to show you a little problem with Nutch and get your
comments about it :
After crawling a set of domains (cetic.be, ...)
I submit:
Search : cetic
Nutch gives back some results:
Could you please show us what you get when you view the "explain" page
of th
What kind of hardware do you have? How much disk
space does your webdb take up?
The webdb is reconstructed every time you make updates
to it (usually during fetchlist generation and
updatedb), so it's not unusual for it to take a while.
It's more I/O intensive than memory or CPU.
--- Andre
I would just like to show you a little problem with Nutch and get your
comments about it :
After crawling a set of domains (cetic.be, ...)
I submit:
Search : cetic
Nutch gives back some results:
===
Hits 1-3 (out of about 1,342 total matching pages):
WWW.CETIC.BE :: Index
..
Chirag Chaman wrote:
Andrzej:
On the same note, let me list examples of certain analyses that would be
helpful, and I'd appreciate it if you could point out an appropriate place
to add the code. Right now these sit outside Nutch for us, but it would be
nice to integrate them into Nutch.
Andrzej:
On the same note, let me list examples of certain analyses that would be
helpful, and I'd appreciate it if you could point out an appropriate place
to add the code. Right now these sit outside Nutch for us, but it would be
nice to integrate them into Nutch.
1. Content - total size < X bytes
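The first item above, a minimum-size check on fetched content, might look like this as a standalone predicate. The class name and threshold are made up for illustration; the original message leaves X unspecified.

```java
// Hypothetical pre-index check matching "Content - total size < X bytes".
class SizeCheck {
    private final int minBytes;
    SizeCheck(int minBytes) { this.minBytes = minBytes; }

    // True when the fetched content is too small to be worth indexing.
    boolean tooSmall(byte[] content) {
        return content.length < minBytes;
    }
}
```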
Matt Kangas wrote:
Stefan and/or Doug,
Here's a followup to my Jan 3 diff. This time I added two hooks to the
Fetcher, for URLFilter and also for a new interface, ContentFilter.
These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment
W
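The first of the two hooks above could plausibly follow the same convention as Nutch's URLFilter idea: given a URL, return it to keep it or null to drop it. A hypothetical regex-based sketch (the class name and pattern are illustrative, not the actual patch):

```java
import java.util.regex.Pattern;

// Hypothetical pre-fetch URL filter: returning null means "do not fetch".
class RegexSkipFilter {
    private final Pattern skip;
    RegexSkipFilter(String skipRegex) { this.skip = Pattern.compile(skipRegex); }

    // Keep the URL unless it matches the skip pattern.
    String filter(String url) {
        return skip.matcher(url).find() ? null : url;
    }
}
```

Dropping URLs here, before any network I/O, is what makes a pre-fetch hook cheaper than filtering the fetched content afterwards.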
I believe stock Nutch does not employ stemming or the PrefixQuery from
Lucene. Is this because such queries are too expensive?
Or is it that they are just not useful, that 99% of Nutch users just
don't need them?
Lucene has some stemming modules for English and German, I see. Does
English
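On the expense question: conceptually, a prefix query expands into every index term that starts with the prefix, and its cost grows with the number of matching terms. A plain-Java sketch of that expansion step (not the Lucene implementation, just the idea):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixMatchDemo {
    // Conceptually what a prefix query does: expand the prefix into every
    // index term starting with it, then match documents holding any of
    // those terms. A short prefix over a large term dictionary can expand
    // into thousands of terms, which is one reason engines disable it.
    static List<String> expandPrefix(List<String> indexTerms, String prefix) {
        return indexTerms.stream()
                .filter(t -> t.startsWith(prefix))
                .collect(Collectors.toList());
    }
}
```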
After doing some more research, I found that Nutch uses its own
Analyzer--NutchDocumentAnalyzer. Can I change its tokenizing behavior by
developing a plugin?
Ansi
On Fri, 14 Jan 2005 16:35:04 +0800, ansi <[EMAIL PROTECTED]> wrote:
> hi,all
>
> I found Nutch uses StandardAnalyzer to index Chinese.
> I'd like t
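As background to the question: StandardAnalyzer tends to break Chinese text into single characters, while a common alternative (the approach taken by CJK-style analyzers) is overlapping character bigrams. A plain-Java sketch of bigram tokenization, independent of any Nutch plugin API:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Overlapping character bigrams: each adjacent pair of characters
    // becomes one token, so "ABCD" yields "AB", "BC", "CD".
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

Bigrams trade index size for better phrase matching than single-character tokens, which is why they are a popular default for CJK text.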
Hi,
I worked on Werner Ramekers' Excel plugin, and Stephan Strittmatter also
did some work, but I'm not sure whether it was on Excel or PowerPoint.
I've corrected Werner's code but also adapted it to my own framework for the
project I'm currently working on.
There might be a few changes to do before in