Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Michael Ji
Hi, How does it perform if the size of the domain list reaches 10,000? Michael Ji, --- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote: > [ > http://issues.apache.org/jira/browse/NUTCH-100?page=all > ] > > Gal Nitzan updated NUTCH-100: > - > >type

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Gal Nitzan
Hi Michael, At the moment I have about 3000 domains in my db. I haven't timed the performance; however, even 100k domains shouldn't have an impact, since the list is fetched only once from the database into the cache. A small performance hit should appear above 100k (depends on number elements defined

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread ogjunk-nutch
Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Thanks, Otis --- Gal Nitzan <[EMAIL PROTECTED]> wrote: > Hi Michael, > > At the moment I have about 3000 domains in my db. I didn't time t

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Slightly off-topic, but I hope this is relevant to the original reason for creating this plugin... There is a B

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-11 Thread Gal Nitzan
Hi Otis, I have only a few thousand URLs in my db at the moment. However, for 100K it should be about 600-800KB. I do not cache the URL itself, only a hash string. So the next time a URL is searched in the cache, if the hash exists then it is allowed. Regards, Gal [EMAIL PROTECTED] wrote
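
The hash-only cache described in this message could look roughly like the sketch below. It is not the plugin's actual code: the class name `HashedDomainCache`, the choice of MD5 truncated to 64 bits, and keying on the host rather than the full URL are all assumptions made for illustration.

```java
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: keep only a fixed-size hash per allowed domain,
// so memory grows with the number of entries, not with their length.
class HashedDomainCache {
    private final Set<Long> allowed = new HashSet<>();

    // In the plugin the list would be read once from the database;
    // here it is simply passed in.
    HashedDomainCache(List<String> domains) {
        for (String d : domains) {
            allowed.add(hash(d.toLowerCase(Locale.ROOT)));
        }
    }

    // First 8 bytes of the MD5 digest, used as a 64-bit key (an assumption).
    static long hash(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A URL is allowed when the hash of its host is present in the set.
    boolean isAllowed(String url) {
        String host = URI.create(url).getHost();
        return host != null && allowed.contains(hash(host.toLowerCase(Locale.ROOT)));
    }
}
```

A lookup is a single hash computation plus one set probe, so its cost does not depend on how many domains are loaded. With truncated hashes a rare collision could let an unlisted host through.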

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-11 Thread Gal Nitzan
Hi, Well, the reason for this plugin is that I wish to crawl many sites, but they all must be in my list. If it were implemented with regular expressions, the filter would still have to loop over 100K expressions on each URL for a match, right? Regards, Gal Andrzej Bialecki wrote: [EMAIL PROTECTE

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi, Well, the reason for this plugin is that I wish to crawl many sites, but they all must be in my list. If it were implemented with regular expressions, the filter would still have to loop over 100K expressions on each URL for a match, right? No, that's the whole point - using t

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Gal Nitzan
Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :). Regards, Gal Andrzej Bialecki wrote: Gal Nitzan wrote: Hi, Well, the reason for this plugin is that I wish to crawl many sites, but they all

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :). If you refer to the brics automaton library, it's BSD-licensed. I mentioned in one of the posts that the Innovation httpclient

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Doug Cutting
Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings & expressions) and it should be very fast to
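
To illustrate why a deterministic automaton built from a large set of strings does not need to loop over every entry, here is a minimal sketch using a plain trie over host labels. It is a swapped-in stand-in, not the minimized automaton discussed in the thread and not the automaton library's API; the class name `DomainTrie` and the suffix-matching rule are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a trie is a simple (non-minimized) deterministic
// automaton. Matching walks one transition per host label, so the cost
// depends on the length of the host, not on how many domains are stored.
class DomainTrie {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean accepting; // true if a listed domain ends at this node
    }

    private final Node root = new Node();

    // Insert a domain with its labels reversed: "apache.org" -> org, apache.
    void add(String domain) {
        String[] labels = domain.toLowerCase(Locale.ROOT).split("\\.");
        Node n = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            n = n.next.computeIfAbsent(labels[i], k -> new Node());
        }
        n.accepting = true;
    }

    // True if the host equals a listed domain or is a subdomain of one.
    boolean matches(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        Node n = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            n = n.next.get(labels[i]);
            if (n == null) {
                return false; // no transition: reject immediately
            }
            if (n.accepting) {
                return true;
            }
        }
        return false;
    }
}
```

With 100K domains loaded, a call such as `matches("www.apache.org")` still follows at most three transitions. A minimized automaton would additionally share common suffixes of the stored strings to save memory.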

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings & expressions) and it