Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :). If you refer to the brics automaton library, it's BSD-licensed. I mentioned in one of the posts that the Innovation

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Doug Cutting
Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings expressions) and it should be very fast to

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings expressions) and it

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread ogjunk-nutch
Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Thanks, Otis --- Gal Nitzan [EMAIL PROTECTED] wrote: Hi Michael, At the moment I have about 3000 domains in my db. I didn't time the

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Slightly off-topic, but I hope this is relevant to the original reason for creating this plugin... There is a

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Gal Nitzan
Hi Michael, At the moment I have about 3000 domains in my db. I didn't time the performance; however, having even 100k domains shouldn't have an impact, since the list is fetched only once from the database into the cache. A little performance hit should be over 100k (depends on number of elements defined