Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Gal Nitzan Sun, 09 Oct 2005 11:26:31 -0700

Hi Michael,

At the moment I have about 3000 domains in my db. I didn't time the performance; however, having even 100k domains shouldn't have an impact, since it is fetched only once from the database to the cache. A small performance hit should appear only above 100k (depending on the number of elements defined in the xml file).

After a few birth problems, the plugin works nicely and I do not feel any impact.


Regards,

Gal


Michael Ji wrote:

hi,

Is performance a concern if the size of the domain list
reaches 10,000?

Michael Ji

--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:

[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

           type: Improvement  (was: New Feature)

    Description:
Hi,

I have written a new plugin, based on the URLFilter
interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


  was:
Hi,

I have written (not much) a new plugin, based on the
URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


    Environment: All Nutch versions  (was: MapRed)

Fixed some issues
clean up
Added a patch for Subversion

New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100

     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.8-dev
 Environment: All Nutch versions
    Reporter: Gal Nitzan
    Priority: Trivial
 Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz

Hi,

I have written a new plugin, based on the URLFilter
interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter

The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml
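
The filter steps above could be sketched in Java roughly as follows. This is
only an illustration, not the code in the attached patch: the class name
DbUrlFilterSketch, the Predicate standing in for the JDBC lookup, and the
LinkedHashMap used in place of SwarmCache are all assumptions. As in the
pseudocode, only domains found in the database are cached, and the filter
returns the URL when its domain is allowed and null otherwise.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a cache-then-database domain filter. */
class DbUrlFilterSketch {

    // Stand-in for the database query "is this domain in the table?" (assumption).
    private final Predicate<String> database;

    // Small LRU map standing in for SwarmCache (assumption).
    private final Map<String, Boolean> cache;

    DbUrlFilterSketch(final int cacheSize, Predicate<String> database) {
        this.database = database;
        // accessOrder = true makes the map evict the least recently used entry.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its domain is allowed, or null to drop it. */
    synchronized String filter(String url) {
        String domain = hostOf(url);
        if (domain == null) {
            return null;                      // unparsable URL or no host part
        }
        if (cache.get(domain) != null) {
            return url;                       // cache hit: domain already known
        }
        if (database.test(domain)) {
            cache.put(domain, Boolean.TRUE);  // found in database: cache it
            return url;
        }
        return null;                          // not in cache, not in database
    }

    /** Extracts the lower-cased host name, or null if there is none. */
    private static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With a lookup such as d -> d.equals("example.org"), filter("http://example.org/a")
returns the URL unchanged and filter("http://other.com/") returns null; a second
URL on example.org is answered from the cache without calling the lookup again.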

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of
the administrators:

http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
