Koch Martina wrote:
Hi,

has anyone built a parsing plugin which decides on a per host basis how the 
content of the document should be parsed?

For example, if the title of a document is in the first <h1> tag of a page for host1,
but the title for a document of host2 is in the third <h2> tag, the plugin would
extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

- Identify the host of a document
- Read and cache instructions on how to get the information for that host
  (database or config file)
- Execute the host-specific plugin

Do you have any suggestions on how to implement such a scenario efficiently? 
Has anyone implemented something similar and can point out possible 
performance issues or other critical issues to be considered?

Yes, and yes. With the current plugin system you can create a new "dispatcher" plugin and declare the other plugins it needs as <import> elements in its plugin descriptor. That makes their classes accessible from the same classloader, so you can instantiate them directly in your dispatcher plugin.
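To make the dispatching idea concrete, here is a minimal sketch in plain Java. It deliberately does not use any Nutch classes: TitleExtractor, NthTagExtractor and HostDispatcher are hypothetical names made up for illustration, and the regex-based tag matching is only a stand-in for walking the parsed DOM.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical interface; stands in for whatever the host-specific plugins expose.
interface TitleExtractor {
    String extractTitle(String html);
}

// Returns the text of the n-th occurrence (1-based) of a given tag.
// Regex-based for brevity only; a real parser would walk the DOM instead.
class NthTagExtractor implements TitleExtractor {
    private final Pattern pattern;
    private final int n;

    NthTagExtractor(String tag, int n) {
        this.pattern = Pattern.compile(
            "<" + tag + "\\b[^>]*>(.*?)</" + tag + ">",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        this.n = n;
    }

    @Override
    public String extractTitle(String html) {
        Matcher m = pattern.matcher(html);
        int seen = 0;
        while (m.find()) {
            if (++seen == n) {
                return m.group(1).trim();
            }
        }
        return null; // fewer than n matching tags on the page
    }
}

// Picks an extractor by the host part of the URL, with a fallback for unknown hosts.
class HostDispatcher {
    private final Map<String, TitleExtractor> byHost = new HashMap<>();
    private final TitleExtractor fallback;

    HostDispatcher(TitleExtractor fallback) {
        this.fallback = fallback;
    }

    void register(String host, TitleExtractor extractor) {
        byHost.put(host.toLowerCase(Locale.ROOT), extractor);
    }

    String extractTitle(String url, String html) {
        String host = URI.create(url).getHost();
        TitleExtractor extractor = (host == null)
            ? fallback
            : byHost.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return extractor.extractTitle(html);
    }
}
```

In a real plugin the map would be filled from configuration at startup, and the extractors would be the classes made visible through the imports mentioned above.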

As for the lookup ... many solutions are possible. DB connections from map tasks may be problematic, both because of latency and the cost of setting up so many DB connections. OTOH, if you add local caching (using JCS or Ehcache) the hit/miss ratio should be decent enough. If the mapping of host names to plugins can be expressed by rules then maybe a simple rule set would be enough.
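As a rough illustration of the local caching idea, the sketch below puts a small in-memory LRU cache in front of a slow lookup. It uses only java.util and is not the JCS or Ehcache API; HostRuleCache, the loader function and the string-valued "rule" are all assumptions made for the example. The loader stands in for the expensive DB or config read.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-task cache: host name -> parsing rule (here just a String).
class HostRuleCache {
    private final Function<String, String> loader; // stands in for the DB/config lookup
    private final Map<String, String> cache;
    private int loads = 0; // counts how often the slow loader was actually called

    HostRuleCache(final int capacity, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once capacity is exceeded.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    synchronized String lookup(String host) {
        String rule = cache.get(host);
        if (rule == null) {
            rule = loader.apply(host); // cache miss: pay the lookup cost once
            loads++;
            if (rule != null) {
                cache.put(host, rule);
            }
        }
        return rule;
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

Since pages from the same host tend to arrive in batches, even a modest capacity should absorb most lookups; the right size depends on how many distinct hosts a single task sees.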

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
