On Jul 29, 2009, at 11:51 AM, Andrzej Bialecki wrote:

Marko Bauhardt wrote:
Hi,
I know you are working on the new "plugin system", OSGi etc., but I want to talk about new extension points. I think it would be helpful if we had, for example, extension points IPreCrawl and IPostCrawl. These extension points could be used to implement some helpful jobs. For example, before starting a new crawl, an implementation of IPreCrawl could
+ export URLs from a "database" into a URL file for injecting into the crawldb,
+ or create statistics.
Once a crawl is finished, an implementation of IPostCrawl could
+ restart search servers,
+ switch the index,
+ create statistics for the complete crawl,
+ or send an email (or whatever) to an administrator...
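A minimal sketch of what such extension points might look like. All names here (IPreCrawl, IPostCrawl, CrawlContext, AdminMailNotifier) are hypothetical illustrations for this discussion, not existing Nutch APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point, called synchronously before a crawl
// starts, e.g. to export URLs from a database into a seed file.
interface IPreCrawl {
    void preCrawl(CrawlContext context);
}

// Hypothetical extension point, called synchronously after a crawl
// finishes, e.g. to switch the live index or notify an administrator.
interface IPostCrawl {
    void postCrawl(CrawlContext context);
}

// Minimal context object carrying crawl information to the plugins.
class CrawlContext {
    final String crawlDir;
    final List<String> log = new ArrayList<>();
    CrawlContext(String crawlDir) { this.crawlDir = crawlDir; }
}

// Example IPostCrawl implementation: record a notification for the admin.
class AdminMailNotifier implements IPostCrawl {
    public void postCrawl(CrawlContext ctx) {
        ctx.log.add("mail: crawl in " + ctx.crawlDir + " finished");
    }
}

public class ExtensionPointSketch {
    public static void main(String[] args) {
        CrawlContext ctx = new CrawlContext("/data/crawl-2009-07-29");
        new AdminMailNotifier().postCrawl(ctx);
        System.out.println(ctx.log.get(0));
    }
}
```

The crawl tool would simply look up all registered IPreCrawl/IPostCrawl plugins and invoke them in order around the crawl.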

This looks to me less like an extension point and more like a notification system, e.g. JMS-based. Currently the execution of plugins in extension points is synchronous, i.e. the calling application is blocked until the plugin completes its execution. Most likely you want asynchronous execution here?

Hi Andrzej,
No, I want synchronous execution. My 'problem' is that I want to execute some MapReduce jobs after a shard is fetched, or before a complete crawl starts or after it ends. For that reason I have to execute these jobs synchronously. But currently no plugins are MapReduce jobs, so I agree that creating these extension points is not the best solution for this requirement.
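The synchronous requirement can be sketched in plain Java (not Nutch code; SyncJobSketch and runStatisticsJob are made-up names). The point is that the caller blocks on the job before the next crawl step starts, in the same way that the old Hadoop API's blocking JobClient.runJob differs from the non-blocking submitJob:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class SyncJobSketch {

    static final AtomicBoolean statsDone = new AtomicBoolean(false);

    // Stand-in for a MapReduce job, e.g. a host-statistics job.
    static void runStatisticsJob() {
        statsDone.set(true);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> job = pool.submit(SyncJobSketch::runStatisticsJob);
        // Synchronous execution: block until the job completes before
        // the next crawl step is allowed to start.
        job.get();
        System.out.println("stats done before next step: " + statsDone.get());
        pool.shutdown();
    }
}
```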


Also, I think statistics about a segment or the crawldb are very important to get an overview of the URL space. So maybe another extension point (e.g. ISegmentStatistic) could be used to create statistics for every segment after it has been fetched.

I agree - segment parts are immutable, so once they are created their statistics are also immutable. It would make even more sense to collect such stats on-the-fly as each part is being created, and then write them out to a per-segment metadata file.


OK. But how can I create a HostStatistic (how many URLs were fetched from each host) on the fly? Currently I have a MapReduce job that creates the host statistic, but I can only start this job after the shard has been fetched.
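The core logic of such a host statistic can be sketched in plain Java (countByHost is a hypothetical helper standing in for the map phase that emits (host, 1) pairs and the reduce phase that sums them in the actual MapReduce job):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostStatisticSketch {

    // Group fetched URLs by host and count them. In the real setting
    // this is the map (emit host, 1) and reduce (sum) phases of a
    // MapReduce job over a fetched shard.
    static Map<String, Integer> countByHost(List<String> fetchedUrls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : fetchedUrls) {
            String host = URI.create(url).getHost();
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> stats = countByHost(List.of(
                "http://example.com/a",
                "http://example.com/b",
                "http://apache.org/nutch"));
        System.out.println(stats);
    }
}
```

Collecting this on the fly would mean updating such a per-host counter as each fetched record is written, instead of scanning the finished shard afterwards.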


BTW: we really need to move away from using the name "segment", which is for many reasons confusing, and towards using the name "shard" which seems to be the commonly used name for this kind of data.

Ok.

btw:
We are working on an update of the administration UI, and we have some plugins and MapReduce jobs that we execute, e.g. URL export into a URL file for injecting, metadata indexing, black/white filtering, statistics etc. So we are trying to find a way to use all this stuff without implementing a fixed crawl tool; I want to keep it pluggable.

The question is how I can manage to distribute an admin UI with different functionality. For example, one version with a black/white filter and one without. Or maybe one version uses a HostStatistic and another version a different statistic, etc.



Marko




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com


