On Jul 29, 2009, at 11:51 AM, Andrzej Bialecki wrote:

Marko Bauhardt wrote:
Hi,
I know you are working on the new "plugin system", OSGi etc., but I want to talk about new extension points. I think it would be helpful if we had, for example, extension points IPreCrawl and IPostCrawl. These extension points could be used to implement some helpful jobs. For example, before starting a new crawl, an implementation of IPreCrawl could
+ export URLs from a "database" into a URL file for injecting into the crawldb,
+ or create statistics.
Once a crawl is finished, an implementation of IPostCrawl could
+ restart search servers,
+ switch the index,
+ create statistics for the complete crawl,
+ or send an email (or whatever) to an administrator...
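A minimal sketch of what such extension points might look like. All names here (IPreCrawl, IPostCrawl, CrawlContext, AdminMailNotifier) are hypothetical illustrations for this discussion, not existing Nutch APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point, called synchronously before a crawl
// starts, e.g. to export URLs from a database into a seed file.
interface IPreCrawl {
    void preCrawl(CrawlContext context);
}

// Hypothetical extension point, called synchronously after a crawl
// finishes, e.g. to switch the live index or notify an administrator.
interface IPostCrawl {
    void postCrawl(CrawlContext context);
}

// Minimal context object carrying crawl information to the plugins.
class CrawlContext {
    final String crawlDir;
    final List<String> log = new ArrayList<>();
    CrawlContext(String crawlDir) { this.crawlDir = crawlDir; }
}

// Example IPostCrawl implementation: record a notification for the admin.
class AdminMailNotifier implements IPostCrawl {
    public void postCrawl(CrawlContext ctx) {
        ctx.log.add("mail: crawl in " + ctx.crawlDir + " finished");
    }
}

public class ExtensionPointSketch {
    public static void main(String[] args) {
        CrawlContext ctx = new CrawlContext("/data/crawl-2009-07-29");
        new AdminMailNotifier().postCrawl(ctx);
        System.out.println(ctx.log.get(0));
    }
}
```

The crawl tool would simply look up all registered IPreCrawl/IPostCrawl plugins and invoke them in order around the crawl.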

This looks to me less like an extension point and more like a notification system, e.g. JMS-based. Currently the execution of plugins in extension points is synchronous, i.e. the calling application is blocked until the plugin completes its execution. Most likely you want asynchronous execution here?

Hi Andrzej,
No, I want synchronous execution. My 'problem' is that I want to execute some MapReduce jobs after a shard is fetched, or before a complete crawl starts or after it ends. For that reason I have to execute these jobs synchronously. But currently no plugins are MapReduce jobs, so I agree that creating these extension points is not the best solution for this requirement.
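The synchronous requirement can be sketched in plain Java (not Nutch code; SyncJobSketch and runStatisticsJob are made-up names). The point is that the caller blocks on the job before the next crawl step starts, in the same way that the old Hadoop API's blocking JobClient.runJob differs from the non-blocking submitJob:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class SyncJobSketch {

    static final AtomicBoolean statsDone = new AtomicBoolean(false);

    // Stand-in for a MapReduce job, e.g. a host-statistics job.
    static void runStatisticsJob() {
        statsDone.set(true);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> job = pool.submit(SyncJobSketch::runStatisticsJob);
        // Synchronous execution: block until the job completes before
        // the next crawl step is allowed to start.
        job.get();
        System.out.println("stats done before next step: " + statsDone.get());
        pool.shutdown();
    }
}
```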


Also, I think statistics about a segment or the crawldb are very important to get an overview of the URL space. So maybe another extension point (e.g. ISegmentStatistic) could be used to create statistics for every segment after it has been fetched.

I agree - segment parts are immutable, so once they are created their statistics are also immutable. It would make even more sense to collect such stats on-the-fly as each part is being created, and then write them out to a per-segment metadata file.


OK. But how can I create a HostStatistic (how many URLs were fetched from each host) on the fly? Currently I have a MapReduce job that creates the host statistic, but I can only start this job after the shard has been fetched.
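The core logic of such a host statistic can be sketched in plain Java (countByHost is a hypothetical helper standing in for the map phase that emits (host, 1) pairs and the reduce phase that sums them in the actual MapReduce job):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostStatisticSketch {

    // Group fetched URLs by host and count them. In the real setting
    // this is the map (emit host, 1) and reduce (sum) phases of a
    // MapReduce job over a fetched shard.
    static Map<String, Integer> countByHost(List<String> fetchedUrls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : fetchedUrls) {
            String host = URI.create(url).getHost();
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> stats = countByHost(List.of(
                "http://example.com/a",
                "http://example.com/b",
                "http://apache.org/nutch"));
        System.out.println(stats);
    }
}
```

Collecting this on the fly would mean updating such a per-host counter as each fetched record is written, instead of scanning the finished shard afterwards.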


BTW: we really need to move away from using the name "segment", which is for many reasons confusing, and towards using the name "shard" which seems to be the commonly used name for this kind of data.

Ok.

btw:
We are working on an update of the administration UI, and we have some plugins and MapReduce jobs that we execute, e.g. URL export into a URL file for injecting, metadata indexing, black/white filtering, statistics etc. So we are trying to find a way to use all this stuff without implementing a fixed crawl tool; I want to keep it pluggable.

The question is how I can manage to distribute an admin UI with different functionality. For example, one version with a black/white filter and one without. Or maybe one version uses a HostStatistic and another version a different statistic, etc.



Marko




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com


