Re: Static initializers
Hi,

This is what I did to make NutchConf behave less statically, without patching any of those 195 places Stefan mentioned. NutchConf.get() yields the current config, OpenConf sets a new current config, and finally CloseConf closes this config. But be warned about the issues with the plugin cache mentioned earlier.

Greetings,
Marcel Schnippe

import java.util.Stack;

public class NutchConf {
  //...
  public static final NutchConf DEFAULT = new NutchConf();

  // One stack of configurations per thread, seeded with the default.
  private static ThreadLocal threadNutchConf = new ThreadLocal() {
    protected synchronized Object initialValue() {
      Stack confs = new Stack();
      confs.add(NutchConf.DEFAULT);
      return confs;
    }
  };

  /** Return the current default configuration. (see {@link #OpenConf}) */
  public static NutchConf get() {
    return (NutchConf) (((Stack) (threadNutchConf.get())).lastElement());
  }

  /** Open new thread specific configuration, which will be returned by
   * calls to {@link #get} until finally closed by {@link #CloseConf}.
   * @param conf a NutchConf generated with new NutchConf and {@link #addConfResource}.
   */
  public static void OpenConf(NutchConf conf) {
    Stack confs = (Stack) (threadNutchConf.get());
    confs.add(conf);
  }

  /** Close configuration opened by {@link #OpenConf}, return to previous or default+site configuration */
  public static void CloseConf() {
    ((Stack) (threadNutchConf.get())).pop();
  }
  //...
}
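A minimal sketch of how the open/close pair above might be used so that the previous configuration is always restored. The class StackedConf is a hypothetical stand-in for NutchConf (only the stacking behaviour is modelled), and the guard against popping the default entry is an addition that is not in the code above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for NutchConf; only the per-thread stacking is modelled.
class StackedConf {
    static final StackedConf DEFAULT = new StackedConf("default");

    // Each thread gets its own stack, seeded with the default configuration.
    private static final ThreadLocal<Deque<StackedConf>> CONFS =
        ThreadLocal.withInitial(() -> {
            Deque<StackedConf> stack = new ArrayDeque<>();
            stack.push(DEFAULT);
            return stack;
        });

    final String name;

    StackedConf(String name) {
        this.name = name;
    }

    // Current configuration of the calling thread (top of its stack).
    static StackedConf get() {
        return CONFS.get().peek();
    }

    static void open(StackedConf conf) {
        CONFS.get().push(conf);
    }

    static void close() {
        Deque<StackedConf> stack = CONFS.get();
        // Assumption: an unbalanced close() should never remove the default entry.
        if (stack.size() > 1) {
            stack.pop();
        }
    }
}

class StackedConfDemo {
    // Makes conf current for the duration of the call and restores the previous one afterwards.
    static String currentNameWhile(StackedConf conf) {
        StackedConf.open(conf);
        try {
            return StackedConf.get().name;
        } finally {
            StackedConf.close();
        }
    }
}
```

Because the stack lives in a plain ThreadLocal, a thread started while a configuration is open begins with its own stack holding only the default, so each worker thread would have to open the configuration itself.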
Re: Static initializers
Andrzej,

I'm not done digging into the problem yet, but I want to ask some more questions. BTW, I counted 195 places that use NutchConf.get(), so this will be a bigger patch. :)

As I mentioned, I would love to go the inversion of control way: not passing NutchConf in the constructor, but making classes implement the Configurable interface. This would be sensible, for example, for all classes that implement an extension. But there are also classes where it makes no sense. For example, I would suggest changing the plugin registry from a singleton to a 'normal' object; in that case I guess it makes sense to pass the NutchConf in the constructor, since the configuration here only needs to provide the include and exclude regexes for the plugins.

So:

> Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)

This makes sense. Here we can check whether the class implements the Configurable interface and, if so, instantiate the object and set the configuration.

> ExtensionPoint.getExtensions() -> getExtensions(NutchConf)

We don't need NutchConf here. If I understand it correctly, it is only needed to identify the activated plugins, and that is done at registry instantiation, which in this case takes a NutchConf as a parameter.

> PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)

We don't need it here either, since we use the NutchConf at registry instantiation. Otherwise we would have to build up the plugin dependency graph for each method call. Would you agree to having several plugin registries, possibly with different NutchConfs, and instantiating extensions with a NutchConf, but not querying ExtensionPoints etc. with one?

> etc, etc... The way this would work would be similar to the mechanism
> described above: if plugin instances are not created yet, they would be
> created once (based on the current NutchConf argument), and then cached
> in this NutchConf instance.

I guess this is difficult. First, we have the plugin class instances; most or maybe all plugins I know do not have a plugin class implementation. Second, we have the extension classes, which at least do not need to implement a specific interface from the plugin registry's point of view (only things like the index filter interface etc.). Caching plugin class instances makes sense, since there is actually only one plugin class instance per plugin in the JVM. However, there will be many instances of each extension class, since e.g. the parser or protocol runs multithreaded.

> And also the plugin implementations would have to extend
> NutchConfigured, taking NutchConf as the argument to their constructors -
> because now the Extension.getExtensionInstance would pass the current
> NutchConf instance to their constructors.

In general my point of view is this: if we touch this issue anyway, I would love to do a radical solution, since I have a different understanding of handling parameters than collecting them in a kind of map and making the map generally accessible. Instead of giving every object access to the configuration object and handling properties like a bazaar, I would prefer to handle configuration only in the first object in the stack, which in our case would be, for example, the indexing tool. Then the indexing tool instantiates the plugin registry with only the required properties, which would be part of the constructor, e.g. pluginFolders, include and exclude regexes, and the auto-activation flag. Later the extension instances can also get some more values injected, but in general they have no access to the configuration object. This would first of all make things more testable, but it would also allow much more flexibility to run several different fetchers in one JVM.

Anyway, this may be an improvement suggestion from me for Nutch 2.0 or 3.0; for now we would be some steps forward just by changing NutchConf access to a non-static style.

I hope to find some time in the next days to do some experiments, and I will come back with more details. However, we should find a general agreement about the way we go, since changing code in 195 places, and the lines that depend on them, for nothing is not much fun.

Stefan
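The "radical" variant described above could be sketched roughly as follows: only the top-level tool reads the configuration, and the registry receives plain values through its constructor. SimplePluginRegistry and IndexingToolSketch are hypothetical names, java.util.Properties stands in for NutchConf, and the property keys and defaults are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical registry: it is handed only the values it needs, never a configuration object.
class SimplePluginRegistry {
    private final List<String> pluginFolders;
    private final Pattern include;
    private final Pattern exclude;
    private final boolean autoActivation;

    SimplePluginRegistry(List<String> pluginFolders, String includeRegex,
                         String excludeRegex, boolean autoActivation) {
        this.pluginFolders = new ArrayList<>(pluginFolders);
        this.include = Pattern.compile(includeRegex);
        this.exclude = Pattern.compile(excludeRegex);
        this.autoActivation = autoActivation;
    }

    // A plugin id is active when it matches the include regex and not the exclude regex.
    boolean isActive(String pluginId) {
        return include.matcher(pluginId).matches()
            && !exclude.matcher(pluginId).matches();
    }

    List<String> getPluginFolders() {
        return pluginFolders;
    }

    boolean isAutoActivation() {
        return autoActivation;
    }
}

// Hypothetical top-level tool: the only place that touches the configuration.
class IndexingToolSketch {
    static SimplePluginRegistry buildRegistry(Properties conf) {
        // Assumed property keys and defaults; an empty exclude regex excludes no real plugin id.
        return new SimplePluginRegistry(
            Arrays.asList(conf.getProperty("plugin.folders", "plugins").split(",")),
            conf.getProperty("plugin.includes", ".*"),
            conf.getProperty("plugin.excludes", ""),
            Boolean.parseBoolean(conf.getProperty("plugin.auto-activation", "true")));
    }
}
```

With this shape, two registries built from different settings can coexist in one JVM, and the registry can be tested by passing constructor arguments directly.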
Static initializers
Hi,

This was mentioned before: there are many places in Nutch that rely on static initializers. This is so-so or sometimes plainly bad, depending on the situation.

I'm facing a problem now with URLFilters. I need to run several fetchers inside a single VM, with different parameters such as different URL patterns (which are handled by URLFilters). But even if I specify a different NutchConf for each fetcher, the list of implementations and the instances of URLFilter[] in URLFilters are initialized only once, and this happens from the default configuration obtained through a call to the static NutchConf.get().

I would like to change it somehow, but I'm not sure how... One way to solve this would be to instantiate the plugins based on a concrete NutchConf instance, like this:

URLFilters:

  private URLFilters(NutchConf conf) {
    // initialize plugins based on this instance of NutchConf
  }

  public static URLFilters get(NutchConf conf) {
    // look up the instance cached in this particular NutchConf
    URLFilters res = (URLFilters)conf.get(urlfilters.key);
    if (res == null) {
      res = new URLFilters(conf);
      conf.put(urlfilters.key, res);
    }
    return res;
  }

If you are running with a single NutchConf per JVM, this doesn't change anything. If you want to run several different configs in a single JVM, this approach provides the solution.

We could follow this strategy for other plugin registry facades. Comments?

--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
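A self-contained sketch of the per-configuration caching proposed above. CachingConf and UrlFilterSet are hypothetical stand-ins for NutchConf and URLFilters; the object cache on the configuration (getObject/putObject) and the single URL pattern are assumptions for illustration, not existing NutchConf API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical stand-in for NutchConf that can also cache arbitrary objects.
class CachingConf {
    private final Map<String, Object> objects = new ConcurrentHashMap<>();
    private final String urlPattern;

    CachingConf(String urlPattern) {
        this.urlPattern = urlPattern;
    }

    String getUrlPattern() {
        return urlPattern;
    }

    Object getObject(String key) {
        return objects.get(key);
    }

    void putObject(String key, Object value) {
        objects.put(key, value);
    }
}

// Hypothetical stand-in for URLFilters: one instance per configuration, created lazily.
class UrlFilterSet {
    private static final String KEY = UrlFilterSet.class.getName();

    private final Pattern pattern;

    private UrlFilterSet(CachingConf conf) {
        // Initialize from this particular configuration, not from a static default.
        this.pattern = Pattern.compile(conf.getUrlPattern());
    }

    // synchronized so that two threads cannot create two instances for the same conf.
    static synchronized UrlFilterSet get(CachingConf conf) {
        UrlFilterSet res = (UrlFilterSet) conf.getObject(KEY);
        if (res == null) {
            res = new UrlFilterSet(conf);
            conf.putObject(KEY, res);
        }
        return res;
    }

    boolean accepts(String url) {
        return pattern.matcher(url).matches();
    }
}
```

Two fetchers holding different CachingConf instances would each get their own UrlFilterSet, while repeated calls with the same configuration return the cached instance.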
Re: Static initializers
Andrzej Bialecki wrote:

> URLFilters:
>
>   private URLFilters(NutchConf conf) {
>     // initialize plugins based on this instance of NutchConf
>   }
>
>   public static URLFilters get(NutchConf conf) {
>     URLFilters res = (URLFilters)conf.get(urlfilters.key);
>     if (res == null) {
>       res = new URLFilters(conf);
>       conf.put(urlfilters.key, res);
>     }
>     return res;
>   }

Looking deeper, this is messier than I thought... Some changes would be required to the plugin instantiation mechanisms, e.g.:

  Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
  ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
  PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)

etc, etc...

The way this would work would be similar to the mechanism described above: if plugin instances are not created yet, they would be created once (based on the current NutchConf argument), and then cached in this NutchConf instance.

And also the plugin implementations would have to extend NutchConfigured, taking NutchConf as the argument to their constructors - because now the Extension.getExtensionInstance would pass the current NutchConf instance to their constructors.

--
Best regards,
Andrzej Bialecki
Re: Static initializers
Andrzej,

How do you choose the NutchConf to use?

Here is a short discussion I had with Doug about a kind of dynamic NutchConf inside the same JVM:

...
Looking at the mailing list archives, it seems that having some behavior depend on the document's URL is a recurrent problem (for instance for boosting documents matching a URL pattern - the NUTCH-16 issue, and many other topics).

So, our idea is to provide a dynamic Nutch configuration (one that overrides the default one, like nutch-site does) based on documents matching URL patterns. The idea is as follows:

1. The default configuration is, as usual, the nutch-default.xml file.
2. An XML file can map URL regexps to other configurations (that override nutch-default):

  <nutch:conf>
    <url regexp="http://www.mydomain1.com/*">
      <!-- A set of nutch properties that override the nutch-default for this domain -->
      <property>
        <name>property1</name>
        <value>value1</value>
      </property>
    </url>
  </nutch:conf>

What do you think about this?

> Looking deeper, this is messier than I thought... Some changes would be
> required to the plugin instantiation mechanisms, e.g.:
>
>   Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
>   ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
>   PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)
>
> etc, etc...
>
> The way this would work would be similar to the mechanism described
> above: if plugin instances are not created yet, they would be created
> once (based on the current NutchConf argument), and then cached in this
> NutchConf instance.
>
> And also the plugin implementations would have to extend
> NutchConfigured, taking NutchConf as the argument to their constructors -
> because now the Extension.getExtensionInstance would pass the current
> NutchConf instance to their constructors.

That's exactly what I had in mind while speaking about a dynamic NutchConf with Doug.
For me it's a +1.
The only thing I don't really like is extending NutchConfigured, but it is the safest way to implement it.

Regards
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
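The URL-dependent override idea above might look roughly like this in code. UrlScopedConf is a hypothetical class, java.util.Properties stands in for the parsed configuration files, and the first-matching-rule-wins lookup order is an assumption. Note that the sketch matches the whole URL against the regex, so the pattern is written as `http://www\.mydomain1\.com/.*` rather than the glob-like form shown in the XML example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical configuration that applies per-URL overrides on top of defaults.
class UrlScopedConf {
    // One override rule: a URL regex plus the properties that apply to matching URLs.
    private static final class Rule {
        final Pattern urlPattern;
        final Properties overrides;

        Rule(Pattern urlPattern, Properties overrides) {
            this.urlPattern = urlPattern;
            this.overrides = overrides;
        }
    }

    private final Properties defaults;
    private final List<Rule> rules = new ArrayList<>();

    UrlScopedConf(Properties defaults) {
        this.defaults = defaults;
    }

    void addOverride(String urlRegex, Properties overrides) {
        rules.add(new Rule(Pattern.compile(urlRegex), overrides));
    }

    // Returns the value from the first matching rule that defines the property, else the default.
    String get(String url, String name) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                String value = rule.overrides.getProperty(name);
                if (value != null) {
                    return value;
                }
            }
        }
        return defaults.getProperty(name);
    }
}
```

A caller would ask for a property together with the URL of the document being processed, for example `conf.get(url, "property1")`.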
Re: Static initializers
Jérôme Charron wrote:

> Andrzej,
>
> How do you choose the NutchConf to use?

It is provided as an argument to all constructors.

> Here is a short discussion I had with Doug about a kind of dynamic
> NutchConf inside the same JVM:
>
> ...
> Looking at the mailing list archives, it seems that having some behavior
> depend on the document's URL is a recurrent problem (for instance for
> boosting documents matching a URL pattern - the NUTCH-16 issue, and many
> other topics).
>
> So, our idea is to provide a dynamic Nutch configuration (one that
> overrides the default one, like nutch-site does) based on documents
> matching URL patterns. The idea is as follows:

Well, it's a neat idea, but it's not necessarily what I was proposing. My proposal could be the first step to implement this.

> 1. The default configuration is, as usual, the nutch-default.xml file.
> 2. An XML file can map URL regexps to other configurations (that
>    override nutch-default):
>
>   <nutch:conf>
>     <url regexp="http://www.mydomain1.com/*">
>       <!-- A set of nutch properties that override the nutch-default for this domain -->
>       <property>
>         <name>property1</name>
>         <value>value1</value>
>       </property>
>     </url>
>   </nutch:conf>
>
> What do you think about this?

Yes, if you can specify different configs for every run, or even for every invocation, it's certainly possible.

>> Looking deeper, this is messier than I thought... Some changes would be
>> required to the plugin instantiation mechanisms, e.g.:
>>
>>   Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
>>   ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
>>   PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)
>>
>> etc, etc...
>>
>> The way this would work would be similar to the mechanism described
>> above: if plugin instances are not created yet, they would be created
>> once (based on the current NutchConf argument), and then cached in this
>> NutchConf instance.
>>
>> And also the plugin implementations would have to extend
>> NutchConfigured, taking NutchConf as the argument to their constructors -
>> because now the Extension.getExtensionInstance would pass the current
>> NutchConf instance to their constructors.
>
> That's exactly what I had in mind while speaking about a dynamic
> NutchConf with Doug.
> For me it's a +1.
> The only thing I don't really like is extending NutchConfigured, but it
> is the safest way to implement it.

Well, it's a form of enforcing a contract for the constructors. There is no other way to do it in Java - you can't specify the required constructors in an interface. OTOH you have the NutchConfigurable interface, which we could use instead, but then you have to remember to call setConf() before you do anything else...

I'll work on this to see where it leads.

--
Best regards,
Andrzej Bialecki
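The two options weighed above (a constructor that takes the configuration versus a setConf() call right after construction) can both be handled at the single place where extensions are instantiated. The following is only a sketch under assumptions: DemoConf, DemoConfigurable, and ExtensionFactory are hypothetical names, not Nutch classes, and the factory simply prefers a matching constructor and falls back to the setter.

```java
// Hypothetical stand-in for NutchConf.
class DemoConf {
    final String name;

    DemoConf(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for the configurable interface discussed above.
interface DemoConfigurable {
    void setConf(DemoConf conf);

    DemoConf getConf();
}

class ExtensionFactory {
    // Prefers a constructor taking the configuration; otherwise uses the
    // no-arg constructor and calls setConf() before handing the object out.
    static Object newInstance(Class<?> clazz, DemoConf conf) {
        try {
            try {
                return clazz.getDeclaredConstructor(DemoConf.class).newInstance(conf);
            } catch (NoSuchMethodException noConfConstructor) {
                Object instance = clazz.getDeclaredConstructor().newInstance();
                if (instance instanceof DemoConfigurable) {
                    ((DemoConfigurable) instance).setConf(conf);
                }
                return instance;
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}

// Extension following the constructor contract.
class CtorFilter {
    final DemoConf conf;

    CtorFilter(DemoConf conf) {
        this.conf = conf;
    }
}

// Extension following the setter contract.
class SetterFilter implements DemoConfigurable {
    private DemoConf conf;

    public void setConf(DemoConf conf) {
        this.conf = conf;
    }

    public DemoConf getConf() {
        return conf;
    }
}
```

Since the factory calls setConf() itself, extension authors using the setter style would not need to remember that step, which addresses the concern raised above, at the cost of a reflective lookup per instantiation.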
Re: Static initializers
Hi,

right, this is a known problem that has been discussed several times; we should start solving it. :-)

I suggest that we make the plugin class implement the Configurable interface. If a plugin needs any configuration values, it will request them from the plugin instance. The next step would be changing the plugin registry from a singleton to a normal object that needs to be instantiated with a Nutch configuration object in the constructor.

In general I suggest we use an inversion-of-control style mechanism (http://www.martinfowler.com/articles/injection.html) to solve these kinds of problems. From my point of view this is the cleanest possible solution, and it also allows changing e.g. configuration objects at runtime.

Stefan

On 20.12.2005 at 14:19, Andrzej Bialecki wrote:

> Hi,
>
> This was mentioned before: there are many places in Nutch that rely on
> static initializers. This is so-so or sometimes plainly bad, depending
> on the situation.
>
> I'm facing a problem now with URLFilters. I need to run several fetchers
> inside a single VM, with different parameters such as different URL
> patterns (which are handled by URLFilters). But even if I specify a
> different NutchConf for each fetcher, the list of implementations and
> the instances of URLFilter[] in URLFilters are initialized only once,
> and this happens from the default configuration obtained through a call
> to the static NutchConf.get().
>
> I would like to change it somehow, but I'm not sure how... One way to
> solve this would be to instantiate the plugins based on a concrete
> NutchConf instance, like this:
>
> URLFilters:
>
>   private URLFilters(NutchConf conf) {
>     // initialize plugins based on this instance of NutchConf
>   }
>
>   public static URLFilters get(NutchConf conf) {
>     URLFilters res = (URLFilters)conf.get(urlfilters.key);
>     if (res == null) {
>       res = new URLFilters(conf);
>       conf.put(urlfilters.key, res);
>     }
>     return res;
>   }
>
> If you are running with a single NutchConf per JVM, this doesn't change
> anything. If you want to run several different configs in a single JVM,
> this approach provides the solution.
>
> We could follow this strategy for other plugin registry facades.
> Comments?
>
> --
> Best regards,
> Andrzej Bialecki