Plugins initialized all the time!
I'm having big trouble with nutch 0.9 that I didn't have with 0.8. It seems that the plugin repository initializes itself all the time until I get an out of memory exception. I've been looking at the code... the plugin repository maintains a map from Configuration to plugin repositories, but the Configuration object does not have an equals or hashCode method... wouldn't it be nice to add such methods (comparing property values)? Wouldn't that help prevent initializing many plugin repositories? What could be the cause of my problem? (Aaah.. so many questions... =) ) Bye!
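The failure mode being described can be sketched in a few lines. `Conf` below is a hypothetical stand-in for Hadoop's Configuration (which inherits Object's identity-based equals()/hashCode()); the cache mirrors how PluginRepository keys its map by Configuration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for Hadoop's Configuration, which inherits
// Object's identity-based equals()/hashCode().
class Conf {
    final Properties props = new Properties();
}

public class IdentityCacheDemo {
    // A cache keyed by Conf, the way PluginRepository's CACHE is keyed
    // by Configuration.
    static final Map<Conf, String> CACHE = new HashMap<>();

    static String getRepository(Conf conf) {
        // Each distinct Conf instance misses the cache, even when its
        // properties are identical to a previous one.
        return CACHE.computeIfAbsent(conf, c -> "new PluginRepository");
    }

    public static void main(String[] args) {
        Conf a = new Conf();
        Conf b = new Conf();
        a.props.setProperty("plugin.folders", "plugins");
        b.props.setProperty("plugin.folders", "plugins");

        getRepository(a);
        getRepository(b);

        // Two cache entries for two property-wise identical configurations:
        // every new NutchJob (a new Configuration) loads the plugins again.
        System.out.println(CACHE.size()); // prints 2
    }
}
```

Since the keys compare by identity, the map can only grow, which matches the out-of-memory behaviour.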
Re: Plugins initialized all the time!
More info... I see "map" progressing from 0% to 100%. It seems to reload plugins when reaching 100%. Besides, I've realized that each NutchJob is a Configuration, so (as there's no "equals") a plugin repository would be created for each NutchJob...
Re: Plugins initialized all the time!
Hi,

On 5/28/07, Nicolás Lichtmaier <[EMAIL PROTECTED]> wrote:
> It seems that the plugin repository initializes itself all the time
> until I get an out of memory exception. [...] Wouldn't it be nice to
> add such methods (comparing property values)? Wouldn't that help
> prevent initializing many plugin repositories?

Which job causes the problem? Perhaps we can find out what keeps creating a conf object over and over.

Also, I have tried what you have suggested (better caching for the plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

--
Doğacan Güney
Re: Plugins initialized all the time!
I have also noticed this. The code explicitly loads an instance of the plugins for every fetch (well, or parse etc., depending on what you are doing). This causes OutOfMemoryErrors. So, if you dump the heap, you can see the filter classes get loaded and they never get unloaded (they are loaded within their own classloader). So, you'll see the same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded. Basically, I changed all the main plugin loaders (like URLFilters.java, IndexFilters.java) to be singletons with a single 'getInstance()' method on each. I don't need special configs for filters, so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point class(es). It calls newInstance() an awful lot. But the classloader (one per plugin) never gets destroyed, or something, so this can be nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.

On 5/29/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Also, I have tried what you have suggested (better caching for plugin
> repository) and it really seems to make a difference. Can you try with
> this patch(*) to see if it solves your problem?
>
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

--
"Conscious decisions by conscious minds are what make reality real"
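The singleton workaround described above could look roughly like this. `UrlFilters` is an illustrative name, not Nutch's actual URLFilters class, and the constructor body is only a placeholder for the real plugin lookup:

```java
// Hedged sketch of a per-JVM singleton plugin loader with a single
// getInstance() method, as described in the mail. Illustrative only.
public final class UrlFilters {
    private static volatile UrlFilters instance;

    private UrlFilters() {
        // In a real loader, this is where the plugin repository would be
        // consulted once to instantiate the filter plugins.
    }

    public static UrlFilters getInstance() {
        // Double-checked locking: the filters (and their classloaders)
        // are created at most once per JVM, no matter how many
        // Configuration objects float around.
        if (instance == null) {
            synchronized (UrlFilters.class) {
                if (instance == null) {
                    instance = new UrlFilters();
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == getInstance()); // prints true
    }
}
```

The trade-off is exactly the one the author mentions: a singleton works only if no per-job filter configuration is needed.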
Re: Plugins initialized all the time!
On 5/29/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, in my case, I had to change the way the plugins are loaded.
> Basically, I changed all the main plugin loaders (like URLFilters.java,
> IndexFilters.java) to be singletons with a single 'getInstance()'
> method on each.

Well, then can you test the patch too? Nicolás' idea seems to be the right one. After this patch, I think plugin loaders will see the same PluginRepository instance.

--
Doğacan Güney
Re: Plugins initialized all the time!
I'll have to get around to trying this in the future. I have already 'forked' the code, but I would like to get back on track too. So, I guess I will post something someday. The plugin part is now the least of my worries. Again, the parsing is what is killing me now. I don't use nutch in the 'out-of-the-box' fashion. My app is running in a container that crawls when messages to crawl are received.

On 5/29/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Well, then can you test the patch too? Nicolás' idea seems to be the
> right one. After this patch, I think plugin loaders will see the same
> PluginRepository instance.

--
"Conscious decisions by conscious minds are what make reality real"
Re: Plugins initialized all the time!
> Which job causes the problem? Perhaps we can find out what keeps
> creating a conf object over and over.
>
> Also, I have tried what you have suggested (better caching for plugin
> repository) and it really seems to make a difference. Can you try with
> this patch(*) to see if it solves your problem?
>
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Some comments about your patch. The approach seems nice: you only check the parameters that affect plugin loading. But keep in mind that the plugins themselves will configure themselves with many other parameters, so to keep things safe there should be a PluginRepository for each set of parameters (including all of them).

Besides, remember that CACHE is a WeakHashMap, and you are creating ad-hoc PluginProperty objects as keys, so something doesn't look right... the lifespan of those objects will be much shorter than you require. Perhaps you should be using SoftReferences instead, or a simple LRU cache (LinkedHashMap provides that simply).

Anyway, I'll try to build my own Nutch to test your patch. Thanks!
Re: Plugins initialized all the time!
> Also, I have tried what you have suggested (better caching for plugin
> repository) and it really seems to make a difference. Can you try with
> this patch(*) to see if it solves your problem?
>
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

I'm running it. So far it's working ok, and I haven't seen all those plugin loadings... I've modified your patch, though, to define CACHE like this:

    private static final Map CACHE = new LinkedHashMap() {
        @Override
        protected boolean removeEldestEntry(Entry eldest) {
            return size() > 10;
        }
    };

...which means an LRU cache with a fixed size of 10.
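The cache definition above can be written out, with generics added, as a self-contained demo of the eviction behaviour: once an 11th entry goes in, the eldest one is dropped, so at most 10 plugin repositories stay referenced at a time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demo of the fixed-size LRU idiom from the patch modification above.
// Default LinkedHashMap ordering is insertion order, so the entry
// inserted longest ago is evicted first.
public class LruCacheDemo {
    static final Map<String, String> CACHE = new LinkedHashMap<String, String>() {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            // Called after each put(); returning true drops the eldest entry.
            return size() > 10;
        }
    };

    public static void main(String[] args) {
        for (int i = 0; i < 15; i++) {
            CACHE.put("conf" + i, "repo" + i);
        }
        System.out.println(CACHE.size());                // prints 10
        System.out.println(CACHE.containsKey("conf4"));  // prints false (evicted)
        System.out.println(CACHE.containsKey("conf14")); // prints true
    }
}
```

Passing `true` as the third constructor argument (`new LinkedHashMap<>(16, 0.75f, true)`) would evict by least-recent *access* instead of insertion order, which is closer to a textbook LRU.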
Re: Plugins initialized all the time!
Hi,

On 5/29/07, Nicolás Lichtmaier <[EMAIL PROTECTED]> wrote:
> Besides, remember that CACHE is a WeakHashMap, and you are creating
> ad-hoc PluginProperty objects as keys, so something doesn't look
> right... the lifespan of those objects will be much shorter than you
> require. Perhaps you should be using SoftReferences instead, or a
> simple LRU cache (LinkedHashMap provides that simply).

My patch is just a draft to see if we can create a better caching mechanism. There are definitely some rough edges there :) I don't really worry about the WeakHashMap -> LinkedHashMap stuff, but your approach is simple and should be faster, so I guess it's OK.

You are right about per-plugin parameters, but I think it will be very difficult to keep the PluginProperty class in sync with plugin parameters. I mean, if a plugin defines a new parameter, we have to remember to update PluginProperty. Perhaps we can force plugins to define the configuration options they will use in, say, their plugin.xml file, but that would be very error-prone too. I don't want to compare entire configuration objects, because changing irrelevant options, like fetcher.store.content, shouldn't force loading plugins again, though it seems it may be inevitable.

> Anyway, I'll try to build my own Nutch to test your patch. Thanks!

--
Doğacan Güney
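The PluginProperty approach being discussed amounts to a value-semantics cache key built from only the settings that affect plugin loading. A hedged sketch, with illustrative property names (the actual patch may track a different set, which is exactly the sync problem raised above):

```java
import java.util.Objects;

// Sketch of a plugin-loading cache key: two keys are equal iff the
// plugin-relevant settings are equal, regardless of any other
// configuration values. Field names are illustrative.
public final class PluginProperty {
    private final String folders;   // e.g. the "plugin.folders" value
    private final String includes;  // e.g. the "plugin.includes" value
    private final String excludes;  // e.g. the "plugin.excludes" value

    public PluginProperty(String folders, String includes, String excludes) {
        this.folders = folders;
        this.includes = includes;
        this.excludes = excludes;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PluginProperty)) return false;
        PluginProperty p = (PluginProperty) o;
        return Objects.equals(folders, p.folders)
            && Objects.equals(includes, p.includes)
            && Objects.equals(excludes, p.excludes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(folders, includes, excludes);
    }
}
```

With such a key, two jobs that differ only in something like fetcher.store.content map to the same PluginRepository, which is the behaviour the patch is after.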
Re: Plugins initialized all the time!
Doğacan Güney wrote:
> My patch is just a draft to see if we can create a better caching
> mechanism. There are definitely some rough edges there :)

One important piece of information: in future versions of Hadoop the method Configuration.setObject() is deprecated and will then be removed, so we have to grow our own caching mechanism anyway - either use a singleton cache, or change nearly all APIs to pass around a user/job/task context. So, we will face this problem pretty soon, with the next upgrade of Hadoop.

> You are right about per-plugin parameters, but I think it will be very
> difficult to keep the PluginProperty class in sync with plugin
> parameters. [...] I don't want to compare entire configuration objects,
> because changing irrelevant options, like fetcher.store.content,
> shouldn't force loading plugins again, though it seems it may be
> inevitable.

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only "context" information that they have is what they can read from job.xml (which is a superset of all properties from config files + job-specific data + task-specific data). This context is currently instantiated as a Configuration object, and we (ab)use it also as a local per-JVM cache for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the lifecycle of the JVM (== lifecycle of a single task), so we don't have to worry about having different sets of plugins with different parameters for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which we have to reload plugins within the same JVM, but with different parameters.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
Re: Plugins initialized all the time!
On 5/30/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> One important piece of information: in future versions of Hadoop the
> method Configuration.setObject() is deprecated and will then be
> removed, so we have to grow our own caching mechanism anyway - either
> use a singleton cache, or change nearly all APIs to pass around a
> user/job/task context.

Hmm, well, that sucks, but this is not really a problem for PluginRepository: PluginRepository already has its own cache mechanism.

> In other words, it seems to me that there is no such situation in which
> we have to reload plugins within the same JVM, but with different
> parameters.

The problem is that someone might get a little too smart. For example, one may write a new job with two IndexingFilters, but create each from completely different configuration objects, then filter some documents with the first filter and others with the second. I agree that this is a bit of a reach, but it is possible.

--
Doğacan Güney
Re: Plugins initialized all the time!
On 5/30/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> The problem is that someone might get a little too smart. For example,
> one may write a new job with two IndexingFilters, but create each from
> completely different configuration objects, then filter some documents
> with the first filter and others with the second. I agree that this is
> a bit of a reach, but it is possible.

Actually, thinking a bit further into this, I kind of agree with you. I initially thought that the best approach would be to change PluginRepository.get(Configuration) to PluginRepository.get(), where get() just creates a configuration internally and initializes itself with it. But then we wouldn't be passing the JobConf to PluginRepository; PluginRepository would do something like a NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolás') patch is a not-so-bad way of fixing this. It allows us to pass the JobConf to PluginRepository and stops creating new PluginRepositories again and again...

What do you think?

--
Doğacan Güney
Re: Plugins initialized all the time!
> So, all in all, I've come to believe that my (and Nicolás') patch is a
> not-so-bad way of fixing this. It allows us to pass the JobConf to
> PluginRepository and stops creating new PluginRepositories again and
> again...
>
> What do you think?

IMO a better way would be to add a proper equals() method (and hashCode) to Hadoop's Configuration object that would call getProps().equals(o.getProps()), so that you could use them as keys... Every class which is a map from keys to values has equals & hashCode (Properties, HashMap, etc.).

Another nice thing would be to be able to "freeze" a configuration object, preventing anyone from modifying it.
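The equals/hashCode suggestion can be sketched on a hypothetical Configuration-like class (not Hadoop's actual one): delegate both methods to the backing Properties, so instances holding the same values share one cache slot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// "ValueConf" is an illustrative stand-in for a Configuration with value
// semantics: equals()/hashCode() delegate to the underlying Properties.
public class ValueConf {
    private final Properties props = new Properties();

    public void set(String name, String value) {
        props.setProperty(name, value);
    }

    protected Properties getProps() {
        return props;
    }

    @Override
    public boolean equals(Object o) {
        // Two configurations are equal iff all their property values match.
        return o instanceof ValueConf
            && getProps().equals(((ValueConf) o).getProps());
    }

    @Override
    public int hashCode() {
        return getProps().hashCode();
    }

    public static void main(String[] args) {
        ValueConf a = new ValueConf();
        ValueConf b = new ValueConf();
        a.set("plugin.folders", "plugins");
        b.set("plugin.folders", "plugins");

        Map<ValueConf, String> cache = new HashMap<>();
        cache.put(a, "repo");
        cache.put(b, "repo");
        System.out.println(cache.size()); // prints 1: one repository per value set
    }
}
```

One caveat: a mutable object whose hashCode depends on its contents is a fragile map key, since changing a property after insertion would strand the entry. That is what makes the "freeze" idea a natural companion to this one.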
Re: Plugins initialized all the time!
On 5/31/07, Nicolás Lichtmaier <[EMAIL PROTECTED]> wrote:
> IMO a better way would be to add a proper equals() method (and
> hashCode) to Hadoop's Configuration object that would call
> getProps().equals(o.getProps()), so that you could use them as keys...
>
> Another nice thing would be to be able to "freeze" a configuration
> object, preventing anyone from modifying it.

I found that there is already an issue for this problem - NUTCH-356. I will update it with the most recent discussions.

--
Doğacan Güney
Re: Plugins initialized all the time!
Well, you could always 'freeze' it: just create a decorator for it. So, create a new configuration class (call it ImmutableConfiguration), store the original configuration object in it, and delegate the methods appropriately. Wouldn't that work?

On 6/8/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> I found that there is already an issue for this problem - NUTCH-356. I
> will update it with the most recent discussions.

--
"Conscious decisions by conscious minds are what make reality real"
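The decorator idea could look roughly like this. The class name ImmutableConfiguration is from the mail; everything else is illustrative, with java.util.Properties standing in for Hadoop's Configuration:

```java
import java.util.Properties;

// Sketch of a freezing decorator: wrap a configuration, pass reads
// through to the delegate, and turn every write into an error.
public class ImmutableConfiguration {
    private final Properties delegate;

    public ImmutableConfiguration(Properties delegate) {
        this.delegate = delegate;
    }

    public String get(String name) {
        // Reads delegate to the wrapped configuration unchanged.
        return delegate.getProperty(name);
    }

    public void set(String name, String value) {
        // "Freezing": any attempt to modify the wrapped config fails fast,
        // so cached plugin repositories can trust their key never mutates.
        throw new UnsupportedOperationException("configuration is frozen");
    }
}
```

Usage would be to wrap the Configuration right before handing it to the cache, so later code cannot silently invalidate the key.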
Re: Plugins initialized all the time!
I should have used the word "encapsulate" instead of "store". :-)

On 6/8/07, Briggs <[EMAIL PROTECTED]> wrote:
> Well, you could always 'freeze' it: just create a decorator for it.
> So, create a new configuration class (call it ImmutableConfiguration),
> store the original configuration object in it, and delegate the
> methods appropriately. Wouldn't that work?

--
"Conscious decisions by conscious minds are what make reality real"