Re: [Nutch-dev] Plugins initialized all the time!

2007-06-08 Thread Briggs

Well, you could always 'freeze' it: just create a decorator for it. So,
create a new Configuration (call it ImmutableConfiguration), store the
original configuration object in it, and delegate the methods appropriately.
Wouldn't that work?
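
For illustration, a minimal sketch of such a decorator, assuming only
Configuration's basic get(String)/set(String, String) accessors; the class
name and constructor are made up here, and any other mutators would need the
same treatment:

  import org.apache.hadoop.conf.Configuration;

  // Sketch only: wraps an existing Configuration and rejects writes.
  public class ImmutableConfiguration extends Configuration {
    private final Configuration delegate;

    public ImmutableConfiguration(Configuration delegate) {
      this.delegate = delegate;
    }

    @Override
    public String get(String name) {
      return delegate.get(name);   // reads pass through to the wrapped object
    }

    @Override
    public void set(String name, String value) {
      throw new UnsupportedOperationException("configuration is frozen");
    }
  }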




On 6/8/07, Doğacan Güney [EMAIL PROTECTED] wrote:


On 5/31/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

  Actually thinking a bit further into this, I kind of agree with you. I
  initially thought that the best approach would be to change
  PluginRepository.get(Configuration) to PluginRepository.get() where
  get() just creates a configuration internally and initializes itself
  with it. But then we wouldn't be passing JobConf to PluginRepository
  but PluginRepository would do something like a
  NutchConfiguration.create(), which is probably wrong.
 
  So, all in all, I've come to believe that my (and Nicolas') patch is a
  not-so-bad way of fixing this. It allows us to pass JobConf to
  PluginRepository and stops creating new PluginRepository-s again and
  again...
 
  What do you think?

 IMO a better way would be to add a proper equals() method to Hadoop's
 Configuration object (and hashCode()) that would call
 getProps().equals(o.getProps()), so that you could use them as keys...
 Every class which is a map from keys to values has equals & hashCode
 (Properties, HashMap, etc.).

 Another nice thing would be to be able to freeze a configuration
 object, preventing anyone from modifying it.



I found that there is already an issue for this problem - NUTCH-356. I
will update it with the most recent discussions.

--
Doğacan Güney





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-dev] Plugins initialized all the time!

2007-06-08 Thread Briggs
I should have used the word encapsulate instead of store.  :-)

On 6/8/07, Briggs [EMAIL PROTECTED] wrote:
 Well, you could always 'freeze' it: just create a decorator for it. So,
 create a new Configuration (call it ImmutableConfiguration), store the
 original configuration object in it, and delegate the methods appropriately.
 Wouldn't that work?





 On 6/8/07, Doğacan Güney [EMAIL PROTECTED] wrote:
  On 5/31/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
  
Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.
   
So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...
   
What do you think?
  
   IMO a better way would be to add a proper equals() method to Hadoop's
   Configuration object (and hashCode()) that would call
   getProps().equals(o.getProps()), so that you could use them as keys...
   Every class which is a map from keys to values has equals & hashCode
   (Properties, HashMap, etc.).
  
   Another nice thing would be to be able to freeze a configuration
   object, preventing anyone from modifying it.
  
  
 
  I found that there is already an issue for this problem - NUTCH-356. I
  will update it with the most recent discussions.
 
  --
  Doğacan Güney
 



 --

 Conscious decisions by conscious minds are what make reality real


-- 
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-31 Thread Doğacan Güney
On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
  Doğacan Güney wrote:
 
   My patch is just a draft to see if we can create a better caching
   mechanism. There are definitely some rough edges there:)
 
  One important piece of information: in future versions of Hadoop the method
  Configuration.setObject() is deprecated and will then be removed, so we
  have to grow our own caching mechanism anyway - either use a singleton
  cache, or change nearly all APIs to pass around a user/job/task context.
 
  So, we will face this problem pretty soon, with the next upgrade of Hadoop.

 Hmm, well, that sucks, but this is not really a problem for
 PluginRepository: PluginRepository already has its own cache
 mechanism.

 
 
 
   You are right about per-plugin parameters, but I think it will be very
   difficult to keep the PluginProperty class in sync with plugin parameters.
   I mean, if a plugin defines a new parameter, we have to remember to
   update PluginProperty. Perhaps we can force plugins to define the
   configuration options they will use in, say, their plugin.xml files, but
   that will be very error-prone too. I don't want to compare entire
   configuration objects, because changing irrelevant options, like
   fetcher.store.content, shouldn't force loading plugins again, though it
   seems it may be inevitable.
 
  Let me see if I understand this ... In my opinion this is a non-issue.
 
  Child tasks are started in separate JVMs, so the only context
  information that they have is what they can read from job.xml (which is
  a superset of all properties from config files + job-specific data +
  task-specific data). This context is currently instantiated as a
  Configuration object, and we (ab)use it also as a local per-JVM cache
  for plugin instances and other objects.
 
  Once we instantiate the plugins, they exist unchanged throughout the
  lifecycle of JVM (== lifecycle of a single task), so we don't have to
  worry about having different sets of plugins with different parameters
  for different jobs (or even tasks).
 
  In other words, it seems to me that there is no such situation in which
  we have to reload plugins within the same JVM, but with different
  parameters.

 The problem is that someone might get a little too smart. For example, one
 may write a new job that has two IndexingFilters but creates each from a
 completely different configuration object, then filters some documents
 with the first filter and others with the second. I agree that this is a
 bit of a reach, but it is possible.

Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...

What do you think?
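
For reference, a minimal sketch of the kind of cached lookup being discussed;
PluginProperty is the key class from the patch, and the two constructors are
assumptions here (needs java.util.HashMap and java.util.Map):

  // Sketch only: one PluginRepository per distinct set of plugin-related
  // properties, so repeated calls with equivalent JobConf/Configuration
  // objects reuse the same instance instead of re-initializing the plugins.
  private static final Map<PluginProperty, PluginRepository> CACHE =
      new HashMap<PluginProperty, PluginRepository>();

  public static synchronized PluginRepository get(Configuration conf) {
    PluginProperty key = new PluginProperty(conf);   // assumed constructor
    PluginRepository repo = CACHE.get(key);
    if (repo == null) {
      repo = new PluginRepository(conf);             // assumed constructor
      CACHE.put(key, repo);
    }
    return repo;
  }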



 
  --
  Best regards,
  Andrzej Bialecki 
___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 


 --
 Doğacan Güney



-- 
Doğacan Güney


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-31 Thread Nicolás Lichtmaier

 Actually thinking a bit further into this, I kind of agree with you. I
 initially thought that the best approach would be to change
 PluginRepository.get(Configuration) to PluginRepository.get() where
 get() just creates a configuration internally and initializes itself
 with it. But then we wouldn't be passing JobConf to PluginRepository
 but PluginRepository would do something like a
 NutchConfiguration.create(), which is probably wrong.

 So, all in all, I've come to believe that my (and Nicolas') patch is a
 not-so-bad way of fixing this. It allows us to pass JobConf to
 PluginRepository and stops creating new PluginRepository-s again and
 again...

 What do you think?

IMO a better way would be to add a proper equals() method to Hadoop's
Configuration object (and hashCode()) that would call
getProps().equals(o.getProps()), so that you could use them as keys...
Every class which is a map from keys to values has equals & hashCode
(Properties, HashMap, etc.).
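
For illustration, a rough sketch of what those methods could look like if
added to Hadoop's Configuration class itself (getProps() is the accessor
mentioned above; it may not be visible outside the class):

  // Sketch only: value equality for Configuration based on its underlying
  // Properties, so two instances with identical settings make equal keys.
  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Configuration)) return false;
    return getProps().equals(((Configuration) o).getProps());
  }

  @Override
  public int hashCode() {
    return getProps().hashCode();
  }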

Another nice thing would be to be able to freeze a configuration 
object, preventing anyone from modifying it.




Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney
Hi,

On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

  Which job causes the problem? Perhaps, we can find out what keeps
  creating a conf object over and over.
 
  Also, I have tried what you have suggested (better caching for plugin
  repository) and it really seems to make a difference. Can you try with
  this patch(*) to see if it solves your problem?
 
  (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

 Some comments about your patch. The approach seems nice: you only check
 the parameters that affect plugin loading. But keep in mind that the
 plugins themselves will configure themselves with many other parameters,
 so to keep things safe there should be a PluginRepository for each set
 of parameters (including all of them). Besides, remember that CACHE is a
 WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
 something doesn't look right... the lifespan of those objects will be
 much shorter than you require. Perhaps you should be using
 SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
 that easily).

My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)

I don't really worry about WeakHashMap-LinkedHashMap stuff. But your
approach is simple and should be faster so I guess it's OK.

You are right about per-plugin parameters, but I think it will be very
difficult to keep the PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps we can force plugins to define the
configuration options they will use in, say, their plugin.xml files, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options, like
fetcher.store.content, shouldn't force loading plugins again, though it
seems it may be inevitable.


 Anyway, I'll try to build my own Nutch to test your patch.

 Thanks!




-- 
Doğacan Güney


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Andrzej Bialecki
Doğacan Güney wrote:

 My patch is just a draft to see if we can create a better caching
 mechanism. There are definitely some rough edges there:)

One important piece of information: in future versions of Hadoop the method
Configuration.setObject() is deprecated and will then be removed, so we
have to grow our own caching mechanism anyway - either use a singleton
cache, or change nearly all APIs to pass around a user/job/task context.

So, we will face this problem pretty soon, with the next upgrade of Hadoop.



 You are right about per-plugin parameters, but I think it will be very
 difficult to keep the PluginProperty class in sync with plugin parameters.
 I mean, if a plugin defines a new parameter, we have to remember to
 update PluginProperty. Perhaps we can force plugins to define the
 configuration options they will use in, say, their plugin.xml files, but
 that will be very error-prone too. I don't want to compare entire
 configuration objects, because changing irrelevant options, like
 fetcher.store.content, shouldn't force loading plugins again, though it
 seems it may be inevitable.

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context 
information that they have is what they can read from job.xml (which is 
a superset of all properties from config files + job-specific data + 
task-specific data). This context is currently instantiated as a 
Configuration object, and we (ab)use it also as a local per-JVM cache 
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the 
lifecycle of JVM (== lifecycle of a single task), so we don't have to 
worry about having different sets of plugins with different parameters 
for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which 
we have to reload plugins within the same JVM, but with different 
parameters.

-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney
On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  My patch is just a draft to see if we can create a better caching
  mechanism. There are definitely some rough edges there:)

 One important piece of information: in future versions of Hadoop the method
 Configuration.setObject() is deprecated and will then be removed, so we
 have to grow our own caching mechanism anyway - either use a singleton
 cache, or change nearly all APIs to pass around a user/job/task context.

 So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.




  You are right about per-plugin parameters, but I think it will be very
  difficult to keep the PluginProperty class in sync with plugin parameters.
  I mean, if a plugin defines a new parameter, we have to remember to
  update PluginProperty. Perhaps we can force plugins to define the
  configuration options they will use in, say, their plugin.xml files, but
  that will be very error-prone too. I don't want to compare entire
  configuration objects, because changing irrelevant options, like
  fetcher.store.content, shouldn't force loading plugins again, though it
  seems it may be inevitable.

 Let me see if I understand this ... In my opinion this is a non-issue.

 Child tasks are started in separate JVMs, so the only context
 information that they have is what they can read from job.xml (which is
 a superset of all properties from config files + job-specific data +
 task-specific data). This context is currently instantiated as a
 Configuration object, and we (ab)use it also as a local per-JVM cache
 for plugin instances and other objects.

 Once we instantiate the plugins, they exist unchanged throughout the
 lifecycle of JVM (== lifecycle of a single task), so we don't have to
 worry about having different sets of plugins with different parameters
 for different jobs (or even tasks).

 In other words, it seems to me that there is no such situation in which
 we have to reload plugins within the same JVM, but with different
 parameters.

The problem is that someone might get a little too smart. For example, one
may write a new job that has two IndexingFilters but creates each from a
completely different configuration object, then filters some documents
with the first filter and others with the second. I agree that this is a
bit of a reach, but it is possible.



 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney
Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
 I'm having big troubles with nutch 0.9 that I hadn't with 0.8. It seems
 that the plugin repository initializes itself all the time until I get
 an out of memory exception. I've been seeing the code... the plugin
 repository maintains a map from Configuration to plugin repositories, but
 the Configuration object does not have an equals or hashCode method...
 wouldn't it be nice to add such a method (comparing property values)?
 Wouldn't that help prevent initializing many plugin repositories? What
 could be the cause of my problem? (Aaah.. so many questions... =) )

Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


 Bye!



-- 
Doğacan Güney


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Briggs
I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and they never get unloaded
(they are loaded within their own classloader). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.
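
For illustration, a rough sketch of that singleton approach for one such
loader; the holder class, its field, and getInstance() are made up here,
while URLFilters and NutchConfiguration are the existing Nutch classes
(their constructor and factory method are assumed from that era's API):

  // Sketch only: one per-JVM URLFilters instance, built once from a single
  // configuration, instead of a new instance per Configuration object.
  public final class URLFiltersHolder {
    private static URLFilters instance;

    private URLFiltersHolder() {}

    public static synchronized URLFilters getInstance() {
      if (instance == null) {
        instance = new URLFilters(NutchConfiguration.create());
      }
      return instance;
    }
  }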

You'll find the heart of the problem somewhere in the extension point
class(es). It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.





On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 Hi,

 On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
  I'm having big troubles with nutch 0.9 that I hadn't with 0.8. It seems
  that the plugin repository initializes itself all the time until I get
  an out of memory exception. I've been seeing the code... the plugin
  repository maintains a map from Configuration to plugin repositories, but
  the Configuration object does not have an equals or hashCode method...
  wouldn't it be nice to add such a method (comparing property values)?
  Wouldn't that help prevent initializing many plugin repositories? What
  could be the cause of my problem? (Aaah.. so many questions... =) )

 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

 
  Bye!
 


 --
 Doğacan Güney



-- 
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney
On 5/29/07, Briggs [EMAIL PROTECTED] wrote:
 I have also noticed this. The code explicitly loads an instance of the
 plugins for every fetch (well, or parse etc., depending on what you
 are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
 you can see the filter classes get loaded and they never get unloaded
 (they are loaded within their own classloader). So, you'll see the
 same class loaded thousands of times, which is bad.

 So, in my case, I had to change the way the plugins are loaded.
 Basically, I changed all the main plugin loaders (like
 URLFilters.java, IndexFilters.java) to be singletons with a single
 'getInstance()' method on each. I don't need special configs for
 filters so I can deal with singletons.

 You'll find the heart of the problem somewhere in the extension point
 class(es). It calls newInstance() an awful lot. But the classloader
 (one per plugin) never gets destroyed, or something, so this can be
 nasty.

 I'm still dealing with my OutOfMemory errors on parsing, yuck.

Well then can you test the patch too? Nicolas's idea seems to be the
right one. After this patch, I think plugin loaders will see the same
PluginRepository instance.






 On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
  Hi,
 
  On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
   I'm having big troubles with nutch 0.9 that I hadn't with 0.8. It seems
   that the plugin repository initializes itself all the time until I get
   an out of memory exception. I've been seeing the code... the plugin
   repository maintains a map from Configuration to plugin repositories, but
   the Configuration object does not have an equals or hashCode method...
   wouldn't it be nice to add such a method (comparing property values)?
   Wouldn't that help prevent initializing many plugin repositories? What
   could be the cause of my problem? (Aaah.. so many questions... =) )
 
  Which job causes the problem? Perhaps, we can find out what keeps
  creating a conf object over and over.
 
  Also, I have tried what you have suggested (better caching for plugin
  repository) and it really seems to make a difference. Can you try with
  this patch(*) to see if it solves your problem?
 
  (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
 
  
   Bye!
  
 
 
  --
  Doğacan Güney
 


 --
 Conscious decisions by conscious minds are what make reality real



-- 
Doğacan Güney


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Briggs
I'll have to get around to trying this in the future. I have already
'forked' the code, but would like to get back on track too. So I guess
I will post something, someday. The plugin part is now the least of my
worries. Again, the parsing is what is killing me now. I don't use
Nutch in the 'out-of-the-box' fashion. My app is running in a container
that crawls when messages to crawl are received.

On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 On 5/29/07, Briggs [EMAIL PROTECTED] wrote:
  I have also noticed this. The code explicitly loads an instance of the
  plugins for every fetch (well, or parse etc., depending on what you
  are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
  you can see the filter classes get loaded and they never get unloaded
  (they are loaded within their own classloader). So, you'll see the
  same class loaded thousands of times, which is bad.
 
  So, in my case, I had to change the way the plugins are loaded.
  Basically, I changed all the main plugin loaders (like
  URLFilters.java, IndexFilters.java) to be singletons with a single
  'getInstance()' method on each. I don't need special configs for
  filters so I can deal with singletons.
 
  You'll find the heart of the problem somewhere in the extension point
  class(es). It calls newInstance() an awful lot. But the classloader
  (one per plugin) never gets destroyed, or something, so this can be
  nasty.
 
  I'm still dealing with my OutOfMemory errors on parsing, yuck.

 Well then can you test the patch too? Nicolas's idea seems to be the
 right one. After this patch, I think plugin loaders will see the same
 PluginRepository instance.

 
 
 
 
 
  On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
   Hi,
  
   On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
I'm having big troubles with nutch 0.9 that I hadn't with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been seeing the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )
  
   Which job causes the problem? Perhaps, we can find out what keeps
   creating a conf object over and over.
  
   Also, I have tried what you have suggested (better caching for plugin
   repository) and it really seems to make a difference. Can you try with
   this patch(*) to see if it solves your problem?
  
   (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
  
   
Bye!
   
  
  
   --
   Doğacan Güney
  
 
 
  --
  Conscious decisions by conscious minds are what make reality real
 


 --
 Doğacan Güney



-- 
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier

 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Some comments about your patch. The approach seems nice: you only check
the parameters that affect plugin loading. But keep in mind that the
plugins themselves will configure themselves with many other parameters,
so to keep things safe there should be a PluginRepository for each set
of parameters (including all of them). Besides, remember that CACHE is a
WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
something doesn't look right... the lifespan of those objects will be
much shorter than you require. Perhaps you should be using
SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
that easily).
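
For illustration, a rough sketch of the SoftReference alternative mentioned
above; the key and value names follow the patch under discussion, and the
PluginRepository(Configuration) constructor is an assumption here (needs
java.lang.ref.SoftReference, java.util.HashMap, java.util.Map):

  // Sketch only: values held behind SoftReferences stay cached while memory
  // allows but can be reclaimed under pressure, avoiding the problem of a
  // WeakHashMap keyed by short-lived, ad-hoc PluginProperty objects.
  private static final Map<PluginProperty, SoftReference<PluginRepository>> CACHE =
      new HashMap<PluginProperty, SoftReference<PluginRepository>>();

  public static synchronized PluginRepository get(PluginProperty key,
                                                  Configuration conf) {
    SoftReference<PluginRepository> ref = CACHE.get(key);
    PluginRepository repo = (ref == null) ? null : ref.get();
    if (repo == null) {                    // never cached, or already collected
      repo = new PluginRepository(conf);
      CACHE.put(key, new SoftReference<PluginRepository>(repo));
    }
    return repo;
  }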

Anyway, I'll try to build my own Nutch to test your patch.

Thanks!




Re: [Nutch-dev] Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier

 I'm having big troubles with nutch 0.9 that I hadn't with 0.8. It seems
 that the plugin repository initializes itself all the time until I get
 an out of memory exception. I've been seeing the code... the plugin
 repository maintains a map from Configuration to plugin repositories, but
 the Configuration object does not have an equals or hashCode method...
 wouldn't it be nice to add such a method (comparing property values)?
 Wouldn't that help prevent initializing many plugin repositories? What
 could be the cause of my problem? (Aaah.. so many questions... =) )

 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

I'm running it. So far it's working ok, and I haven't seen all those 
plugin loadings...

I've modified your patch though to define CACHE like this:

  private static final Map<PluginProperty, PluginRepository> CACHE =
      new LinkedHashMap<PluginProperty, PluginRepository>() {
        @Override
        protected boolean removeEldestEntry(
            Entry<PluginProperty, PluginRepository> eldest) {
          return size() > 10;
        }
      };

...which means an LRU cache with a fixed size of 10.




Re: [Nutch-dev] Plugins initialized all the time!

2007-05-28 Thread Nicolás Lichtmaier

More info...

I see the map phase progressing from 0% to 100%. It seems to reload plugins
when reaching 100%. Besides, I've realized that each NutchJob is a
Configuration, so (as there's no equals) a plugin repo would be
created for each NutchJob...

