----- Original Message ----- 
From: "Doğacan Güney" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, June 08, 2007 8:27 PM
Subject: Re: Loading mechanism of plugin classes and singleton objects

[...]
>> This is strange, because, as you can see below, the strings that
>> make up the keys and values of conf appear unchanged. Perhaps we
>> should override the equals() method in
>> org.apache.hadoop.conf.Configuration (invoked by CACHE.get(),
>> according to the specs of the java.util.Map interface), so that the
>> hashCode()s of the keys get ignored, and conf1.equals(conf2) returns
>> true if and only if:
>>
>>  1. conf1.size() == conf2.size(),
>>
>>  2. for each key k1 of conf1 there is a key k2 in conf2 such that:
>>   2.1 k1.equals(k2)
>>   2.2 conf1.get(k1).equals(conf2.get(k2))
>
> This has been suggested before and I have to say I don't like this
> one, because it means that each call to PluginRepository.get(conf)
> will end up comparing all key/value pairs, which, IMO, is excessive
> (because, if I am not mistaken, we don't need this when running Nutch
> in a distributed environment). Unfortunately, this may be the only
> way to fix this leak.

Well, after all, how often would this overhead actually be incurred? I logged
just a few calls (admittedly, on short runs), but even thousands of
occurrences would only consume a negligible amount of CPU time.
A more serious problem is that an implementation of equals() that returns
true when the two hashCodes differ violates the specification of
Object.hashCode():

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#hashCode()
"If two objects are equal according to the equals(Object) method, then
calling the hashCode method on each of the two objects must produce the same
integer result."
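
A minimal, hypothetical sketch of what goes wrong (BrokenKey and lookup() are
made-up names, not Nutch code): when equals() can return true for two objects
whose hashCodes differ, the "equal" key simply gets lost in hash-based
lookups.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Nutch code): a key type whose equals() can
// return true for two objects with different hashCodes, violating the
// Object.hashCode() contract quoted above.
public class BrokenKeyDemo {

    static final class BrokenKey {
        private static int counter = 0;
        final String name;
        final int hash = counter++; // every instance gets a different hashCode

        BrokenKey(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).name.equals(name);
        }
        @Override public int hashCode() { return hash; }
    }

    // Two "equal" keys land in different buckets, so the lookup misses.
    static String lookup() {
        Map<BrokenKey, String> cache = new HashMap<>();
        cache.put(new BrokenKey("conf"), "pluginRepository");
        return cache.get(new BrokenKey("conf")); // null despite equals() == true
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // prints "null"
    }
}
```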

> This is probably not a good idea, but here it goes: perhaps we can
> change the LocalJobRunner.Job.run method. First, it clones the JobConf
> object (clonedConf). It then runs a MapTask with the original
> JobConf. Upon completion, it copies everything from clonedConf back to
> the original JobConf. This way, the original JobConf's hashCode won't
> change, so there should be no leak.

What if we simply rework the PluginRepository.get() method, first checking
whether a key "deeply" equal to conf (i.e., with the same key/value pairs but
possibly a different hashCode) is already in CACHE, and, if so and its
hashCode differs, using that key as the parameter of the get() method in
place of conf? (Within a synchronized block, of course.) Probably the
cleanest way to do that is to encapsulate this behavior in a
"hashCodeObliviousGet()" method of a subclass of Hashtable, and use that
subclass for CACHE. That way we remain "specs-ly correct", because we don't
override common methods (get() or equals()) with implementations that are
not compliant with the specs.
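
A rough sketch of what I have in mind, using java.util.Properties as a
stand-in for org.apache.hadoop.conf.Configuration (the class and method names
are of course provisional):

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.Properties;

// Provisional sketch of the "hashCodeObliviousGet" idea: a Hashtable
// subclass for CACHE that falls back to a linear scan comparing
// key/value pairs when the ordinary hash lookup misses. Properties
// stands in for org.apache.hadoop.conf.Configuration here.
public class HashCodeObliviousCache<V> extends Hashtable<Properties, V> {

    // Deep equality in the sense of the criteria quoted above: same
    // size, and for each key of c1 an equal key in c2 with an equal value.
    private static boolean deeplyEqual(Properties c1, Properties c2) {
        if (c1.size() != c2.size()) return false;
        for (Map.Entry<Object, Object> e : c1.entrySet()) {
            Object v2 = c2.get(e.getKey());
            if (v2 == null || !v2.equals(e.getValue())) return false;
        }
        return true;
    }

    // Like get(), but oblivious to the key's current hashCode: if the
    // plain lookup misses, scan for a deeply equal key already in the
    // table and use its cached value instead.
    public synchronized V hashCodeObliviousGet(Properties conf) {
        V hit = super.get(conf);
        if (hit != null) return hit;
        for (Map.Entry<Properties, V> e : entrySet()) {
            if (deeplyEqual(e.getKey(), conf)) return e.getValue();
        }
        return null;
    }
}
```

The point being that after the cached key mutates (its hashCode having
changed), a plain get() may miss, while hashCodeObliviousGet() still finds
the existing entry instead of letting a second repository be created.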

Also: do you agree with me that we should avoid caches based on weak
references (like the current WeakHashMap) if we want to be protected against
multiple instances of plugins? I can't see a compelling reason for a
garbage-collected CACHE, considering that the number of items stored in it
can't grow very large (how many different configurations can users ever
create?).
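
For what it's worth, the weak-reference behavior is easy to demonstrate (a
made-up sketch, with a plain Object standing in for a Configuration):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the concern above: with a WeakHashMap-based CACHE, an entry
// (and the plugin repository stored under it) only survives as long as
// someone else holds a strong reference to the key.
public class WeakCacheDemo {

    // While we hold a strong reference to the key, GC must not evict it.
    static boolean survivesWhileReferenced() {
        Map<Object, String> cache = new WeakHashMap<>();
        Object conf = new Object(); // stand-in for a Configuration
        cache.put(conf, "pluginRepository");
        System.gc();
        return cache.containsKey(conf);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(survivesWhileReferenced()); // prints "true"

        Map<Object, String> cache = new WeakHashMap<>();
        cache.put(new Object(), "pluginRepository"); // no strong ref kept
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();         // typically clears the weakly held entry...
            Thread.sleep(10);
        }
        System.out.println(cache.isEmpty()); // ...usually true after GC
    }
}
```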

Enzo


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
