On 6/8/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
> ----- Original Message -----
> From: "Doğacan Güney" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, June 08, 2007 8:27 PM
> Subject: Re: Loading mechanism of plugin classes and singleton objects
>
> [...]
> >> This is strange, because, as you can see below, the strings that
> >> make up the keys and values of conf appear unchanged. Perhaps we
> >> should override the equals() method in
> >> org.apache.hadoop.conf.Configuration (invoked by CACHE.get(),
> >> according to the specs of the java.util.Map interface), so that the
> >> hashCode()s of the keys get ignored, and conf1.equals(conf2) returns
> >> true if and only if:
> >>
> >>  1. conf1.size() == conf2.size(),
> >>
> >>  2. for each key k1 of conf1 there is a key k2 in conf2 such that:
> >>   2.1 k1.equals(k2)
> >>   2.2 conf1.get(k1).equals(conf2.get(k2))
> >
> > This has been suggested before, and I have to say I don't like this
> > one, because it means that each call to PluginRepository.get(conf)
> > will end up comparing all key/value pairs, which, IMO, is excessive
> > (if I am not mistaken, we don't need this when running Nutch in a
> > distributed environment). Unfortunately, this may be the only way to
> > fix this leak.
>
> Well, after all how often can this overhead ever happen? I logged just a few
> calls (admittedly, on short runs) but even thousands of occurrences would
> only consume a negligible amount of CPU time.

There are a couple of places where this overhead would cost us right
now. For example, the current ParseSegment code, around line 76:

      parseResult = new ParseUtil(getConf()).parse(content);

This is called for every record, so we would end up doing a lot of
comparisons. Obviously we can fix that call site, but I don't like the
idea of a get method carrying this kind of overhead. Anyway, that is
just my personal preference, so if people feel strongly about it,
"deep comparison" is fine with me.

> A more serious problem is that an implementation of equals() that can
> return true even when the two hashCodes differ violates the
> specification of Object.hashCode():
>
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#hashCode()
> "If two objects are equal according to the equals(Object) method, then
> calling the hashCode method on each of the two objects must produce the same
> integer result."

We can just update Configuration.hashCode to compute the hash by
summing the hashCodes of all key/value pairs. That should keep it
consistent with equals(), shouldn't it?
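
Something like this, I mean (a self-contained sketch; the real
Configuration keeps its data in a Properties object, so the map field
here is just a stand-in):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a deep equals()/hashCode() pair for a Configuration-like
// class. The props field is an assumption, not the real
// org.apache.hadoop.conf.Configuration internals.
class Conf {
  private final Map<String, String> props = new HashMap<String, String>();

  public void set(String key, String value) {
    props.put(key, value);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Conf)) return false;
    // deep comparison: same size and same key/value pairs
    return props.equals(((Conf) o).props);
  }

  @Override
  public int hashCode() {
    int sum = 0;
    // sum of entry hashCodes (each defined by java.util.Map.Entry as
    // key ^ value), so deeply equal objects hash the same, satisfying
    // the Object.hashCode() contract
    for (Map.Entry<String, String> e : props.entrySet()) {
      sum += e.getKey().hashCode() ^ e.getValue().hashCode();
    }
    return sum;
  }
}
```

In fact java.util.Map already specifies its own hashCode() as exactly
this sum over entries, so the loop could simply be replaced by
`return props.hashCode();`.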

>
> > This is probably not a good idea, but here it goes: perhaps we can
> > change the LocalJobRunner.Job.run method. First, it clones the
> > JobConf object (clonedConf). It then runs a MapTask with the
> > original JobConf. Upon completion, it copies everything from
> > clonedConf back to the original JobConf. This way, the original
> > JobConf's hashCode won't change, so there should be no leak.
>
> What if we simply rework the PluginRepository.get() method, first
> checking whether a key "deeply" equal to conf (i.e., with the same
> key/value pairs but possibly a different hashCode) is already in
> CACHE, and in that case, if the hashCode is different, using that key
> as the parameter of the get() method in place of conf? (Within a
> synchronized block, of course.) Probably the cleanest way to do that
> is to encapsulate this behavior in a "hashCodeObliviousGet()" method
> of a subclass of Hashtable, and use that subclass for CACHE. That way
> we remain "specs-ly correct", because we won't override common
> methods (get() or equals()) with implementations that are not
> compliant with the specs.
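
For what it's worth, I think that idea would look roughly like this (a
self-contained sketch; ConfKey is a stand-in for Configuration, not
the real class):

```java
import java.util.Hashtable;
import java.util.Map;

// Stand-in for Configuration: it inherits Object's identity-based
// equals()/hashCode(), but exposes its contents for deep comparison.
class ConfKey {
  final Map<String, String> props;

  ConfKey(Map<String, String> props) {
    this.props = props;
  }
}

// Sketch of the proposed CACHE subclass: hashCodeObliviousGet() first
// scans for a key whose contents deeply equal conf's (hashCode
// ignored) and, if one exists, uses that key for the ordinary lookup.
class PluginCache extends Hashtable<ConfKey, Object> {
  public synchronized Object hashCodeObliviousGet(ConfKey conf) {
    for (ConfKey key : keySet()) {
      if (key.props.equals(conf.props)) {
        return get(key); // reuse the cached key in place of conf
      }
    }
    return get(conf); // no deeply-equal key: normal (likely missing) lookup
  }
}
```

The linear scan is exactly the per-call cost discussed above, but it
stays out of get() and equals(), so the standard contracts are
untouched.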
>
> Also: do you agree with me that we should avoid caches based on weak
> references (like the current WeakHashMap) if we want to be protected against
> multiple instances of plugins? I can't see a compelling reason for a
> garbage-collected CACHE, considering that the number of items stored in it
> can't grow that large (how many different configurations can users ever
> create?)

I think so too. When a map task ends and another begins, there are no
strong references left to the previous task's configuration object, so
it may be garbage-collected. Nicolas Lichtmaier has a patch for this
that changes the WeakHashMap into a form of LRU map.
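
For reference, the usual JDK way to get an LRU map is a LinkedHashMap
in access order with removeEldestEntry() overridden (just a sketch; I
don't know whether that is exactly what the patch does):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU cache on top of LinkedHashMap: access-ordered, and
// removeEldestEntry() evicts the least recently used entry once the
// map grows past a fixed capacity.
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  public LruCache(int capacity) {
    // third argument true = access order, so get() refreshes an entry
    super(16, 0.75f, true);
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity;
  }
}
```

Unlike the WeakHashMap, entries here survive garbage collection and
only disappear when pushed out by more recent ones.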

BTW, this problem has been discussed before (most recently at
http://www.nabble.com/Plugins-initialized-all-the-time ). There is
even an open issue for it: NUTCH-356. I suggest we move our discussion
there so that we can all work on this together and fix it once and for
all. I will update the issue with the most recent discussions.

>
> Enzo
>
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general