On Mon, Jun 11, 2012 at 01:40:05PM +0000, [email protected] wrote:
> http://defect.opensolaris.org/bz/show_bug.cgi?id=19215
> 
> --- Comment #15 from Gustavo Lopes <[email protected]> 2012-06-11 
> 13:39:58 UTC ---
> Created attachment 4665
>   --> http://defect.opensolaris.org/bz/attachment.cgi?id=4665
> 
> This is my reply to the code review. Unfortunately, Jens' mail server rejected
> my e-mail (said it was spam) and the message to opengrok-dev was held in
> moderation.

Yepp, Spam[A] == autolearned. The cause was the wolframalpha.com link - the
DNS registrar it uses has a bad reputation ... ;-)

Gustavo Lopes wrote:
> On Mon, 11 Jun 2012 07:41:44 +0200, Jens Elkner wrote:
> > Wrt. the webrev/Gustavo's patches:
...
> > RuntimeEnvironment.getInstance()[.getConfiguration()].getSearchPool[Size]()
> >      Furthermore, even if obtained via RuntimeEnvironment, shouldn't
> >      lazy instantiation be preferred?
> >      IMHO: no changes required here
> 
> I'm not sure I understand. Lazy instantiation of an int?

Yes - I was a little bit lazy wrt. typing and used [] to mark the
optional parts -> many combinations ;-) But from a quick skim of the
new changes I saw that things are going in the right direction ...

Here are my thoughts on it, but don't feel bound by them - I'm just a
contributor like you :-)

Wrt. RE I think it would be better to do something like this:
    ...
        boolean needNewSC = conditions ...
        this.configuration = configuration;
        register();
        if (needNewSC) {
                ... oldSC = sc;
                sc = null;
                oldSC.shutdown();
        }
        ...

AFAICS there is no need for synchronization and also no need to wait
until the old one is actually down - just fire and forget.
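In (untested) Java, that fire-and-forget swap might look like the sketch below. All class and method names here are hypothetical stand-ins, not OpenGrok's actual API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the real SearcherCache: all it owns here
// is a thread pool that must be shut down when the cache is retired.
class SearcherCacheStub {
    private final ExecutorService pool;
    SearcherCacheStub(int poolSize) {
        pool = Executors.newFixedThreadPool(poolSize);
    }
    void shutdown() { pool.shutdown(); }   // asynchronous: returns immediately
    boolean isShutdown() { return pool.isShutdown(); }
}

class RuntimeEnvironmentSketch {
    private SearcherCacheStub sc;
    private int poolSize = -1;

    // Swap in the new configuration; if the relevant settings changed,
    // replace the cache and retire the old one fire-and-forget, i.e.
    // without waiting until it is actually down.
    void setConfiguration(int newPoolSize) {
        boolean needNewSC = (newPoolSize != poolSize);
        poolSize = newPoolSize;
        if (needNewSC) {
            SearcherCacheStub oldSC = sc;
            sc = new SearcherCacheStub(newPoolSize);
            if (oldSC != null) {
                oldSC.shutdown();          // no join, no wait
            }
        }
    }

    SearcherCacheStub getCache() { return sc; }
}
```

The point is that nobody blocks on the retired cache - its pool drains and dies on its own.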

BTW: it might be worth thinking about lazy init of the SC as well, i.e.
just set the pool size and initialize the thread pool/factory when it is
needed for the first time ...
Also I would avoid syncing on (this) - sooner or later it strikes back.
Syncing on a separate object/using real locking is usually safer ...
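Both points combined could look like this sketch (names made up) - lazy pool creation, guarded by a private lock object that no outside code can ever synchronize on:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Lazy initialization of the thread pool, guarded by a private lock
// object rather than synchronized(this): external code cannot interfere
// with a lock it cannot reach.
class LazyPool {
    private final Object lock = new Object();
    private volatile ExecutorService pool;
    private final int poolSize;

    LazyPool(int poolSize) { this.poolSize = poolSize; }

    ExecutorService pool() {
        ExecutorService p = pool;
        if (p == null) {
            synchronized (lock) {
                p = pool;
                if (p == null) {          // re-check under the lock
                    p = Executors.newFixedThreadPool(poolSize);
                    pool = p;
                }
            }
        }
        return p;
    }
}
```

So until the first search comes in, only an int is stored - no threads, no factory.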

...
> I always prefer to qualify field access with this because it 
> distinguishes locals from fields immediately, but I'll remove them if 
> it's against the convention.

Yes, I understand. Same for me wrt. tabs vs. spaces ;-)
  
> >   b) Lines 103-108: Why public? IMHO there is no need to expose
> >      these implementation details - doing so is dangerous ...
> 
> An oversight.
Still?
  
> >   c) Lines 107,108: Why is this one needed? If programmed cleanly,
> >      doesn't this list contain one entry only? The IndexSearcher
> >      which is equal to this.searcher - i.e. we already have a
> >      reference to it?
> 
> Not in the case of searching multiple projects.

Well, let me be a little bit simplistic: the "client" wants a handle to a
usable searcher from the cache for a list of directories - whether it is
a list of one entry or more doesn't matter. It is the task of the cache
to do the right thing. Maybe thinking of it as a filesystem layer
helps: the app asks the OS for a file handle it can work with - the app
doesn't care whether the storage below is a mirror/raid5/ramdisk/bla
...

> The index searchers are 
> cached once per directory. So when searching multiple projects, one needs 
> to get the index searchers for each project. But the index searchers 
> cannot be combined (MultiSearcher is deprecated), so the index readers 
> of the index searchers are combined instead.

Yepp. So if it is too hard/complex to manage searchers, why not focus on
managing just the IndexReaders? AFAIK creating searchers is pretty cheap
as long as the IndexReaders are already there ... I would guess this
makes the whole thing much simpler and gets rid of the bloat ...
 
> problem is the Lucene API is not very clean. If IndexSearcher were an 
> interface, we could return an IndexSearcher decorator from SearcherCache 
> that would hide all these cleanup details.

Can't follow. IndexSearcher has .getIndexReader() as well as
.getSubReaders(), and if in doubt Java has instanceof. What else would be
needed?
  
> I'll refactor this to have the SearcherCache return an IndexSearcher in 
> all cases (including multiple projects) alongside with an object for 
> releasing the resources.

Still looks odd. Why is it not possible to do:
        sc = RuntimeEnvironment.getInstance().getSearcherCache();
        searcher = sc.get(File ...);
and on destroy:
        sc.release(searcher);
?
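That two-call surface could be spelled out as a plain-Java sketch like the one below. Searcher is a placeholder type here, not Lucene's IndexSearcher, and the refcounting is deliberately simplified (e.g. the race between a final release and a concurrent get is ignored):

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Placeholder for whatever the cache hands out to clients.
class Searcher {
    final String key;
    final AtomicInteger refCount = new AtomicInteger(0);
    Searcher(String key) { this.key = key; }
}

// The whole client-visible surface: get a searcher for n directories,
// release it on destroy - the cache does the right thing internally.
class SearcherCache {
    private final ConcurrentMap<String, Searcher> cache = new ConcurrentHashMap<>();

    Searcher get(File... dirs) {
        StringBuilder k = new StringBuilder();
        for (File d : dirs) {
            k.append(d.getPath()).append('|');   // one dir or more - same path
        }
        Searcher s = cache.computeIfAbsent(k.toString(), Searcher::new);
        s.refCount.incrementAndGet();
        return s;
    }

    void release(Searcher s) {
        if (s.refCount.decrementAndGet() == 0) {
            cache.remove(s.key);                 // last user gone: evict
        }
    }
}
```

The client never sees readers, managers, or cleanup details - just get and release.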
 
...
> >   g) Lines 210..222: tribute to unclean design? Asking the
> >      SearcherCache for a Searcher for certain dirs just to get a
> >      reference to its internally used IndexReaders to build a new
> >      Searcher from it - does this make any sense? Sounds to me like
> >      a very roundabout way and also raises the question:
> >      Do we have a SearcherCache or not? Why does one need to bypass
> >      it from time to time?
> 
> Well, there's no way to combine the IndexSearchers without using 
> deprecated functionality,

Yepp, and the current version shows how to do it w/o using deprecated
stuff ;-)

> so to search in multiple projects we need to 
> use the IndexReaders directly.

Yepp - let the cache do the work. The whole *WithCleanup() stuff seems
to be overhead.  

> We cannot cache IndexSearchers for 
> multiple projects. If we have 10 projects, we would potentially have to 
> store 1023 searchers for multiple project search --  ...
> . The penalty for creating an IndexSearcher is small compared to that 
> of creating an IndexReader, so this is not a big issue.

So obviously caching/managing Searchers is not such a good idea ;-)
  
> And sure, we *could* store the IndexReaders in a separate place and 
> don't fetch them via the IndexSearcher. It would be pointless IMO.

Why? Take the direct route!

> Maybe 
> you'd feel a little cleaner, but it would complicate the bookkeeping 
> unnecessarily :p

I would assume the opposite ;-)
  
...
> >   c) Lines 48-54: The current behavior is that each request gets
> >      a new ThreadPool with max 2*(CPUs+1), if required. Obviously
...
> Obviously, I haven't done any benchmarking on the appropriate size of 

I didn't either. That's why I would go with a dynamic pool for now.

> the thread pool. If the searches are not CPU bound it might (or it might 
> not) help to increase the pool substantially. I recognized the chosen 
> default might not be appropriate, which is why I added a configuration 
> option to set it.

Ehmm, not really. A _fixed size_ pool is still used. Why not let it
grow and shrink on demand, to a certain degree?
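For example (bounds and keep-alive time below are arbitrary illustration values, not a recommendation):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class DynamicPoolFactory {
    // Core size 0 and a bounded maximum: threads get created only when
    // work actually arrives and die again after 30s of idling. If the
    // pool is saturated, CallerRunsPolicy makes the submitting (http)
    // thread run the search itself instead of rejecting it.
    static ThreadPoolExecutor create() {
        int maxThreads = 2 * (Runtime.getRuntime().availableProcessors() + 1);
        return new ThreadPoolExecutor(
                0, maxThreads,
                30L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>(),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

So an idle webapp holds no search threads at all, and a busy one never exceeds the bound.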
  
> >   d) Lines 58..63: synchronizing newThread just because of getting a
> >      unique int is a bad choice. It should be unsynced and use an
> >      AtomicInteger instead.
> 
> This is nitpicking. Even if by chance some thread pool workers are 
...

Yes, but if it changes its nature just a little bit (e.g. by passing a
single different param, like max. poolsize), the picture really changes ...
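For completeness, the AtomicInteger variant from the review comment above might look like this (factory name and naming scheme are made up):

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Unsynchronized thread factory: the unique id comes from an
// AtomicInteger, so newThread() needs no lock even if the pool
// creates threads concurrently.
class NamedThreadFactory implements ThreadFactory {
    private final AtomicInteger count = new AtomicInteger(0);
    private final String prefix;

    NamedThreadFactory(String prefix) { this.prefix = prefix; }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + count.incrementAndGet());
        t.setDaemon(true);   // don't keep the JVM alive for idle workers
        return t;
    }
}
```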

...
 
> >      2) Does it make sense, that a http thread "forks" another thread
> >         and blocks until it has finished its work? I.e. the search
> >         within a single dir (project) should always be done by the
> >         [http] thread itself, w/o the use of another thread ...
> 
> You have to use several threads if you want to be able to search several 
> index segments at the same time.

A little hint: the -O option is your friend :)
http://src.iws.cs.ovgu.de/source/xref/opengrok-jel/src/org/opensolaris/opengrok/index/IndexDatabase.java#472
  
> >      3) Since the whole thing is to avoid unnecessary Searcher/Reader
> >         creation, shouldn't a more clever algorithm be used to
> >         obtain SearchManagers/maintain the map?
> >         I think it would be better to use a normal map and a
> >         ReentrantLock -> lookup again -> on fail create and add
> >         instead of blindly creating and eventually discarding a
> >         SM+Searcher+Reader ...
> 
> If the planets align and two or more people search on the same project, 
> which hasn't been searched on before, at the same time, it would 
> be possible that you would discard an IndexSearcher.

Ehmm, nope.

> What you're saying 
> is that you want to introduce a penalty (using a lock) on EVERY search 

Ehmm, nope. Only if sm == null !

> against a very unlikely one time hit (it is a one time hit because once 
> the IndexSearcher is cached, the condition cannot happen).

Well, think a little bit further. There are people who have a whole
bunch of projects, of which some get used very frequently and some
(maybe large ones) are used only once a day/month ... So a clever cache
could indeed discard the latter, e.g. on memory pressure or periodically ...
Also, if a concurrent hashmap gets used, one is using a lock implicitly
anyway ...
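The lookup-lock-recheck pattern mentioned above, as a sketch (SearchManagerStub stands in for the real SearchManager; only the locking shape matters here):

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// "lookup -> on miss lock -> lookup again -> on fail create and add":
// the lock is only taken when sm == null, so the common cache-hit path
// pays no explicit locking cost and nothing expensive is built twice.
class SearchManagerStub { }

class ManagerMap {
    private final ConcurrentMap<File, SearchManagerStub> map = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();
    int created = 0;                           // how often the expensive path ran

    SearchManagerStub get(File dir) {
        SearchManagerStub sm = map.get(dir);   // fast path, no explicit lock
        if (sm == null) {
            lock.lock();
            try {
                sm = map.get(dir);             // re-check: a racer may have won
                if (sm == null) {
                    sm = createManager(dir);   // expensive: opens reader etc.
                    map.put(dir, sm);
                }
            } finally {
                lock.unlock();
            }
        }
        return sm;
    }

    private SearchManagerStub createManager(File dir) {
        created++;
        return new SearchManagerStub();
    }
}
```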
 
> >    g) Last but not least, what happens, if the configuration (data 
> > root)
> >       changes?
> 
> Nothing catastrophic. The current searchers are leaked. We could 
> destroy the SearcherCache on configuration reload though. That would 
> also allow easily changing the thread pool size at runtime.

Yepp. BTW: when I see a finalize method in a normal application, my alarm
bells start ringing loudly. Most of the time it is a sign of unclean
management/design problems ...
  
> > A better way to implement?
> >   a) SearcherCache: could be either a singleton -
> >      SearcherCache.getInstance() - or a "normal" Instance managed by 
> > the
> >      RuntimeEnvironment (RE) - the easy way wrt. management, e.g.:
> 
> Well, there's a reason people don't use the singleton antipattern 
> anymore.

Ohohoh, not that fast! Singletons do indeed make sense and are used heavily
by any developer who knows how and when to use them! Maybe it helps to
think of it as a service used by several [unrelated] entities ...

> Among other things, it's a violation of the separation of 
> concerns

The opposite is the case: e.g. why should each application which does
some internet stuff implement its own DNS resolver? Or why should one
who just wants to do some XML parsing implement its own XML*Factory, or
have its own instance of it, when it always does the same stuff, no
matter who is asking?

> -- a class should not manage its instances.

Singleton == single instance for all objects which use the same
classloader ...
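For illustration, the classic initialization-on-demand holder idiom, which yields exactly one instance per classloader without any explicit locking (the class name is made up):

```java
// Lazy, thread-safe singleton: the JVM's class-initialization rules
// guarantee Holder.INSTANCE is created exactly once per classloader,
// and only when getInstance() is first called.
class SearcherCacheSingleton {
    private SearcherCacheSingleton() { }

    private static final class Holder {
        static final SearcherCacheSingleton INSTANCE = new SearcherCacheSingleton();
    }

    static SearcherCacheSingleton getInstance() {
        return Holder.INSTANCE;
    }
}
```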

> Since the lifetime 
> of the SearcherCache is associated with that of the application, the 
> instance is managed via WebappListener.

Think bigger. Don't put constraints on it, when they are not needed at
all (or make life even harder). Try to write reusable stuff ...

> It's also possible to have it 
> associated to the lifetime of the configuration, as you advocate. The 

Well, I haven't even ruled out having one cache for the complete JVM ;-)

> reason I didn't have RuntimeEnvironment manage the cache is because I 
> thought it was to be used exclusively for dealing with the configuration 
> (being in the package it is and all).

Hmmm, the main task of such a cache is actually to deal with indexes
(n directories). So all it needs to know is what directory contains
what the client is interested in. And this info gets passed straight
through by the client.
How it gets configured (if at all) is a different story. The current
approach is what you said, but it would not be a big deal to configure
it via a JVM property or JMX or a different webapp or CLI ... as well.
  
...
> You're not considering the fact that using thread pools is 
> advantageous even with a single index, as several segments can be read 
> simultaneously if you use a thread pool.

See the -O option. Also, AFAIK there is no index splitter which would
perhaps allow proper load balancing ...

> >      - the sm.maybeRefresh() part can be omitted, if the cache gets
> >        shut down/a new one is used when the RE.config gets changed
> 
> I'm not sure I agree with this. When you're updating the indexes for 
> several projects it can take some time for it to finish and the 
> configuration to be reloaded.

Well, not sure what you mean. At the moment we are "reloading" the
indices all the time ;-)

> I wouldn't want to be using the old 
> indexes all that time.

Of course not.
  
...
> >    b) - SearchHelper should get a ref to the SC either via
> >         SearcherCache.getInstance() or RE.getSearcherCache()
> >       - the only thing it does, is sc.getSearcher(File ....) and
> >       - on destroy: sc.release(searcher)
> 
> This is not really possible because the SearcherCache can't get to the 
> manager just by using the IndexSearcher. See above on how I plan to 
> address this.

So the problem seems to be the SearcherManager itself? Is it perhaps
the wrong/too simple tool for the right purpose?
When looking at it, I wonder why it is needed at all - this is
exactly what the cache should do by itself - perhaps a little bit
more intelligently ...
http://src.iws.cs.ovgu.de/source/xref/lucene-3.6.0/lucene/core/src/java/org/apache/lucene/search/SearcherManager.java

But as said, there is probably not much benefit in managing searchers;
just dealing with IndexReaders and taking the boilerplate code
(create/destroy searchers) out of the clients should be sufficient and
make life much easier ...

Last but not least: perhaps it is the right time to give the
RuntimeEnvironment a shutdown method as well?


Have fun,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
_______________________________________________
opengrok-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opengrok-dev
