Jens

thanks for the review & I will pull your short fix for now, if you don't object ;) But I still like the idea of searchers cache and we should add it - Gustavo, will you be able to address Jenses comments? I also think we should probably have the algorithm that counts how many threads to produce in one place - I think we use it for indexing, too and maybe we should align them, or just have them in one place - after all thread pool size is a "configuration variable".

(the threadpool or better usage of new searching - this is something I should probably investigate when moving to latest lucene - but I just had time to upgrade, make lucene compile and download itself (so we don't polute .hg dir) and didn't investigate all new additions ... the addition of assert failures was enough of a lucene surprise for me :( but that's why we're in dev version to figure all this out and stabilize before release )

thank you
Lubos

P.S. I have a working prototype of tika integration and a sample pdf parser so I am figuring out how to get this into opengrok so we can use the engine even more ;)

On 11.6.2012 7:41, Jens Elkner wrote:
On Fri, Jun 08, 2012 at 02:24:32PM +0000, [email protected] wrote:
http://defect.opensolaris.org/bz/show_bug.cgi?id=19215

--- Comment #13 from Gustavo Lopes <[email protected]> 2012-06-08 14:24:28 
UTC ---
Make that

https://bitbucket.org/cataphract/opengrok/compare/bug19215..ce0b759

now that I've updated the default branch.
Hmmm, strange, the PageConfig.java diff of that URL doesn't show up on
my desktops (probably because of a 403 error) ...  Anyway, I cloned this
repo and made a webrev against 1385:1387 to have something referrable: see
http://iws.cs.uni-magdeburg.de/~elkner/tmp/gustavo/

But first: The main problem wrt. leaks is, that the Searchers do not
auto-close the used IndexReaders/Executors anymore. So to restore the
old behavior (fix the bug), something like this should be sufficient:
http://src.iws.cs.ovgu.de/source/diff/opengrok-jel/src/org/opensolaris/opengrok/web/SearchHelper.java?r1=/opengrok-jel/src/org/opensolaris/opengrok/web/SearchHelper.java@1385:1c62bc7908a0&r2=/opengrok-jel/src/org/opensolaris/opengrok/web/SearchHelper.java@1424:504c5781fb6a&format=u&full=0

Wrt. to the webrev/Gustvos patches:

1) src/org/opensolaris/opengrok/web/WebappListener.java:
   a) Shouldn't runtime related settings/instanca be fetched via the
      RuntimeEnvironment, e.g. like
      
RuntimeEnvironment.getInstance()[.getConfiguration()].getSearchPool[Size]()
      Furthermore, even if obtained via RuntimeEnvironment, shouldn't lazy
      instantiation should be preferred?
      IMHO: no changes required here

2) web/search.jsp
   a) The SearchHelper should not have have any dependencies on a webapp/
      servlet context/its consumer - it should be usable in a "standalone"
      app in the same manner. As shown above, there is no need for such a
      reference.  So IMHO no changes required here as well.

3) src/org/opensolaris/opengrok/web/PageConfig.java
   a) same thing - no changes required at all.

4) src/org/opensolaris/opengrok/configuration/RuntimeEnvironment.java
   a) line 41: import not needed - no changes required

5) src/org/opensolaris/opengrok/configuration/Configuration.java:
   - line 195: a little nit - not required, since globals are always
     initialized to 0/null

6) src/org/opensolaris/opengrok/web/SearchHelper.java
   Changes look IMHO a little bit chaotic:
   a) "this." prefix == overhead - shouldn't be used if not required
      (readability, unwritten project code convention)
   b) Lines 103-108: Why public? IMHO there is no need/is dangerous to
      expose these impementation details ...
   c) Lines 107,108: Why is this one needed? If programmed cleanly,
      doesn't this list contain one entry, only? the IndexSearcher which is
      equal to this.searcher - i.e. we already have a reference to it ?
   d) Lines 103..105: Why is this needed here? SearcherManagers
      are acquired from the "SearcherCache" based on the index dir.
      Why is it not possible to aquire the needed Searcher directly from
      the "SearcherCache" based on the index dir?
      So IMHO, if such a list is needed, it should be part of the
      SearchManager. Isn't it pretty unclean/dangerous, to let an
      "outsider" manage/control the innards of the SearcherCache?
      IMHO better encapsulation is needed.
   e) Lines 133..143: not needed - see 1)
   f) Lines 184..227: as said in d) - any reference/occurance of
      SearcherManager looks like a design flaw - should go into
      "SearcherCache"
   g) Lines 210..222: tribut to unclean design? Asking the
      SearcherCache for Searcher for certain dirs just to get a
      reference to its internally used IndexReaders to build a new
      Searcher from it - does this make any sense? Sounds for me from
      the back through the breast and also rises the question:
      Do we have a SearcherCache or not? Why does one need to bypass it
      from time to time?
   h) Line 227: similar to g) - Why does the SearchHelper need to know
      anything about an executor, when it doesn't need it - the obvious
      owner/user/manager seem to be the SearcherCache?
   i) Lines 460-472: as said, seems to be odd as well, since a
      SearchHelper has one aka this.searcher, only ...

7) src/org/opensolaris/opengrok/search/SearcherCache.java
   a) big flaw: there is no way to shutdown this Cache. So a restart
      of the web app may cause big leaks as well ...
      Also resources can't be freed, even if there are no consumers
      for a long time ...
   b) Lines 70-72: the internally used threadPool, which is essential
      for proper working, should not be exposed to outsiders.
   c) Lines 48-54: The current behavior is, that each request gets
      a new ThreadPool with max 2*(CPUs+1), if required. Obviously
      not that smart. However, with the suggested change there are
      max. only 2*(CPUs+1) threads available for all requests. Depending
      on the load (number of concurrent requests and queried projects),
      this may actually result in blocking requests (kind of sequential
      behavior) and lead to timeouts wrt. answers. Not sure, what the
      search threads are actually doing, but may be a FixedThreadPool
      may cause more problems, than it solves. Unless no deeper
      research/experience was gathered, for now I would probably prefer a
      dynamic Thread Pool Executor with a core pool size of 2*(CPUs+1) and a
      max. pool size of e.g. 128 or 256 threads (depending on the number of
      projects, machine, ...). JavaConsole/JMX might be used to find out more 
...
   d) Lines 58..63: synchronizing newThread just because of getting a
      unique int is a bad choice. It should be unsynced and use an
      AtomicInteger instead.
   e) Line 64: a tribute to the missing shutdown method? ;-)
   f) Line 77..84:
      1) Why is it necessary to create a new SearcherFactory for each
         index dir? Shouldn't factories be singletons a priori?
      2) Does it make sense, that a http thread "forks" another thread
         and blocks until it has finished its work? I.e. the search
         within a single dir (project) should always be done by the
         [http] thread itself, w/o the use of another thread ...
      3) Since the whole thing is to avoid unnecessary Searcher/Reader
         creation, shouldn't be a more clever algorithm be used to obtain
         SearchManagers/maintain the map?
         I think it would be better to use a normal map and a
         ReentrantLock -> lookup again -> on fail create and add
         instead of blindly creating and eventuelly discarding a
         SM+Searcher+Reader ...
    g) Last but not least, what happens, if the configuration (data root)
       changes?

A better way to implement?
   a) SearcherCache: could be either a singleton -
      SearcherCache.getInstance() - or a "normal" Instance managed by the
      RuntimeEnvironment (RE) - the easy way wrt. management, e.g.:
      - RE.getSearcherCache() creates lazy a instance of the
        SearcherCache if not already done, keeps a ref to it and returns
        it
      - on RE.setConfiguration() [data root change] it may shutdown the
        SC and set the internal ref to null
      - there should be only one instance of a SearcherFactory be used
        by the SC
      - the SearcherFactory should create newSearcher with a threadpool
        arg only, if it has subreaders
      - the sm.maybeRefresh() part can be ommitted, if the cache gets
        shutdowned/a new one is used, when the RE.config gets changed
      - implementation details like threadpool, SM usage should be kept
        private
      - misc bla mentioned above
    b) - SearchHelper should get a ref to the SC either via
         SearcherCache.getInstance() or RE.getSearcherCache()
       - the only thing it does, is sc.getSearcher(File ....) and
       - on destroy: sc.release(searcher)
       NOTE: The implementation details, whether the SC only caches
       readers and creates each time a new Searcher on getSearcher(...)
       or caches created searchers, is completely hidden from the
       consumer (e.g. SearchHelper) and thus can be changed/tuned without any
       trouble if necessary ...

Hopefully I've catched most issues/problems ;-)

Cheers,
jel.


_______________________________________________
opengrok-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opengrok-dev

Reply via email to