On Fri, Mar 27, 2009 at 08:28:50AM -0400, Michael McCandless wrote:

> Are more than one thread allowed inside the Lucy core at once?

I would like to allow that.  However, I think it's important to make three
points up front.

  1) Concurrency is hard.  Even in languages with comparatively good support
     for threads like Java, threaded programming is a bug-spawning developer
     timesuck.
  2) We will not be able to abstract multiple host threading models so that we
     can make sophisticated use of them in the Lucy core.
  3) Multiple processes will always be available -- but threads won't.

For those reasons, in my opinion we should keep our threading ambitions to a
minimum.

I think we should have two priorities:

  1) Don't break the host's threading model.
  2) Make it possible to exploit threads in a limited way during search.

> Or are we "up-front" expecting one to always use separate processes to
> gain concurrency?

Fortunately, thanks to mmap, we are going to be able to make excellent use of
multiple processes.  If we had no choice but to read index caches into process
memory every time a la Java Lucene, we would have far more motivation to rely
on threads within a single process as our primary concurrency model.  
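
To illustrate the mmap point, here is a rough Python sketch (not Lucy code;
the file name and the helper names write_cache, read_cache_slice and demo are
made up for the example).  Each process that calls read_cache_slice maps the
same file read-only, so on typical operating systems the pages are served from
the shared page cache instead of being copied into each process's heap.

```python
import mmap
import os
import tempfile


def write_cache(path, payload):
    """Write a stand-in 'index cache' file (hypothetical format: raw bytes)."""
    with open(path, "wb") as fh:
        fh.write(payload)


def read_cache_slice(path, start, length):
    """Map the file read-only and copy out just the requested slice.

    Nothing is read eagerly; the OS faults pages in on demand.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[start:start + length])


def demo():
    """Create a small cache file and read part of it back through mmap."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "cache.bin")  # hypothetical file name
    write_cache(path, b"0123456789")
    return read_cache_slice(path, 2, 3)
```

Any number of searcher processes could call read_cache_slice on the same path
without first loading the whole file into their own memory.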

For indexing, I think we should make it possible to support concurrency using
multiple indexer *objects*.  Whether those multiple indexer objects get used
within a single multi-threaded app, or within separate processes shouldn't be
important.  However, I think it's very important that we not *require* threads
to exploit concurrency at index-time.

For searching, I think we have no choice but to support threads.  There are
certain things which cannot be achieved using a process-based concurrency
model because portable IPC techniques are too crude -- e.g. HitCollector-based
scoring routines.

> Whichever it is, Lucy will need to do something when crossing the
> bridge to "mate" to the Host language's thread model.

I think what we're going to have to do is issue a callback to the Host
whenever multiple threads might be launched, and wait for that call to return
after all threads have concluded their work.

In a multi-threaded Host, several threads might run in parallel.  In a
single-threaded Host, the threaded calls will run sequentially.
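
A minimal Python sketch of that callback idea follows.  Everything in it is an
assumption for illustration (the names run_sequential, run_threaded and
core_search, and the toy "segments" that are just lists of terms); it is not a
proposed Lucy API.  The core hands a batch of tasks to whatever runner the
Host supplies and blocks until the runner returns.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(tasks):
    """Runner for a single-threaded Host: execute the tasks one after another."""
    return [task() for task in tasks]


def run_threaded(tasks):
    """Runner for a multi-threaded Host: execute the tasks on a thread pool.

    Leaving the 'with' block waits for every task to finish.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def core_search(run_tasks, segments, term):
    """Stand-in for the core: build one task per segment, then call the Host.

    The core does not know or care whether run_tasks uses threads.
    """
    tasks = [lambda seg=seg: seg.count(term) for seg in segments]
    return sum(run_tasks(tasks))
```

Both runners give the same answer; only the Host decides whether the work
overlaps in time.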

> At some point, as described above, a single search will need to use
> concurrency; it seems like Lucy should allow multiple threads into the
> core for this reason.

I think we have no choice but to allow threads during search in order to
exploit multiple processors and return the answer to a query as fast as
possible.

Mike, I know you would prefer not to tie the index format to our concurrency
model, but I think a one-thread-per-segment scoring model makes a lot of
sense.  Using skipping info could work with core classes, but there's tension
between that and making it easy to write plugins:  

It's easy to tell a plugin "skip to the next segment".  (In fact, I think we
might consider making all Scorers single-segment only.)  It's hard to require
that all Scorer and DataReader subclasses implement intra-segment skipping
support.

In order to support multi-threaded search for custom index components, I think
we should adopt a segment-based model and adjust our index optimization
APIs and algorithms to fit that model.
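
Here is a rough Python sketch of one-thread-per-segment scoring, under loudly
stated assumptions: a "segment" is just a list of (doc_id, text) pairs, the
score is a raw term count, and score_segment and search are hypothetical names.
Each thread only ever sees one segment, and the per-segment top hits are merged
afterwards.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def score_segment(segment, term, k):
    """Score one segment in isolation and return its top k (score, doc_id) hits."""
    hits = []
    for doc_id, text in segment:
        score = text.split().count(term)  # toy scoring: raw term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(segments, term, k):
    """Run one scoring task per segment, then merge the per-segment results."""
    if not segments:
        return []
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        per_segment = list(
            pool.map(lambda seg: score_segment(seg, term, k), segments)
        )
    merged = heapq.nlargest(k, (hit for hits in per_segment for hit in hits))
    return [(doc_id, score) for score, doc_id in merged]
```

Note that the scoring function never has to skip around inside a segment,
which is the property that would keep plugin authors' lives simple.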

---

I should also note that my personal priority regarding threads has been and
remains to avoid foreclosing on the option of using them.  However, I'm
working in a single-threaded environment right now, and I don't have the means
to test my code for thread-safety.

The first module we'd have to work on to make Lucy safe for threads would be
Lucy::Util::Hash, which is used to associate class names with VTable
instances.  However, I'm not going to delay submitting that module for the
sake of making it thread-safe first.
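
For what "thread-safe" would mean for that registry, here is a small Python
stand-in (not the Lucy::Util::Hash implementation; VTableRegistry,
fetch_or_create and the factory argument are invented for the example).  The
lookup and the insert happen under one lock, so two threads asking for the same
class name can never end up with two different VTable objects.

```python
import threading


class VTableRegistry:
    """Toy class-name -> vtable map guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_name = {}

    def fetch_or_create(self, class_name, factory):
        """Return the vtable for class_name, creating it at most once.

        factory is a hypothetical callable that builds a new (non-None)
        vtable object from the class name.
        """
        with self._lock:
            vtable = self._by_name.get(class_name)
            if vtable is None:
                vtable = factory(class_name)
                self._by_name[class_name] = vtable
            return vtable
```

A single coarse lock is the simplest choice and is assumed here only because
registration is rare compared to search work.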

Marvin Humphrey

