[ https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466377 ]

Marvin Humphrey commented on LUCENE-710:
----------------------------------------


On Jan 19, 2007, at 2:04 PM, Michael McCandless (JIRA) wrote:
>   * I think NFS support is part of Lucene's core mission.

When I asserted that IndexFileDeleter had nothing to do with Lucene's core
mission, I meant: you don't use Lucene to build yourself an app which helps
you delete files.  

> Yes there is an open question now on what to do about the confusion on using
> IndexReader vs IndexWriter.  I think moving towards "use IndexWriter for
> changes, use IndexReader for reading" is the best solution here.  But I
> don't see how this relates to allowing subclassing of IndexFileDeleter to
> make your own deletion policy.

They're hard to refactor because they're both big, so adding either code or
API commitments to them should be avoided when possible.  We're in agreement
about the desirability of simplicity.  We part ways in how we measure
simplicity: I give greater emphasis to simplicity of API design.
 
> I disagree on this point ("no" API is better than subclassing). 

We're talking past each other.  I was generalizing: a 100% successful, purely
internal "black box" solution is always better than a solution that involves
the user.

> I would not want to add file locking & additional complexity into the Lucene
> core, just to handle NFS.

This is where our differing priorities manifest.  I would rather add some
code complexity to the Lucene "core" than accept the increased 
support burden of an expanded API.

Ironically, though you consider supporting NFS "part of Lucene's core
mission", for the average user your proposal as it stands is not very
user-friendly.  People like Chuck, Doron, and Robert will have no trouble with
it, but if you're a newcomer to Lucene and you "just want to put an index on
NFS", subclassing IndexFileDeleter will pose a challenge.

I also think you may be overestimating the amount of effort it will take to
exploit advisory read locks.  (The vexing problem of how to issue a warning
when index creation is attempted on an NFS volume is orthogonal to the
read-locks approach as well.)  They should be easy in KS; I'll know soon enough.
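To make "advisory read locks" concrete, here is a minimal sketch of one way a
reader could take a shared lock on a commit point in Java NIO.  This is
illustrative only -- not KS or Lucene code -- and the helper name is made up:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ReadLockSketch {
    // Acquire an advisory shared (read) lock over a segments_N file.  On
    // Unix-like systems this maps to fcntl()-style record locks, which
    // recent NFS implementations support via the lock daemon.
    public static FileLock acquireReadLock(Path segmentsFile) throws IOException {
        FileChannel channel = FileChannel.open(segmentsFile, StandardOpenOption.READ);
        // shared = true requests a read lock: any number of readers may hold
        // it concurrently, but a deleter asking for an exclusive lock on the
        // same region would be refused while any reader still holds one.
        return channel.lock(0L, Long.MAX_VALUE, true);
    }
}
```

The deleter's side would then attempt a non-blocking exclusive lock
(`tryLock`) before removing a commit point, and simply skip any segments_N a
reader still has locked.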
 
However, there are some OO discipline issues which complicate applying what I
have in mind to Java Lucene.  In KS, the public API is defined solely via
documentation, so I can have code in Index call methods from Store without
having to expose those methods publicly.  With Lucene divided into multiple
packages, that's a problem.

> I think subclassing is perfect for this sort of situation.  

I'm not so enamored of subclassing. It's less constraining than some other
approaches, but it's still constraining.

Case in point: it's not possible to provide a subclass of IndexFileDeleter
which exploits advisory read locking under your current proposal.

In theory, your proposal even prevents the addition of such read locks to
Lucene later, because doing so could conflict with a deletions policy you've
allowed the user to set.  (Given that locks over NFS make you "nervous",
perhaps you consider foreclosing that option a feature. ;)

> It's like the various LockFactory implementations we have: there is no "one
> size fits all".

I don't think LockFactory ought to be exposed either.  :)

Reading from an index -- any index, on any volume -- should Just Work.
Writing to an index from a single IndexWriter should Just Work.  In a perfect
world, writing from multiple writers simultaneously would Just Work, too, but
as that's inherently impractical given Lucene's current design, opening a
second writer must fail.  That failure should be the only visible evidence
that a locking mechanism even exists.  

In my view, any deviation from that ideal API due to implementation defects
should be handled with exceptions rather than API additions.

In keeping with this philosophy, Lock is not publicly exposed in KinoSearch.
In fact, nothing about the locking mechanism is publicly exposed.  So far,
there seem to be three bugs with the current implementation: 

  * Stale NFS Filehandle exceptions.
  * Stale lock files interfering with unattended indexing sessions.  I plan 
    to mitigate this by moving to advisory write locks when possible.
  * Multiple machines can cause index corruption when attempting to write 
    simultaneously to a shared volume.  Moving the write.lock file to the
    index directory, as enabled by lockless commits, solves this problem.

Once advisory file locks are in place, and if they work as expected under
recent versions of NFS, I expect no further problems under any common, recent
Unix.

With the switch to lockless commits in KinoSearch, I've been able to refactor
Lock and eliminate all of Lock's subclasses, simplifying the KinoSearch "core".
"No more subclassing of Lock" was originally a line-item in my list of
"great stuff" about lockless commits, but I had to take it out because it
wasn't true for Lucene!

With Otis signing on to your solution, it looks like momentum is gathering for
it.  For the record, I don't think it's a catastrophic change, just suboptimal
and IMO not ready for addition until improved.  

I think you can do better.


> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "delete on last close", while Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673 ).
> With the lockless commits changes (LUCENE-701 ), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  EG with lockless we no longer re-use
> filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fail).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from the filesystem, the more portable
> Lucene will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep the past N days' worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snapshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673 )
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server and if
>     the searcher has been re-opened, the user can get inconsistent results
>     (= lost trust).  But with this, one series of search interactions could
>     explicitly stay on the snapshot it had started with.
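
The "policy" framework sketched in the quoted proposal might look something
like the following.  This is a hypothetical sketch of the idea, not a
committed Lucene API; the interface and class names are made up, and commit
points are modeled as plain segments_N filenames for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical policy interface: IndexFileDeleter would pass every commit
// point (segments_N file) it finds, oldest first, and the policy returns
// the subset that may be deleted.
interface DeletionPolicy {
    List<String> onCommit(List<String> commits);
}

// The current behavior expressed as the default policy: delete all but the
// most recent segments_N.
class KeepOnlyLastCommitPolicy implements DeletionPolicy {
    public List<String> onCommit(List<String> commits) {
        if (commits.isEmpty()) {
            return new ArrayList<String>();
        }
        // Everything except the last (newest) commit point is deletable.
        return new ArrayList<String>(commits.subList(0, commits.size() - 1));
    }
}
```

Other policies from the list above ("keep the last N", "keep N days' worth",
snapshot marking) would just be alternative implementations of the same
interface.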

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
