[jira] Commented: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

Michael McCandless (JIRA) Fri, 19 Jan 2007 14:09:52 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466165
 ]


Michael McCandless commented on LUCENE-710:
-------------------------------------------


Doron Cohen wrote:

> This ties solving the NFS issue with an extendable-file-deletion policy.
> I am wondering is this the right way, or, perhaps, should the reference 
> counting be considered alone, apart from the deletion policy.
> (Would modifying IndexFileDeleter to base on ref-count make it simpler
> or harder to maintain?)
>
> Also, IndexFileDeleter is doing delicate work - not sure you want 
> applications to mess with it. Better let applications control some
> simple well defined behavior, maybe the same way that a sorter 
> allows applications to provide a comparator, but keeps the sorting 
> algorithm for itself.

The solution I have in mind abstracts away all tricky details of
deleting files.  EG something like:

  public class OnlyLastCommitDeleter extends IndexFileDeleter {

    void onInit(List commits) {
      onCommit(commits);
    }

    void onCommit(List commits) {
      // Keep only the most recent commit (the last entry in the list).
      if (commits.size() > 1) {
        for(int i=0;i<commits.size()-1;i++) {
          deleteCommit(commits.get(i));
        }
      }
    }
  }

Ie, the sole responsibility of the IndexFileDeleter subclass (policy)
is to decide when to delete a commit.  The rest of the details
(figuring out what actual files can be deleted now that a given commit
segments_N is deleted) are handled by the base class (with in-memory
ref counting).
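
To make that split of responsibilities concrete, here is a minimal,
self-contained sketch (not the actual patch).  It assumes a hypothetical
CommitPoint class that lists the files one segments_N references, and a
hypothetical RefCountingDeleter base class that records deletions in a
list instead of calling into Directory.  It uses Java 8 collections only
for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one commit (segments_N) and the files it references.
final class CommitPoint {
  final String segmentsFile;
  final List<String> files;

  CommitPoint(String segmentsFile, List<String> files) {
    this.segmentsFile = segmentsFile;
    this.files = files;
  }
}

// Hypothetical base class: it owns the in-memory ref counting, so a
// subclass (the policy) only decides which commits to delete.
abstract class RefCountingDeleter {
  private final Map<String, Integer> refCounts = new HashMap<>();
  private final List<CommitPoint> commits = new ArrayList<>();
  // Stands in for real file deletion so the sketch runs without an index.
  final List<String> deletedFiles = new ArrayList<>();

  // Called after each new commit: count its files, then consult the policy.
  final void commit(CommitPoint commit) {
    for (String f : commit.files) {
      refCounts.merge(f, 1, Integer::sum);
    }
    commits.add(commit);
    // Pass a copy so the policy can call deleteCommit while iterating.
    onCommit(new ArrayList<>(commits));
  }

  // Drops a commit and deletes any file no remaining commit references.
  protected final void deleteCommit(CommitPoint commit) {
    if (!commits.remove(commit)) {
      return; // already deleted
    }
    deletedFiles.add(commit.segmentsFile);
    for (String f : commit.files) {
      int remaining = refCounts.merge(f, -1, Integer::sum);
      if (remaining == 0) {
        refCounts.remove(f);
        deletedFiles.add(f);
      }
    }
  }

  // The policy hook: commits are ordered oldest first.
  protected abstract void onCommit(List<CommitPoint> commits);
}

// The current default behavior ("remove all but the last one") as a policy.
class KeepOnlyLastCommit extends RefCountingDeleter {
  @Override
  protected void onCommit(List<CommitPoint> commits) {
    for (int i = 0; i < commits.size() - 1; i++) {
      deleteCommit(commits.get(i));
    }
  }
}
```

In this sketch a file shared by two commits survives until the last
commit referencing it is deleted, and the policy never has to know that.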

> Back to reference counting,- how about the following approach:
> - Add to Directory a FileReferenceCounter data member, get()/set() etc.
> - Add a class FileReferenceCounter with simple general methods:
>   void increment (String name)
>   void decrement (String name)
>   int getRefCount (String name)
> - Default implementation would do nothing, i.e. would not record 
>   references, and always return 0.
> - IndexReader, upon opening a segment, would call increment(segName)
> - IndexReader, upon closing a segment, would call decrement(segName)
> - IndexFileDeleter, before removing a file belonging to a certain segment, 
>   would verify getRefCount(segName)==0.
> - Notice that the FilereferenceCounter is available from the Directory, 
>   so no constructors should be added to IndexWriter/Reader.
> 
> So, this is adding to Directory a general file utility, no knowledge of 
> index structure required in Directory. Also, IndexFileDeleter can remain 
> as today, and at some later point can be made more powerful with various 
> deletion policies - but those policies remain unrelated to the NFS 
> issue - they can focus on point-in-time issues, where I think it 
> stemmed from. 
> 
> An NFS geared FileReferenceCounter would then be able to keep alive 
> "counter files", name those files based on counted fileName plus
> processID plus machID, base getRefCount on safety window since file 
> was last touched, etc. All this is left out from point-in-time 
> policies (how many/time points-in-time should be retained).

I think this approach could work, but rather than implementing it in
the Lucene core (adding methods to Directory), I'd like to see it tested
as a custom deletion policy + wrappers around IndexReader
creation/destruction.
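
As a rough sketch of how that could look outside the core: the
FileReferenceCounter method names below come from the proposal quoted
above, while InMemoryFileReferenceCounter and CountedReaderHandle are
hypothetical names made up for illustration.  The in-memory variant only
works within a single JVM, so it is not the NFS-geared counter described
above, just the simplest thing to experiment with.

```java
import java.util.HashMap;
import java.util.Map;

// Default from the proposal: records nothing and always reports 0.
class FileReferenceCounter {
  void increment(String name) {}
  void decrement(String name) {}
  int getRefCount(String name) { return 0; }
}

// Hypothetical in-memory variant; valid only inside one JVM.
class InMemoryFileReferenceCounter extends FileReferenceCounter {
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  synchronized void increment(String name) {
    counts.merge(name, 1, Integer::sum);
  }

  @Override
  synchronized void decrement(String name) {
    // Drop the entry when the count reaches zero; never go negative.
    counts.computeIfPresent(name, (k, v) -> v > 1 ? v - 1 : null);
  }

  @Override
  synchronized int getRefCount(String name) {
    return counts.getOrDefault(name, 0);
  }
}

// Hypothetical wrapper around reader creation/destruction: it increments
// on open and decrements exactly once on close.
class CountedReaderHandle implements AutoCloseable {
  private final FileReferenceCounter counter;
  private final String segName;
  private boolean closed;

  CountedReaderHandle(FileReferenceCounter counter, String segName) {
    this.counter = counter;
    this.segName = segName;
    counter.increment(segName);
  }

  @Override
  public void close() {
    if (!closed) {
      closed = true;
      counter.decrement(segName);
    }
  }
}
```

A custom deletion policy could then skip any commit whose segments
still report getRefCount(segName) > 0.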

We have so much debate about the best "deletion policy" for NFS that
I'd like to make the minimal extension to the core (ability to make
your own "deletion policy") and then people can build their own and
try them out.

Mike

> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion of files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "delete on last close", and Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673).
> With the lockless commits changes (LUCENE-701), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  EG with lockless we no longer re-use
> filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fail).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from filesystem the more portable Lucene
> will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep the past N days' worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snapshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673)
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server and if
>     searcher has been re-opened, user can get inconsistent results (=
>     lost trust).  But with this, one series of search interactions could
>     explicitly stay on the snapshot it had started with.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
