[ 
https://issues.apache.org/jira/browse/LUCENE-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565467#comment-13565467
 ] 

Shai Erera commented on LUCENE-4731:
------------------------------------

Did you take a look at LUCENE-2632 (TeeDirectory)? I think it's similar to what 
you need? Perhaps you can compare the two?

Hmmm .. but the approach you've taken here is different. While TeeDirectory 
mimics Unix's "tee" and forwards calls to two directories, ReplicationDirectory 
implements ... replication.

I would not implement replication at the level of Directory, nor rely on things 
like "when segments.gen is seen we know commit happened". It sounds like too 
fragile a protocol to me.

Perhaps instead you can think of a replication module that lets a producer 
publish IndexCommits whenever it calls commit(), and lets consumers 
periodically poll the replicator for updates, giving it their current state. 
When an update is available, they replicate it. Or something along those 
lines. IndexCommits are much more "official" to rely on than the fragile 
algorithm you describe. For example, you can use SnapshotDeletionPolicy to hold 
onto IndexCommits that are currently being replicated, which prevents the 
deletion of their files. In your algorithm, by contrast, if two commits are 
called close to each other, one thread could start a replication action while 
the next commit deletes files in the middle of the copy, or deletes some of 
the files that haven't been copied yet.
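
To make the idea a bit more concrete, here is a rough sketch of the 
publish/poll protocol in plain Java. It does not use any Lucene classes: 
Revision is a hypothetical stand-in for an IndexCommit, and Replicator, 
checkForUpdate, release and canDelete are names made up for illustration. 
The in-use counter only mimics what SnapshotDeletionPolicy would do for real 
commits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ReplicatorSketch {

    /** Hypothetical stand-in for an IndexCommit: a generation plus its file names. */
    static final class Revision {
        final long generation;
        final List<String> files;

        Revision(long generation, List<String> files) {
            this.generation = generation;
            this.files = Collections.unmodifiableList(new ArrayList<>(files));
        }
    }

    /** Hypothetical replicator: the producer publishes, consumers poll with their state. */
    static final class Replicator {
        private Revision latest;
        // generation -> number of consumers currently copying that revision
        private final Map<Long, Integer> inUse = new HashMap<>();

        /** Called by the producer after each commit(). */
        synchronized void publish(Revision revision) {
            if (latest != null && revision.generation <= latest.generation) {
                throw new IllegalArgumentException("generation must increase");
            }
            latest = revision;
        }

        /** Called by a consumer; pins the returned revision until release(). */
        synchronized Optional<Revision> checkForUpdate(long currentGeneration) {
            if (latest == null || latest.generation <= currentGeneration) {
                return Optional.empty();
            }
            inUse.merge(latest.generation, 1, Integer::sum);
            return Optional.of(latest);
        }

        /** Called by a consumer once it has finished copying the files. */
        synchronized void release(Revision revision) {
            inUse.computeIfPresent(revision.generation, (g, n) -> n == 1 ? null : n - 1);
        }

        /** Files of a revision may go only if it is neither the latest nor being copied. */
        synchronized boolean canDelete(long generation) {
            boolean isLatest = latest != null && latest.generation == generation;
            return !isLatest && !inUse.containsKey(generation);
        }
    }

    public static void main(String[] args) {
        Replicator replicator = new Replicator();
        replicator.publish(new Revision(1, List.of("segments_1", "_0.cfs")));

        Revision update = replicator.checkForUpdate(0).orElseThrow(IllegalStateException::new);
        replicator.publish(new Revision(2, List.of("segments_2", "_0.cfs", "_1.cfs")));

        System.out.println("can delete gen 1 while copying: " + replicator.canDelete(1));
        replicator.release(update);
        System.out.println("can delete gen 1 after release: " + replicator.canDelete(1));
    }
}
```

The point of the sketch is that a second commit arriving in the middle of a 
copy cannot remove files from under the consumer, because the revision being 
copied stays pinned until the consumer releases it.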

I think what we need in Lucene is a Replicator module :).
                
> New ReplicatingDirectory mirrors index files to HDFS
> ----------------------------------------------------
>
>                 Key: LUCENE-4731
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4731
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/store
>            Reporter: David Arthur
>             Fix For: 4.2, 5.0
>
>         Attachments: ReplicatingDirectory.java
>
>
> I've been working on a Directory implementation that mirrors the index files 
> to HDFS (or other Hadoop supported FileSystem).
> A ReplicatingDirectory delegates all calls to an underlying Directory 
> (supplied in the constructor). The only hooks are the deleteFile and sync 
> calls. We submit deletes and replications to a single scheduler thread to 
> keep things serialized. During a sync call, if "segments.gen" is seen in the 
> list of files, we know a commit is finishing. After calling the delegate's 
> sync method, we initiate an asynchronous replication as follows.
> * Read segments.gen (before leaving ReplicatingDirectory#sync), save the 
> values for later
> * Get a list of local files from ReplicatingDirectory#listAll before leaving 
> ReplicatingDirectory#sync
> * Submit replication task (DirectoryReplicator) to scheduler thread
> * Compare local files to remote files, determine which remote files get 
> deleted, and which need to get copied
> * Submit a thread to copy each file (one thread per file)
> * Submit a thread to delete each file (one thread per file)
> * Submit a "finalizer" thread. This thread waits on the previous two batches 
> of threads to finish. Once finished, this thread generates a new 
> "segments.gen" remotely (using the version and generation number previously 
> read in).
> I have no idea where this would belong in the Lucene project, so I'll just 
> attach the standalone class instead of a patch. It introduces dependencies on 
> Hadoop core (and all the deps that brings with it).
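
The "compare local files to remote files" step in the quoted description 
amounts to two set differences. Below is a minimal sketch of that step in 
plain Java. ReplicationDiff, toCopy and toDelete are hypothetical names and 
are not taken from the attached ReplicatingDirectory.java. The sketch assumes 
that two files with the same name have the same content, and it leaves 
"segments.gen" out of both sets because the description has the finalizer 
write that file separately.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class ReplicationDiff {

    // Handled by the finalizer step in the description, so excluded here.
    static final String GEN_FILE = "segments.gen";

    /** Files present locally but missing remotely: these need to be copied. */
    static Set<String> toCopy(Set<String> local, Set<String> remote) {
        Set<String> result = new TreeSet<>(local);
        result.removeAll(remote);
        result.remove(GEN_FILE);
        return result;
    }

    /** Files present remotely but no longer present locally: these get deleted. */
    static Set<String> toDelete(Set<String> local, Set<String> remote) {
        Set<String> result = new TreeSet<>(remote);
        result.removeAll(local);
        result.remove(GEN_FILE);
        return result;
    }

    public static void main(String[] args) {
        Set<String> local = new TreeSet<>(
                Arrays.asList("_0.cfs", "_1.cfs", "segments_2", "segments.gen"));
        Set<String> remote = new TreeSet<>(
                Arrays.asList("_0.cfs", "segments_1", "segments.gen"));

        System.out.println("copy:   " + toCopy(local, remote));
        System.out.println("delete: " + toDelete(local, remote));
    }
}
```

Note that the local listing is only a snapshot taken at sync time. Nothing in 
this step stops a later commit from deleting a file in the copy set before it 
has been copied.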

