[ https://issues.apache.org/jira/browse/LUCENE-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565467#comment-13565467 ]
Shai Erera commented on LUCENE-4731:
------------------------------------

Did you take a look at LUCENE-2632 (TeeDirectory)? I think it's similar to what you need? Perhaps you can compare the two?

Hmmm .. but the approach you've taken here is different. While TeeDirectory mimics Unix's "tee" and forwards calls to two directories, ReplicatingDirectory implements ... replication.

I would not implement replication at the level of Directory, or rely on things like "when segments.gen is seen we know a commit happened". It sounds like too fragile a protocol to me. Perhaps instead you can think of a replication module which lets a producer publish IndexCommits whenever it calls commit(), and consumers can periodically poll the replicator for updates, giving it their current state. When an update is available, they do the replication? Or something along those lines?

IndexCommits are much more "official" to rely on than the fragile algorithm you describe. For example, you can use SnapshotDeletionPolicy to hold onto IndexCommits that are currently being replicated, which will prevent the deletion of their files. Whereas in your algorithm, if two commits are called close to each other, one thread could start a replication action while the next commit deletes the files in the middle of the copy, or deletes some of the files that haven't been copied yet.

I think what we need in Lucene is a Replicator module :).

> New ReplicatingDirectory mirrors index files to HDFS
> ----------------------------------------------------
>
>                 Key: LUCENE-4731
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4731
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/store
>            Reporter: David Arthur
>             Fix For: 4.2, 5.0
>
>         Attachments: ReplicatingDirectory.java
>
>
> I've been working on a Directory implementation that mirrors the index files
> to HDFS (or other Hadoop-supported FileSystem).
> A ReplicatingDirectory delegates all calls to an underlying Directory
> (supplied in the constructor). The only hooks are the deleteFile and sync
> calls. We submit deletes and replications to a single scheduler thread to
> keep things serialized. During a sync call, if "segments.gen" is seen in the
> list of files, we know a commit is finishing. After calling the delegate's
> sync method, we initiate an asynchronous replication as follows.
> * Read segments.gen (before leaving ReplicatingDirectory#sync), save the
> values for later
> * Get a list of local files from ReplicatingDirectory#listAll before leaving
> ReplicatingDirectory#sync
> * Submit replication task (DirectoryReplicator) to scheduler thread
> * Compare local files to remote files, determine which remote files get
> deleted, and which need to get copied
> * Submit a thread to copy each file (one thread per file)
> * Submit a thread to delete each file (one thread per file)
> * Submit a "finalizer" thread. This thread waits on the previous two batches
> of threads to finish. Once finished, this thread generates a new
> "segments.gen" remotely (using the version and generation number previously
> read in).
> I have no idea where this would belong in the Lucene project, so I'll just
> attach the standalone class instead of a patch. It introduces dependencies on
> Hadoop core (and all the deps that brings with it).
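
The publish/poll replicator suggested in the comment could look roughly like the sketch below. It is plain Java with no Lucene or Hadoop dependency, and every name in it (ReplicatorSketch, Revision, publish, checkForUpdate, release, filesToCopy, filesToDelete) is hypothetical. The reference count only stands in for what SnapshotDeletionPolicy would do with real IndexCommits: a revision stays held while any consumer is still copying it, so a later commit cannot pull its files away mid-copy.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a poll-based replicator; not an existing Lucene API. */
final class ReplicatorSketch {

    /** A published commit point: a generation number plus the files it references. */
    static final class Revision {
        final long generation;
        final Set<String> files;

        Revision(long generation, Collection<String> files) {
            this.generation = generation;
            this.files = Collections.unmodifiableSet(new TreeSet<>(files));
        }
    }

    private Revision latest;
    // generation -> number of holders (the "latest" slot plus consumers still copying)
    private final Map<Long, Integer> refCounts = new HashMap<>();

    /** Producer side: call after each commit with the commit's generation and file names. */
    synchronized void publish(long generation, Collection<String> files) {
        if (latest != null && generation <= latest.generation) {
            throw new IllegalArgumentException("generation must increase");
        }
        Revision previous = latest;
        latest = new Revision(generation, files);
        refCounts.put(generation, 1); // one reference for being the latest revision
        if (previous != null) {
            release(previous.generation); // drop the "latest" reference of the old revision
        }
    }

    /**
     * Consumer side: pass the generation the consumer already has.
     * Returns null when up to date; otherwise the latest revision, pinned until release().
     */
    synchronized Revision checkForUpdate(long currentGeneration) {
        if (latest == null || latest.generation <= currentGeneration) {
            return null;
        }
        refCounts.merge(latest.generation, 1, Integer::sum);
        return latest;
    }

    /** Called by a consumer when it has finished copying the given generation. */
    synchronized void release(long generation) {
        Integer count = refCounts.get(generation);
        if (count == null) {
            throw new IllegalStateException("generation not held: " + generation);
        }
        if (count == 1) {
            refCounts.remove(generation); // no holders left: its files may now be deleted
        } else {
            refCounts.put(generation, count - 1);
        }
    }

    /** True while the producer must keep this generation's files on disk. */
    synchronized boolean isHeld(long generation) {
        return refCounts.containsKey(generation);
    }

    /** Files named by the update that the consumer does not have yet. */
    static Set<String> filesToCopy(Revision update, Set<String> localFiles) {
        Set<String> result = new TreeSet<>(update.files);
        result.removeAll(localFiles);
        return result;
    }

    /** Consumer files the update no longer references; delete them only after the copy succeeds. */
    static Set<String> filesToDelete(Revision update, Set<String> localFiles) {
        Set<String> result = new TreeSet<>(localFiles);
        result.removeAll(update.files);
        return result;
    }
}
```

A consumer would call checkForUpdate with its current generation, copy filesToCopy, remove filesToDelete, and then call release. With two commits close together, the older revision stays held until the consumer copying it releases it, which is the case the segments.gen-based approach does not cover.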