[ https://issues.apache.org/jira/browse/LUCENE-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566914#comment-13566914 ]

David Arthur commented on LUCENE-4731:
--------------------------------------

TeeDirectory was actually the inspiration for this. The primary difference is 
that I want to asynchronously copy the index files, rather than writing to two 
underlying Directories synchronously. The motivating use case for me is that I 
want to run some Hadoop jobs that make use of my Lucene index, but I don't want 
to co-locate Lucene and Hadoop (sounds like a recipe for bad performance all 
around). With this async strategy, commits will get to HDFS _eventually_, and I 
don't really care how far behind the lag is, so long as I have a readable 
commit in HDFS.

Also, regarding push vs pull, I'd rather push from Lucene to avoid having to 
deal with remote agents pulling.

bq. Whereas in your algorithm, if two commits are called close to each other, 
one thread could start a replication action, while the next commit will delete 
the files in the middle of copy, or just delete some of the files that haven't 
been copied yet.

"Replication actions" and "delete actions" are serialized by a single thread, 
so they will not be interleaved.
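
To make that concrete, here is a minimal sketch of the idea (hypothetical 
names, not the attached ReplicatingDirectory.java): because replication and 
delete actions are both queued on one single-threaded executor, a delete 
triggered by a later commit cannot start while an earlier replication is still 
copying files.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SerializedReplicationScheduler {
    // One worker thread: queued actions run strictly in submission order.
    private final ExecutorService scheduler = Executors.newSingleThreadExecutor();

    // "Replication action": copy a batch of index files to the remote FileSystem.
    void submitReplication(String[] filesToCopy) {
        scheduler.submit(() -> {
            for (String f : filesToCopy) {
                copyToRemote(f); // finishes before any later-queued delete starts
            }
        });
    }

    // "Delete action": queued behind any replication submitted earlier.
    void submitDelete(String file) {
        scheduler.submit(() -> deleteRemote(file));
    }

    private void copyToRemote(String f) { /* copy the local file to HDFS */ }
    private void deleteRemote(String f) { /* remove the file from HDFS */ }
}
{code}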
                
> New ReplicatingDirectory mirrors index files to HDFS
> ----------------------------------------------------
>
>                 Key: LUCENE-4731
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4731
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/store
>            Reporter: David Arthur
>             Fix For: 4.2, 5.0
>
>         Attachments: ReplicatingDirectory.java
>
>
> I've been working on a Directory implementation that mirrors the index files 
> to HDFS (or other Hadoop supported FileSystem).
> A ReplicatingDirectory delegates all calls to an underlying Directory 
> (supplied in the constructor). The only hooks are the deleteFile and sync 
> calls. We submit deletes and replications to a single scheduler thread to 
> keep things serialized. During a sync call, if "segments.gen" is seen in the 
> list of files, we know a commit is finishing. After calling the delegate's 
> sync method, we initiate an asynchronous replication as follows.
> * Read segments.gen (before leaving ReplicatingDirectory#sync), save the 
> values for later
> * Get a list of local files from ReplicatingDirectory#listAll before leaving 
> ReplicatingDirectory#sync
> * Submit replication task (DirectoryReplicator) to scheduler thread
> * Compare local files to remote files, determine which remote files get 
> deleted, and which need to get copied
> * Submit a thread to copy each file (one thread per file)
> * Submit a thread to delete each file (one thread per file)
> * Submit a "finalizer" thread. This thread waits on the previous two batches 
> of threads to finish. Once finished, this thread generates a new 
> "segments.gen" remotely (using the version and generation number previously 
> read in).
> I have no idea where this would belong in the Lucene project, so I'll just 
> attach the standalone class instead of a patch. It introduces dependencies on 
> Hadoop core (and all the deps that it brings with it).
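
For reference, below is a condensed, self-contained sketch of the flow 
described in the issue above. It is not the attached 
ReplicatingDirectory.java, and it deliberately avoids tying itself to a 
particular Lucene Directory API version: the LocalDirectory interface, the 
helper methods, and the local/remote diff step are placeholders. Here the 
replicator task itself plays the role of the "finalizer", waiting for the 
per-file workers before regenerating segments.gen remotely.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ReplicatingDirectorySketch {
    interface LocalDirectory {            // stand-in for the wrapped Directory
        String[] listAll();
        void deleteFile(String name);
        void sync(List<String> names);
    }

    private final LocalDirectory delegate;
    private final ExecutorService scheduler = Executors.newSingleThreadExecutor(); // serializes actions

    ReplicatingDirectorySketch(LocalDirectory delegate) {
        this.delegate = delegate;
    }

    // Hook #1: forward the delete locally, then queue the remote delete.
    void deleteFile(String name) {
        delegate.deleteFile(name);
        scheduler.submit(() -> deleteRemote(name));
    }

    // Hook #2: when segments.gen is synced, a commit is finishing.
    void sync(List<String> names) {
        delegate.sync(names);
        if (names.contains("segments.gen")) {
            long[] genAndVersion = readSegmentsGen();   // read before leaving sync()
            String[] localFiles = delegate.listAll();   // snapshot the local file list
            scheduler.submit(new DirectoryReplicator(localFiles, genAndVersion));
        }
    }

    // Compares local files to remote files, copies/deletes in parallel (one task
    // per file), then regenerates segments.gen remotely once all workers finish.
    private class DirectoryReplicator implements Runnable {
        private final String[] localFiles;
        private final long[] genAndVersion;

        DirectoryReplicator(String[] localFiles, long[] genAndVersion) {
            this.localFiles = localFiles;
            this.genAndVersion = genAndVersion;
        }

        @Override
        public void run() {
            List<String> toCopy = new ArrayList<>();    // local files missing remotely
            List<String> toDelete = new ArrayList<>();  // remote files no longer present locally
            // ... diff localFiles against the remote listing here ...
            ExecutorService workers = Executors.newCachedThreadPool();
            for (String f : toCopy)   workers.submit(() -> copyToRemote(f));
            for (String f : toDelete) workers.submit(() -> deleteRemote(f));
            workers.shutdown();
            try {
                workers.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS); // "finalizer" wait
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            writeRemoteSegmentsGen(genAndVersion);      // regenerate segments.gen on HDFS
        }
    }

    private long[] readSegmentsGen()               { return new long[] {0L, 0L}; } // placeholder
    private void copyToRemote(String name)         { /* copy the local file to HDFS */ }
    private void deleteRemote(String name)         { /* remove the file from HDFS */ }
    private void writeRemoteSegmentsGen(long[] gv) { /* write version + generation remotely */ }
}
{code}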
