[ https://issues.apache.org/jira/browse/LUCENE-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13567396#comment-13567396 ]

Shai Erera commented on LUCENE-4731:
------------------------------------

I see, so you only allow one commit at a time. That's not great either ... e.g. 
if the replicating thread is copying a large index commit (due to merges or 
something), all other committing processes are blocked until it finishes. This 
makes indexing on Hadoop even more horrible (if such a thing is possible :)).

You don't have to do pull requests; you can have an agent running on the Hadoop 
cluster (where the MapReduce jobs run) that polls the index directory 
periodically and then pushes the files to HDFS (see the sketch after this 
list). The difference is that it will:

* Take a snapshot of the index, so the files it copies are guaranteed not to 
get deleted out from under it.
* Not block indexing operations. If it is copying a large index commit and a 
few more commits are made in parallel by the indexing process, then when that 
copy finishes, it replicates a single index commit with all the recent 
changes. That might even make it more efficient.
* Not rely on a fragile algorithm, e.g. the detection of segments.gen.
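A minimal sketch of such an agent, assuming the SnapshotDeletionPolicy API as 
of Lucene 4.2 and Hadoop's FileSystem; the class, method, and directory names 
here are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;

/** Hypothetical agent that pushes the latest pinned commit to HDFS. */
public class SnapshotPushAgent {

  private final SnapshotDeletionPolicy snapshotter =
      new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
  private final Path localIndexDir;   // local index directory
  private final Path remoteIndexDir;  // target directory on HDFS

  public SnapshotPushAgent(Path localIndexDir, Path remoteIndexDir) {
    this.localIndexDir = localIndexDir;
    this.remoteIndexDir = remoteIndexDir;
  }

  /** Pass this to IndexWriterConfig.setIndexDeletionPolicy on the indexing side. */
  public SnapshotDeletionPolicy getPolicy() {
    return snapshotter;
  }

  /** Run periodically by the agent's scheduler thread. */
  public void pushLatestCommit() throws IOException {
    // Pin the most recent commit; its files can't be deleted until release().
    IndexCommit commit = snapshotter.snapshot();
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      for (String file : commit.getFileNames()) {
        fs.copyFromLocalFile(new Path(localIndexDir, file),
                             new Path(remoteIndexDir, file));
      }
    } finally {
      snapshotter.release(commit);  // the commit's files may be deleted again
    }
  }
}

Indexing continues while the copy runs; the snapshot only pins the copied 
commit's files, and however many commits happened in the meantime, the next 
run replicates just the latest one.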
                
> New ReplicatingDirectory mirrors index files to HDFS
> ----------------------------------------------------
>
>                 Key: LUCENE-4731
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4731
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/store
>            Reporter: David Arthur
>             Fix For: 4.2, 5.0
>
>         Attachments: ReplicatingDirectory.java
>
>
> I've been working on a Directory implementation that mirrors the index files 
> to HDFS (or other Hadoop supported FileSystem).
> A ReplicatingDirectory delegates all calls to an underlying Directory 
> (supplied in the constructor). The only hooks are the deleteFile and sync 
> calls. We submit deletes and replications to a single scheduler thread to 
> keep things serialized. During a sync call, if "segments.gen" is seen in the 
> list of files, we know a commit is finishing. After calling the delegate's 
> sync method, we initiate an asynchronous replication as follows:
> * Read segments.gen (before leaving ReplicatingDirectory#sync), save the 
> values for later
> * Get a list of local files from ReplicatingDirectory#listAll before leaving 
> ReplicatingDirectory#sync
> * Submit replication task (DirectoryReplicator) to scheduler thread
> * Compare local files to remote files, determine which remote files get 
> deleted, and which need to get copied
> * Submit a thread to copy each file (one thread per file)
> * Submit a thread to delete each file (one thread per file)
> * Submit a "finalizer" thread. This thread waits on the previous two batches 
> of threads to finish. Once finished, this thread generates a new 
> "segments.gen" remotely (using the version and generation number previously 
> read in).
> I have no idea where this would belong in the Lucene project, so I'll just 
> attach the standalone class instead of a patch. It introduces dependencies on 
> Hadoop core (and all the deps that brings with it).
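For concreteness, the compare/copy/delete steps above (run serially here 
rather than one thread per file) might look roughly like the following against 
the Hadoop FileSystem API; this is a hedged sketch, not the attached 
ReplicatingDirectory.java, and the class and method names are invented:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectoryDiff {

  /** Mirror the given local index file names into remoteDir. */
  public static void replicate(String[] localFiles, Path localDir,
                               FileSystem fs, Path remoteDir) throws IOException {
    Set<String> local = new HashSet<String>(Arrays.asList(localFiles));
    Set<String> remote = new HashSet<String>();
    for (FileStatus status : fs.listStatus(remoteDir)) {
      remote.add(status.getPath().getName());
    }

    // Remote files that no longer exist locally are stale: delete them.
    for (String name : remote) {
      if (!local.contains(name)) {
        fs.delete(new Path(remoteDir, name), false);
      }
    }

    // Local files missing remotely get copied. Lucene index files (other than
    // segments.gen, which the "finalizer" step regenerates remotely) are
    // write-once, so a matching name implies matching contents.
    for (String name : local) {
      if (!remote.contains(name)) {
        fs.copyFromLocalFile(new Path(localDir, name),
                             new Path(remoteDir, name));
      }
    }
  }
}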
