I assume that Google also has a distributed index built on top of their GFS/MapReduce implementation. Any idea how they achieve this?
J.D.

On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote:
>
> There seem to be a few other players in this space too.
>
> Are you from Rackspace?
> (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)
>
> AOL also has a Hadoop/Solr project going on.
>
> CNET does not have much brewing there. Although Yonik and I had talked
> about it a bunch -- but that was long ago.
>
> --cw
>
> Clay Webster                                   tel:1.908.541.3724
> Associate VP, Platform Infrastructure          http://www.cnet.com
> CNET, Inc. (Nasdaq:CNET)                       mailto:[EMAIL PROTECTED]
>
> > -----Original Message-----
> > From: Ning Li [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 06, 2008 1:57 PM
> > To: general@lucene.apache.org; [EMAIL PROTECTED]; solr-[EMAIL PROTECTED]
> > Subject: Lucene-based Distributed Index Leveraging Hadoop
> >
> > There have been several proposals for a Lucene-based distributed index
> > architecture:
> > 1) Doug Cutting's "Index Server Project Proposal" at
> >    http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
> > 2) Solr's "Distributed Search" at
> >    http://wiki.apache.org/solr/DistributedSearch
> > 3) Mark Butler's "Distributed Lucene" at
> >    http://wiki.apache.org/hadoop/DistributedLucene
> >
> > We have also been working on a Lucene-based distributed index
> > architecture. Our design differs from the above proposals in the way it
> > leverages Hadoop as much as possible. In particular, HDFS is used to
> > reliably store Lucene instances, Map/Reduce is used to analyze documents
> > and update Lucene instances in parallel, and Hadoop's IPC framework is
> > used. Our design is geared toward applications that require a highly
> > scalable index and where batch updates to each Lucene instance are
> > acceptable (versus finer-grained document-at-a-time updates).
> >
> > We have a working implementation of our design and are in the process of
> > evaluating its performance. An overview of our design is provided below.
> > We welcome feedback and would like to know if you are interested in
> > working on it. If so, we would be happy to make the code publicly
> > available. At the same time, we would like to collaborate with people
> > working on existing proposals and see if we can consolidate our efforts.
> >
> > TERMINOLOGY
> > A distributed "index" is partitioned into "shards". Each shard
> > corresponds to a Lucene instance and contains a disjoint subset of the
> > documents in the index. Each shard is stored in HDFS and served by one
> > or more "shard servers". Here we only talk about a single distributed
> > index, but in practice multiple indexes can be supported.
> >
> > A "master" keeps track of the shard servers and the shards being served
> > by them. An "application" updates and queries the global index through
> > an "index client". An index client communicates with the shard servers
> > to execute a query.
> >
> > KEY RPC METHODS
> > This section lists the key RPC methods in our design. To simplify the
> > discussion, some of their parameters have been omitted.
> >
> > On the Shard Servers
> >   // Execute a query on this shard server's Lucene instance.
> >   // This method is called by an index client.
> >   SearchResults search(Query query);
> >
> > On the Master
> >   // Tell the master to update the shards, i.e., Lucene instances.
> >   // This method is called by an index client.
> >   boolean updateShards(Configuration conf);
> >
> >   // Ask the master where the shards are located.
> >   // This method is called by an index client.
> >   LocatedShards getShardLocations();
> >
> >   // Send a heartbeat to the master. This method is called by a
> >   // shard server. In the response, the master informs the
> >   // shard server when to switch to a newer version of the index.
> >   ShardServerCommand sendHeartbeat();
> >
> > QUERYING THE INDEX
> > To query the index, an application sends a search request to an index
> > client. The index client then calls the shard server search() method for
> > each shard of the index, merges the results, and returns them to the
> > application. The index client caches the mapping between shards and
> > shard servers by periodically calling the master's getShardLocations()
> > method.
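As a rough illustration of the scatter-gather step described above, here is a sketch of how an index client might fan a query out to the shard servers and merge the per-shard hits. Query, SearchResults, and the shard-server search() RPC are the names used in the proposal; the IndexClient class, the ShardServer interface, the placeholder type bodies, and the merge step are assumptions for illustration, and the RPC plumbing (e.g., Hadoop IPC proxies) is omitted.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only. Query, SearchResults and the shard-server search() RPC are
    // the names used in the design above; the placeholder bodies here stand in
    // for the real (omitted) definitions and the Hadoop IPC plumbing.
    public class IndexClient {

      /** Placeholder for the design's query type. */
      public static class Query { }

      /** Placeholder for the design's per-shard / merged result type. */
      public static class SearchResults { }

      /** Stand-in for a shard server's RPC interface (the search() method above). */
      public interface ShardServer {
        SearchResults search(Query query);
      }

      // One RPC proxy per shard, refreshed by periodically calling the
      // master's getShardLocations() method.
      private volatile List<ShardServer> shardServers = new ArrayList<ShardServer>();

      /** Scatter the query to every shard, then gather and merge the hits. */
      public SearchResults search(Query query) {
        List<SearchResults> perShard = new ArrayList<SearchResults>();
        for (ShardServer server : shardServers) {
          perShard.add(server.search(query));  // each shard holds a disjoint doc subset
        }
        return merge(perShard);
      }

      /** Stands in for merging per-shard hits into one globally ranked list;
          the real merge logic is omitted. */
      private SearchResults merge(List<SearchResults> perShard) {
        return perShard.isEmpty() ? new SearchResults() : perShard.get(0);  // placeholder, not a real merge
      }
    }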
> > UPDATING THE INDEX USING MAP/REDUCE
> > To update the index, an application sends an update request to an index
> > client. The index client then calls the master's updateShards() method,
> > which schedules a Map/Reduce job to update the index. The Map/Reduce job
> > updates the shards in parallel and copies the new index files of each
> > shard (i.e., Lucene instance) to HDFS.
> >
> > The updateShards() method includes a "configuration", which provides
> > information for updating the shards. More specifically, the configuration
> > includes the following information:
> > - Input path. This provides the location of updated documents, e.g.,
> >   HDFS files or directories, or HBase tables.
> > - Input formatter. This specifies how to format the input documents.
> > - Analysis. This defines the analyzer to use on the input. The analyzer
> >   determines whether a document is being inserted, updated, or deleted.
> >   For inserts or updates, the analyzer also converts each input document
> >   into a Lucene document.
> >
> > The Map phase of the Map/Reduce job formats and analyzes the input (in
> > parallel), while the Reduce phase collects and applies the updates to
> > each Lucene instance (again in parallel). The updates are applied using
> > the local file system where a Reduce task runs and then copied back to
> > HDFS. For example, if the updates caused a new Lucene segment to be
> > created, the new segment would be created on the local file system
> > first, and then copied back to HDFS.
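To make the Reduce-side flow above concrete, here is a sketch of what one Reduce task might do for the shard it is responsible for. The IndexWriter and FileSystem calls are standard Lucene and Hadoop APIs of that era, but everything else is an assumption made for illustration: the ShardUpdater class, the DocumentUpdate type (one analyzed insert/update/delete), the "id" field used as the document key, and the local/HDFS shard paths.

    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Sketch of the work one Reduce task might perform for its shard.
    public class ShardUpdater {

      /** Hypothetical: one analyzed input record (insert, update, or delete). */
      public interface DocumentUpdate {
        boolean isDelete();
        String id();
        Document document();   // the Lucene document produced by the analyzer
      }

      public void applyUpdates(Iterator<DocumentUpdate> updates,
                               File localShardDir, Path hdfsShardDir,
                               Configuration conf) throws IOException {
        // 1. Apply the batched updates to the shard on the reduce task's
        //    local file system (not directly against HDFS); assumes the
        //    previous shard version was already copied to localShardDir.
        IndexWriter writer =
            new IndexWriter(localShardDir, new StandardAnalyzer(), false);
        while (updates.hasNext()) {
          DocumentUpdate u = updates.next();
          if (u.isDelete()) {
            writer.deleteDocuments(new Term("id", u.id()));
          } else {
            writer.updateDocument(new Term("id", u.id()), u.document());
          }
        }
        writer.close();   // flushes any new segments to the local directory

        // 2. Publish the shard by copying it back to HDFS. (In the real
        //    design only newly created files would need to be copied, since
        //    unchanged segments are shared with the previous index version.)
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path(localShardDir.getPath()), hdfsShardDir);
      }
    }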
> > When the Map/Reduce job completes, a "new version" of the index is ready
> > to be queried. It is important to note that the new version of the index
> > is not derived from scratch. By leveraging Lucene's update algorithm, the
> > new version of each Lucene instance will share as many files as possible
> > with the previous version.
> >
> > ENSURING INDEX CONSISTENCY
> > At any point in time, an index client always has a consistent view of
> > the shards in the index. The results of a search query include either
> > all or none of a recent update to the index. The details of the
> > algorithm that accomplishes this are omitted here, but the basic flow is
> > pretty simple.
> >
> > After the Map/Reduce job to update the shards completes, the master will
> > tell each shard server to "prepare" the new version of the index. After
> > all the shard servers have responded affirmatively to the "prepare"
> > message, the new index is ready to be queried. An index client will then
> > lazily learn about the new index when it makes its next
> > getShardLocations() call to the master.
> >
> > In essence, a lazy two-phase commit protocol is used, with "prepare" and
> > "commit" messages piggybacked on heartbeats. After a shard has switched
> > to the new index, the Lucene files in the old index that are no longer
> > needed can safely be deleted.
> >
> > ACHIEVING FAULT-TOLERANCE
> > We rely on the fault tolerance of Map/Reduce to guarantee that an index
> > update will eventually succeed. All shards are stored in HDFS and can be
> > read by any shard server in a cluster. For a given shard, if one of its
> > shard servers dies, new search requests are handled by its surviving
> > shard servers. To ensure that there is always enough coverage for a
> > shard, the master will instruct other shard servers to take over the
> > shards of a dead shard server.
> >
> > PERFORMANCE ISSUES
> > Currently, each shard server reads a shard directly from HDFS.
> > Experiments have shown that this approach does not perform very well,
> > with HDFS causing Lucene to slow down fairly dramatically (by well over
> > 5x when data blocks are accessed over the network). Consequently, we are
> > exploring different ways to leverage the fault tolerance of HDFS and, at
> > the same time, work around its performance problems. One simple
> > alternative is to add a local file system cache on each shard server.
> > Another alternative is to modify HDFS so that an application has more
> > control over where to store the primary and replicas of an HDFS block.
> > This feature may be useful for other HDFS applications (e.g., HBase). We
> > would like to collaborate with other people who are interested in adding
> > this feature to HDFS.
> >
> >
> > Regards,
> > Ning Li
>