Re: Lucene-based Distributed Index Leveraging Hadoop

Stefan Groschupf Thu, 03 Apr 2008 19:45:49 -0700

Hi All,

we are also very much interested in such a system and actually have torealize such a system for an project within the next 3 month.I would prefer to work on a open source solution instead of doinganother one behind closed doors, though we would need to start codingpretty soon. We have 3 fulltime developers we could contribute forthis time to such a project.

I'm happy to do all the organisational work like setting up thecomplete infrastructure etc to get it started.I suggest we start with an sourceforge project since this is fast tosetup and if we qualify for apache as an lucene or hadoop subprojectmigrate there later, or is it easy to start a apache incubator project?


We might just need a nice name for the project. Doug, any idea? :-)

Should we start from scratch or with a code contribution?
Someone still want to contribute its implementation?


Thanks.
Stefan







On Feb 6, 2008, at 10:57 AM, Ning Li wrote:

There have been several proposals for a Lucene-based distributed index
architecture.
1) Doug Cutting's "Index Server Project Proposal" at
   http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
2) Solr's "Distributed Search" at
   http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's "Distributed Lucene" at
   http://wiki.apache.org/hadoop/DistributedLucene
We have also been working on a Lucene-based distributed indexarchitecture.Our design differs from the above proposals in the way it leveragesHadoopas much as possible. In particular, HDFS is used to reliably storeLucene
instances, Map/Reduce is used to analyze documents and update Lucene
instances
in parallel, and Hadoop's IPC framework is used. Our design isgeared forapplications that require a highly scalable index and where batchupdatesto each Lucene instance are acceptable (verses finer-graineddocument at
a time updates).

We have a working implementation of our design and are in the process
of evaluating its performance. An overview of our design is providedbelow.We welcome feedback and would like to know if you are interested inworkingon it. If so, we would be happy to make the code publicly available.At thesame time, we would like to collaborate with people working onexisting
proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed "index" is partitioned into "shards". Each shardcorresponds
to
a Lucene instance and contains a disjoint subset of the documents inthe
index.
Each shard is stored in HDFS and served by one or more "shardservers". Herewe only talk about a single distributed index, but in practicemultiple
indexes
can be supported.
A "master" keeps track of the shard servers and the shards beingserved by
them. An "application" updates and queries the global index through an
"index client". An index client communicates with the shard servers to
execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

 On the Shard Servers
   // Execute a query on this shard server's Lucene instance.
   // This method is called by an index client.
   SearchResults search(Query query);

 On the Master
   // Tell the master to update the shards, i.e., Lucene instances.
   // This method is called by an index client.
   boolean updateShards(Configuration conf);

   // Ask the master where the shards are located.
   // This method is called by an index client.
   LocatedShards getShardLocations();

   // Send a heartbeat to the master. This method is called by a
   // shard server. In the response, the master informs the
   // shard server when to switch to a newer version of the index.
   ShardServerCommand sendHeartbeat();

QUERYING THE INDEX
To query the index, an application sends a search request to an index
client.
The index client then calls the shard server search() method foreach shardof the index, merges the results and returns them to theapplication. The
index client caches the mapping between shards and shard servers by
periodically calling the master's getShardLocations() method.

UPDATING THE INDEX USING MAP/REDUCE
To update the index, an application sends an update request to anindex
client.
The index client then calls the master's updateShards() method, which
schedules
a Map/Reduce job to update the index. The Map/Reduce job updates theshards
in
parallel and copies the new index files of each shard (i.e., Lucene
instance)
to HDFS.

The updateShards() method includes a "configuration", which provides
information for updating the shards. More specifically, theconfiguration
includes the following information:
- Input path. This provides the location of updated documents,e.g., HDFS
   files or directories, or HBase tables.
 - Input formatter. This specifies how to format the input documents.
- Analysis. This defines the analyzer to use on the input. Theanalyzerdetermines whether a document is being inserted, updated, ordeleted.
For
inserts or updates, the analyzer also converts each inputdocument into
   a Lucene document.

The Map phase of the Map/Reduce job formats and analyzes the input (in
parallel), while the Reduce phase collects and applies the updatesto eachLucene instance (again in parallel). The updates are applied usingthe local
file system where a Reduce task runs and then copied back to HDFS. For
example,
if the updates caused a new Lucene segment to be created, the newsegmentwould be created on the local file system first, and then copiedback to
HDFS.
When the Map/Reduce job completes, a "new version" of the index isready to
be
queried. It is important to note that the new version of the indexis not
derived from scratch. By leveraging Lucene's update algorithm, the new
version
of each Lucene instance will share as many files as possible as theprevious
version.

ENSURING INDEX CONSISTENCY
At any point in time, an index client always has a consistent viewof theshards in the index. The results of a search query include eitherall or
none
of a recent update to the index. The details of the algorithm toaccomplish
this are omitted here, but the basic flow is pretty simple.
After the Map/Reduce job to update the shards completes, the masterwill
tell
each shard server to "prepare" the new version of the index. Afterall theshard servers have responded affirmatively to the "prepare" message,the new
index is ready to be queried. An index client will then lazily learnaboutthe new index when it makes its next getShardLocations() call to themaster.
In essence, a lazy two-phase commit protocol is used, with "prepare"and"commit" messages piggybacked on heartbeats. After a shard hasswitched tothe new index, the Lucene files in the old index that are no longerneeded
can safely be deleted.

ACHIEVING FAULT-TOLERANCE
We rely on the fault-tolerance of Map/Reduce to guarantee that anindex
update
will eventually succeed. All shards are stored in HDFS and can beread by
any
shard server in a cluster. For a given shard, if one of its shardservers
dies,
new search requests are handled by its surviving shard servers. Toensure
that
there is always enough coverage for a shard, the master willinstruct other
shard servers to take over the shards of a dead shard server.

PERFORMANCE ISSUES
Currently, each shard server reads a shard directly from HDFS.Experimentshave shown that this approach does not perform very well, with HDFScausingLucene to slow down fairly dramatically (by well over 5x when datablocks
are
accessed over the network). Consequently, we are exploring differentways toleverage the fault tolerance of HDFS and, at the same time, workaround itsperformance problems. One simple alternative is to add a local filesystemcache on each shard server. Another alternative is to modify HDFS sothat anapplication has more control over where to store the primary andreplicas ofan HDFS block. This feature may be useful for other HDFSapplications (e.g.,HBase). We would like to collaborate with other people who areinterested in
adding this feature to HDFS.


Regards,
Ning Li


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com

Re: Lucene-based Distributed Index Leveraging Hadoop

Reply via email to