Re: Lucene-based Distributed Index Leveraging Hadoop
Stefan, Is there any material published about this? I would be very interested to see Mark's slides and hear about the discussion.

I believe all the slides will be available. I'll post a link as soon as I have one. Please keep in mind that katta is very young, and compass or solr might be more interesting if you need something working now, though they might have different goals and focus than dlucene or katta.

I am looking to have something working relatively quickly, but my performance needs and use cases are relatively modest (for now), so some degree of string and sticky tape in the implementation is okay in the short term. My main aim is to ensure that whatever I implement scales horizontally without too much drama. In terms of which project best fits my needs, my gut feeling is that dlucene is pretty close. It supports incremental updates, and doesn't build in dependencies on systems like HDFS or Terracotta (I don't yet understand all the implications of those systems, so I would rather keep things simple if possible). The obvious drawback is that dlucene doesn't seem to be an active public project.

Thanks for the reply, Stefan. I'll certainly be taking a look through the code for Katta, since no doubt there's a lot to learn in there. All the best... Rich
Re: Lucene-based Distributed Index Leveraging Hadoop
Stefan, I've got a lot of reading and learning to do :o) Thanks for the info, and good luck with your deployment. Rich
Re: Lucene-based Distributed Index Leveraging Hadoop
Should we start from scratch or with a code contribution? Does someone still want to contribute their implementation? I just noticed - too late though - that Ning already contributed the code to Hadoop. So I guess my question should be rephrased: what is the idea of moving this into its own project?
Re: Lucene-based Distributed Index Leveraging Hadoop
I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this? J.D.

On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote: There seem to be a few other players in this space too. Are you from Rackspace? (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data) AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there, although Yonik and I had talked about it a bunch -- but that was long ago. --cw

Clay Webster tel:1.908.541.3724 Associate VP, Platform Infrastructure http://www.cnet.com CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]

-----Original Message-----
From: Ning Li [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 06, 2008 1:57 PM
To: general@lucene.apache.org; [EMAIL PROTECTED]; solr-[EMAIL PROTECTED]
Subject: Lucene-based Distributed Index Leveraging Hadoop

There have been several proposals for a Lucene-based distributed index architecture.

1) Doug Cutting's Index Server Project Proposal at http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
2) Solr's Distributed Search at http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's Distributed Lucene at http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used for communication. Our design is geared toward applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (versus finer-grained document-at-a-time updates). We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below.
We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts.

TERMINOLOGY

A distributed index is partitioned into shards. Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index. Each shard is stored in HDFS and served by one or more shard servers. Here we only talk about a single distributed index, but in practice multiple indexes can be supported.

A master keeps track of the shard servers and the shards being served by them. An application updates and queries the global index through an index client. An index client communicates with the shard servers to execute a query.

KEY RPC METHODS

This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted.

On the Shard Servers:

  // Execute a query on this shard server's Lucene instance.
  // This method is called by an index client.
  SearchResults search(Query query);

On the Master:

  // Tell the master to update the shards, i.e., Lucene instances.
  // This method is called by an index client.
  boolean updateShards(Configuration conf);

  // Ask the master where the shards are located.
  // This method is called by an index client.
  LocatedShards getShardLocations();

  // Send a heartbeat to the master. This method is called by a
  // shard server. In the response, the master informs the
  // shard server when to switch to a newer version of the index.
  ShardServerCommand sendHeartbeat();

QUERYING THE INDEX

To query the index, an application sends a search request to an index client. The index client then calls the shard server search() method for each shard of the index, merges the results and returns them to the application.
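The query path described above (the index client fans search() out to every shard server, then merges the per-shard results) can be sketched in Java. This is an illustrative sketch only, not code from the proposal: the Hit record, the ShardServer interface, and the score-ordered merge are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the fan-out/merge described in the post.
public class IndexClientSketch {

    // One search hit: a document id and its relevance score (assumed shape).
    record Hit(String docId, float score) {}

    // The RPC exposed by each shard server; a lambda works as a fake shard.
    interface ShardServer {
        List<Hit> search(String query);
    }

    // The index client calls search() once per shard, gathers all hits,
    // and merges them by descending score before returning the top N.
    static List<Hit> search(String query, List<ShardServer> shards, int topN) {
        List<Hit> merged = new ArrayList<>();
        for (ShardServer shard : shards) {
            merged.addAll(shard.search(query)); // one RPC per shard
        }
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged.subList(0, Math.min(topN, merged.size()));
    }
}
```

In the real system each ShardServer call would be a Hadoop IPC request and would likely run in parallel rather than sequentially; the sequential loop here just keeps the merge logic visible.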
The index client caches the mapping between shards and shard servers by periodically calling the master's getShardLocations() method.

UPDATING THE INDEX USING MAP/REDUCE

To update the index, an application sends an update request to an index client. The index client then calls the master's updateShards() method, which schedules a Map/Reduce job to update the index. The Map/Reduce job updates the shards in parallel and copies the new index files of each shard (i.e., Lucene instance) to HDFS.

The updateShards() method includes a configuration, which provides information for updating the shards. More specifically, the configuration includes the following information:

  - Input path. This provides the location of updated documents, e.g
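Before the Map/Reduce job can update shards in parallel, each updated document has to be routed to the shard that owns it. The post doesn't say how documents are assigned to shards, so the sketch below assumes a simple hash partition on the document id, similar to what a Hadoop Partitioner would do; ShardRouter and route() are hypothetical names, not part of the described system.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: groups updated document ids by target shard so that
// each shard's batch update can then run as a parallel reduce task.
public class ShardRouter {

    static Map<Integer, List<String>> route(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new HashMap<>();
        for (String id : docIds) {
            // floorMod keeps the shard number non-negative even for
            // negative hash codes.
            int shard = Math.floorMod(id.hashCode(), numShards);
            byShard.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return byShard;
    }
}
```

Because the partition is deterministic, re-running the job routes the same document to the same shard, which is what keeps the shards' document sets disjoint.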
Re: Lucene-based Distributed Index Leveraging Hadoop
Clay Webster wrote: There seem to be a few other players in this space too. Are you from Rackspace? (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data) AOL also has a Hadoop/Solr project going on. CNET does not have much brewing there. Although Yonik and I had talked about it a bunch -- but that was long ago.

Hi. AOL has a couple of projects going on in the lucene/hadoop/solr space, and we will be pushing more stuff out as we can. We don't have anything going with solr over hadoop at the moment. I'm not sure if this would be better than what SOLR-303 does, but you should have a look at the work being done there. One of the things you mentioned is that the data sets are disjoint. SOLR-303 doesn't require this, and allows us to have a document stored in multiple shards (with different caching/update characteristics).

--cw Clay Webster tel:1.908.541.3724 Associate VP, Platform Infrastructure http://www.cnet.com CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]
Re: Lucene-based Distributed Index Leveraging Hadoop
No. I'm curious too. :) On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote: I assume that Google also has distributed index over their GFS/MapReduce implementation. Any idea how they achieve this? J.D.
Re: Lucene-based Distributed Index Leveraging Hadoop
Ning Li wrote: One main focus is to provide fault-tolerance in this distributed index system. Correct me if I'm wrong, but I think SOLR-303 is focusing on merging results from multiple shards right now. We'd like to start an open source project for a fault-tolerant distributed index system (or join one if it already exists) if there is enough interest. Making Solr work on top of such a system could be an important goal, and SOLR-303 is a big part of it in that case.

I guess it depends on how you set up your shards in 303. We plan on having a master/slave relationship on each shard, so that each shard would sync the same way Solr does currently. regards Ian

I should have made it clear that disjoint data sets are not a requirement of the system.

On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote: Hi. AOL has a couple of projects going on in the lucene/hadoop/solr space, and we will be pushing more stuff out as we can. We don't have anything going with solr over hadoop at the moment. I'm not sure if this would be better than what SOLR-303 does, but you should have a look at the work being done there. One of the things you mentioned is that the data sets are disjoint. SOLR-303 doesn't require this, and allows us to have a document stored in multiple shards (with different caching/update characteristics).