Re: Lucene-based Distributed Index Leveraging Hadoop

2008-08-22 Thread Richard Marr
Stefan,

 Is there any material published about this? I would be very interested to
 see Mark's slides and hear about the discussion.

I believe all the slides will be available. I'll post a link as soon
as I have one.

 Please keep in mind that Katta is very young and Compass or Solr might be
 more interesting if you need something working now, though they might have
 different goals and focus than dlucene or Katta.

I am looking to have something working relatively quickly, but my
performance needs and use cases are relatively modest (for now) so
some degree of string and sticky tape in the implementation is okay in
the short term. My main aim is to ensure that whatever I implement
scales horizontally without too much drama.

In terms of which project best fits my needs my gut feeling is that
dlucene is pretty close. It supports incremental updates, and doesn't
build in dependencies on systems like HDFS or Terracotta (I don't yet
understand all the implications of those systems so would rather keep
things simple if possible). The obvious drawback being that dlucene
doesn't seem to be an active public project.

Thanks for the reply Stefan. I'll certainly be taking a look through
the code for Katta since no doubt there's a lot to learn in there.

All the best...

Rich


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-08-22 Thread Richard Marr
Stefan,

I've got a lot of reading and learning to do :o)

Thanks for the info, and good luck with your deployment.

Rich


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-04-03 Thread Stefan Groschupf

Should we start from scratch or with a code contribution?
Does someone still want to contribute their implementation?
I just noticed - too late, though - that Ning already contributed the code to
Hadoop. So I guess my question should be rephrased: what is the idea of
moving this into its own project?




Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?

J.D.



On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:

 There seem to be a few other players in this space too.

 Are you from Rackspace?
 (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

 AOL also has a Hadoop/Solr project going on.

 CNET does not have much brewing there. Yonik and I had talked about it a
 bunch, but that was long ago.

 --cw

 Clay Webster   tel:1.908.541.3724
 Associate VP, Platform Infrastructure http://www.cnet.com
 CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]


  -Original Message-
  From: Ning Li [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 06, 2008 1:57 PM
  To: general@lucene.apache.org; [EMAIL PROTECTED]; solr-[EMAIL PROTECTED]
  Subject: Lucene-based Distributed Index Leveraging Hadoop
 
  There have been several proposals for a Lucene-based distributed index
  architecture.
   1) Doug Cutting's Index Server Project Proposal at
      http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
   2) Solr's Distributed Search at
      http://wiki.apache.org/solr/DistributedSearch
   3) Mark Butler's Distributed Lucene at
      http://wiki.apache.org/hadoop/DistributedLucene

  We have also been working on a Lucene-based distributed index architecture.
  Our design differs from the above proposals in the way it leverages Hadoop
  as much as possible. In particular, HDFS is used to reliably store Lucene
  instances, Map/Reduce is used to analyze documents and update Lucene
  instances in parallel, and Hadoop's IPC framework is used. Our design is
  geared for applications that require a highly scalable index and where
  batch updates to each Lucene instance are acceptable (versus finer-grained
  document-at-a-time updates).

  We have a working implementation of our design and are in the process of
  evaluating its performance. An overview of our design is provided below.
  We welcome feedback and would like to know if you are interested in
  working on it. If so, we would be happy to make the code publicly
  available. At the same time, we would like to collaborate with people
  working on existing proposals and see if we can consolidate our efforts.
 
  TERMINOLOGY
  A distributed index is partitioned into shards. Each shard corresponds to
  a Lucene instance and contains a disjoint subset of the documents in the
  index. Each shard is stored in HDFS and served by one or more shard
  servers. Here we only talk about a single distributed index, but in
  practice multiple indexes can be supported.

  A master keeps track of the shard servers and the shards being served by
  them. An application updates and queries the global index through an
  index client. An index client communicates with the shard servers to
  execute a query.
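
  To keep the subsets disjoint, a document can be routed to exactly one
  shard, e.g. by hashing its unique key. A minimal sketch in Java (the
  ShardRouter class and its names are illustrative assumptions, not part
  of this proposal):

    // Route a document to one of numShards disjoint shards by hashing
    // its unique key, so each document belongs to exactly one shard.
    public final class ShardRouter {
      private final int numShards;

      public ShardRouter(int numShards) {
        this.numShards = numShards;
      }

      public int shardFor(String docKey) {
        // Mask the sign bit rather than call Math.abs, which overflows
        // for Integer.MIN_VALUE.
        return (docKey.hashCode() & 0x7fffffff) % numShards;
      }
    }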
 
  KEY RPC METHODS
  This section lists the key RPC methods in our design. To simplify the
  discussion, some of their parameters have been omitted.

  On the Shard Servers
    // Execute a query on this shard server's Lucene instance.
    // This method is called by an index client.
    SearchResults search(Query query);

  On the Master
    // Tell the master to update the shards, i.e., Lucene instances.
    // This method is called by an index client.
    boolean updateShards(Configuration conf);

    // Ask the master where the shards are located.
    // This method is called by an index client.
    LocatedShards getShardLocations();

    // Send a heartbeat to the master. This method is called by a
    // shard server. In the response, the master informs the
    // shard server when to switch to a newer version of the index.
    ShardServerCommand sendHeartbeat();
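
  Since the design uses Hadoop's IPC framework, these methods might be
  grouped into protocol interfaces in the style of Hadoop's own daemons.
  A sketch (the interface names and the VersionedProtocol plumbing are
  assumptions, not code from this proposal; with Hadoop IPC the parameter
  and return types would also need to be Writable):

    import org.apache.hadoop.ipc.VersionedProtocol;

    // Implemented by each shard server; called by index clients.
    public interface ShardServerProtocol extends VersionedProtocol {
      SearchResults search(Query query);
    }

    // Implemented by the master; called by index clients and, for
    // heartbeats, by shard servers.
    public interface MasterProtocol extends VersionedProtocol {
      boolean updateShards(Configuration conf);
      LocatedShards getShardLocations();
      ShardServerCommand sendHeartbeat();
    }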
 
  QUERYING THE INDEX
  To query the index, an application sends a search request to an index
  client. The index client then calls the shard server search() method for
  each shard of the index, merges the results and returns them to the
  application. The index client caches the mapping between shards and shard
  servers by periodically calling the master's getShardLocations() method.
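
  A rough sketch of that scatter-gather step inside an index client (the
  thread-pool fan-out, the shardCache field and SearchResults.addAll() are
  assumptions about one plausible implementation, not code from this
  proposal):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public SearchResults search(final Query query) throws Exception {
      // Cached shard -> shard server mapping, refreshed periodically
      // via the master's getShardLocations().
      List<ShardServerProtocol> servers = shardCache.currentServers();
      ExecutorService pool = Executors.newFixedThreadPool(servers.size());
      List<Future<SearchResults>> futures =
          new ArrayList<Future<SearchResults>>();
      for (final ShardServerProtocol server : servers) {
        futures.add(pool.submit(new Callable<SearchResults>() {
          public SearchResults call() throws Exception {
            return server.search(query);   // one RPC per shard
          }
        }));
      }
      SearchResults merged = new SearchResults();
      for (Future<SearchResults> f : futures) {
        merged.addAll(f.get());            // merge per-shard hits
      }
      pool.shutdown();
      return merged;
    }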
 
  UPDATING THE INDEX USING MAP/REDUCE
  To update the index, an application sends an update request to an index
  client. The index client then calls the master's updateShards() method,
  which schedules a Map/Reduce job to update the index. The Map/Reduce job
  updates the shards in parallel and copies the new index files of each
  shard (i.e., Lucene instance) to HDFS.
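
  One plausible shape for that job, using the mapred API of the time: the
  map phase analyzes each updated document and keys it by its target
  shard, and each reduce group then applies one shard's batch of updates
  to its Lucene instance and copies the new index files to HDFS. A sketch
  of the map side (the class, the "index.num.shards" key and the routing
  scheme are assumptions, not code from this proposal):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Key each updated document by its target shard so that all updates
    // for one shard land in the same reduce group.
    public class ShardUpdateMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      private int numShards;

      public void configure(JobConf job) {
        numShards = job.getInt("index.num.shards", 1);  // hypothetical key
      }

      public void map(LongWritable offset, Text doc,
                      OutputCollector<IntWritable, Text> out,
                      Reporter reporter) throws IOException {
        int shard = (doc.toString().hashCode() & 0x7fffffff) % numShards;
        out.collect(new IntWritable(shard), doc);
      }
    }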
 
  The updateShards() method includes a configuration, which provides
  information for updating the shards. More specifically, the configuration
  includes the following information:
    - Input path. This provides the location of updated documents, e.g

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ian Holsman

Clay Webster wrote:

There seem to be a few other players in this space too.

Are you from Rackspace?
(http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

AOL also has a Hadoop/Solr project going on.

CNET does not have much brewing there. Yonik and I had talked about it a
bunch, but that was long ago.


Hi.
AOL has a couple of projects going on in the lucene/hadoop/solr space, 
and we will be pushing more stuff out as we can. We don't have anything 
going with solr over hadoop at the moment.


I'm not sure if this would be better than what SOLR-303 does, but you 
should have a look at the work being done there.


One of the things you mentioned is that the data sets are disjoint. 
SOLR-303 doesn't require this, and allows us to have a document stored 
in multiple shards (with different caching/update characteristics).

--cw

Clay Webster   tel:1.908.541.3724
Associate VP, Platform Infrastructure http://www.cnet.com
CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:

 I assume that Google also has distributed index over their
 GFS/MapReduce implementation. Any idea how they achieve this?

 J.D.



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ian Holsman

Ning Li wrote:

One main focus is to provide fault-tolerance in this distributed index
system. Correct me if I'm wrong, but I think SOLR-303 is focusing on merging
results from multiple shards right now. We'd like to start an open source
project for a fault-tolerant distributed index system (or join one if it
already exists) if there is enough interest. Making Solr work on top of such
a system could be an important goal, and SOLR-303 is a big part of it in
that case.


I guess it depends on how you set up your shards in 303.
We plan on having a master/slave relationship on each shard, so that 
each shard would sync the same way solr does currently.
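
In Solr at that time, per-shard master/slave sync meant the rsync-based
snapshot scripts: the master publishes a snapshot after each commit and
slaves pull it with snappuller/snapinstaller. As an example of the
master-side hook, along the lines of the example solrconfig.xml of that
era (the exe/dir paths vary by installation):

  <!-- After each commit, run snapshooter so slaves can pull the new
       snapshot via snappuller. -->
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>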


regards
Ian


I should have made it clear that disjoint data sets are not a requirement of
the system.


On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote:

Hi.
AOL has a couple of projects going on in the lucene/hadoop/solr space,
and we will be pushing more stuff out as we can. We don't have anything
going with solr over hadoop at the moment.

I'm not sure if this would be better than what SOLR-303 does, but you
should have a look at the work being done there.

One of the things you mentioned is that the data sets are disjoint.
SOLR-303 doesn't require this, and allows us to have a document stored
in multiple shards (with different caching/update characteristics).