Re: Creating Lucene index in Hadoop

2009-03-17 Thread Ning Li
 Lucene on a local disk benefits significantly from the local filesystem's
 RAM cache (aka the kernel's buffer cache).  HDFS has no such local RAM cache
 outside of the stream's buffer.  The cache would need to be no larger than
 the kernel's buffer cache to get an equivalent hit ratio.  And if you're

If the two cache sizes are the same, then yes. The difference is that the
local FS cache size is adjusted (more?) dynamically.


Cheers,
Ning


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
 I'm missing why you would ever want the Lucene index in HDFS for
 reading.

The Lucene indexes are written to HDFS, but that does not mean you
search the indexes stored in HDFS directly. HDFS is not designed for
random access. Usually the indexes are copied to the nodes where search
will be served. With http://issues.apache.org/jira/browse/HADOOP-4801,
however, it may become feasible to search on HDFS directly.
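
A rough sketch of that copy-then-search pattern, assuming a Lucene 2.9-style
API and made-up paths (this is not code from contrib/index itself):

  import java.io.File;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.FSDirectory;

  public class ShardCopySearch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Pull one index shard out of HDFS onto the local disk of the search node.
      Path shardInHdfs = new Path("/indexes/shard-00000");     // illustrative path
      Path localShard = new Path("/data/search/shard-00000");  // illustrative path
      fs.copyToLocalFile(shardInHdfs, localShard);

      // Open the local copy with plain Lucene and serve queries from it.
      IndexReader reader = IndexReader.open(
          FSDirectory.open(new File(localShard.toUri().getPath())));
      IndexSearcher searcher = new IndexSearcher(reader);
      System.out.println("Documents in shard: " + searcher.maxDoc());
      searcher.close();
      reader.close();
    }
  }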

Cheers,
Ning


On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 Does anyone have stats on how multiple readers on an optimized Lucene
 index in HDFS compare with a ParallelMultiReader (or whatever it's
 called) over RPC on a local filesystem?

 I'm missing why you would ever want the Lucene index in HDFS for
 reading.

 Ian

 Ning Li ning.li...@gmail.com writes:

 I should have pointed out that the Nutch index build and contrib/index
 target different applications. The latter is for applications that
 simply want to build a Lucene index from a set of documents - e.g. with no
 link analysis.

 As to writing Lucene indexes, both work the same way - write the final
 results to the local file system and then copy them to HDFS. In contrib/index,
 the intermediate results are kept in memory and not written to HDFS.
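
A rough sketch of that write-locally-then-copy step as it might look in a
reducer, using the old mapred API; the config keys, directory layout and class
name here are made up for illustration and this is not the actual contrib/index
code:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class LocalIndexReducerBase extends MapReduceBase {
    private JobConf job;
    private String localIndexDir;  // where Lucene writes during reduce (illustrative)
    private String hdfsShardDir;   // final shard location in HDFS (illustrative)

    public void configure(JobConf job) {
      this.job = job;
      this.localIndexDir = job.get("index.local.dir",
          "/tmp/index-" + job.get("mapred.task.id"));
      this.hdfsShardDir = job.get("index.shard.dir",
          "/indexes/shard-" + job.get("mapred.task.partition"));
    }

    public void close() throws IOException {
      // By the time close() runs, the Lucene IndexWriter has written and closed
      // the index under localIndexDir; only the finished files go into HDFS.
      FileSystem fs = FileSystem.get(job);
      fs.copyFromLocalFile(new Path(localIndexDir), new Path(hdfsShardDir));
    }
  }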

 Hope it clarifies things.

 Cheers,
 Ning


 On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 I understand why you would index in the reduce phase, because the anchor
 text gets shuffled to be next to the document.  However, when you index
 in the map phase, don't you just have to reindex later?

 The main point to the OP is that HDFS is a bad FS for writing Lucene
 indexes because of how Lucene works.  The simple approach is to write
 your index outside of HDFS in the reduce phase, and then merge the
 indexes from each reducer manually.
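
A rough sketch of that manual merge step, assuming a Lucene 2.9-era API and
made-up paths; each per-reducer index directory is added into one final index:

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeReducerIndexes {
    public static void main(String[] args) throws Exception {
      // args[0] is the merged index dir, the rest are per-reducer index dirs.
      IndexWriter writer = new IndexWriter(FSDirectory.open(new File(args[0])),
          new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

      Directory[] parts = new Directory[args.length - 1];
      for (int i = 1; i < args.length; i++) {
        parts[i - 1] = FSDirectory.open(new File(args[i]));
      }

      // Merge the per-reducer indexes into the single target index.
      writer.addIndexesNoOptimize(parts);
      writer.optimize();
      writer.close();
    }
  }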

 Ian

 Ning Li ning.li...@gmail.com writes:

 Or you can check out the index contrib. The difference between the two is that:
   - In Nutch's indexing map/reduce job, indexes are built in the
 reduce phase. Afterwards, they are merged into a smaller number of
 shards if necessary. The last time I checked, the merge process does
 not use map/reduce.
   - In contrib/index, small indexes are built in the map phase. They
 are merged into the desired number of shards in the reduce phase. In
 addition, they can be merged into existing shards.

 Cheers,
 Ning


 On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark








Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
Point 1 is good. But for point 2:
  - Won't it raise a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in the local FS,
right? In general, a Lucene index can be quite large. We may have to
cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting cutt...@apache.org wrote:
 Ning Li wrote:

 With
 http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
 become feasible to search on HDFS directly.

 I don't think HADOOP-4801 is required.  It would help, certainly, but it's
 so fraught with security and other issues that I doubt it will be committed
 anytime soon.

 What would probably help HDFS random access performance for Lucene
 significantly would be:
  1. A cache of connections to datanodes, so that each seek() does not
 require an open().  If we move HDFS data transfer to be RPC-based (see,
 e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
 for free, since RPC already caches connections.  We hope to do this for
 Hadoop 1.0, so that we use a single transport for all Hadoop's core
 operations, to simplify security.
  2. A local cache of read-only HDFS data, equivalent to the kernel's buffer
 cache.  This might be implemented as a Lucene Directory that keeps an LRU
 cache of buffers from a wrapped filesystem, perhaps a subclass of
 RAMDirectory.

 With these, performance would still be slower than a local drive, but
 perhaps not so dramatically.
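
A rough sketch of the LRU idea from point 2, written here as a block-caching
IndexInput over a wrapped (e.g. HDFS-backed) input rather than the RAMDirectory
subclass Doug mentions; the class name, block size and eviction policy are all
illustrative, assuming the Lucene 2.x BufferedIndexInput API:

  import java.io.IOException;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import org.apache.lucene.store.BufferedIndexInput;
  import org.apache.lucene.store.IndexInput;

  // Sketch only: not thread-safe, no byte budget, clone() not handled.
  public class CachingIndexInput extends BufferedIndexInput {
    private static final int BLOCK_SIZE = 64 * 1024;

    private final IndexInput delegate;       // e.g. an input reading from HDFS
    private final long length;
    private final Map<Long, byte[]> cache;   // LRU map of block start -> block bytes

    public CachingIndexInput(IndexInput delegate, final int maxBlocks) throws IOException {
      this.delegate = delegate;
      this.length = delegate.length();
      this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
          return size() > maxBlocks;   // simple LRU eviction
        }
      };
    }

    protected void readInternal(byte[] b, int offset, int len) throws IOException {
      long pos = getFilePointer();
      while (len > 0) {
        long blockStart = (pos / BLOCK_SIZE) * BLOCK_SIZE;
        byte[] block = cache.get(blockStart);
        if (block == null) {             // miss: read one block from the delegate
          int blockLen = (int) Math.min(BLOCK_SIZE, length - blockStart);
          block = new byte[blockLen];
          delegate.seek(blockStart);
          delegate.readBytes(block, 0, blockLen);
          cache.put(blockStart, block);
        }
        int inBlock = (int) (pos - blockStart);
        int toCopy = Math.min(len, block.length - inBlock);
        System.arraycopy(block, inBlock, b, offset, toCopy);
        pos += toCopy; offset += toCopy; len -= toCopy;
      }
    }

    protected void seekInternal(long pos) { /* position is tracked via getFilePointer() */ }

    public long length() { return length; }

    public void close() throws IOException { delegate.close(); }
  }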

 Doug



Re: Creating Lucene index in Hadoop

2009-03-13 Thread Ning Li
Or you can check out the index contrib. The difference between the two is that:
  - In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process does
not use map/reduce.
  - In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards.

Cheers,
Ning


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark




Re: Distributed Lucene - from hadoop contrib

2008-08-18 Thread Ning Li
On 8/12/08, Deepika Khera [EMAIL PROTECTED] wrote:
 I was imagining the two concepts of i) using hadoop.contrib.index to index
 documents and ii) providing search in a distributed fashion, to be all in
 one box.

Ideally, yes. However, while map/reduce works well for batch-building an
index, there is no consensus on whether it is a good idea to serve an
index from HDFS, because of the poor performance of random reads in HDFS.


On 8/14/08, Anoop Bhatti [EMAIL PROTECTED] wrote:
 I'd like to know if I'm heading down the right path, so my questions are:
 * Has anyone tried searching a distributed Lucene index using a method
 like this before?  It seems too easy.  Are there any gotchas that I
 should look out for as I scale up to more nodes and a larger index?
 * Do you think that going ahead with this approach, which consists of
 1) creating a Lucene index using the  hadoop.contrib.index code
 (thanks, Ning!) and 2) leaving that index in-place on hdfs and
 searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance
concern described above. More importantly, as your index scales out,
there will be multiple shards, and there are the challenges of load
balancing, fault tolerance, etc.


 * What is the status of the bailey project?  It seems to be working on
 the same type of problem. Should I wait until that project comes out
 with code?

There is no timeline for Bailey right now.

Ning


Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
 1) Katta and Distributed Lucene are different projects though, right? Both
 being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

 2) So, I should be able to use the hadoop.contrib.index with HDFS.
 Though, it would be much better if it is integrated with Distributed
 Lucene or the Katta project as these are designed keeping the
 structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Index and see which one suits your
use better.

Ning


Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene
and hadoop.contrib.index are two different things.

For information on hadoop.contrib.index, see the README file in the package.

I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene
at http://katta.wiki.sourceforge.net/.

Ning


On 8/7/08, Deepika Khera [EMAIL PROTECTED] wrote:
 Hey guys,

 I would appreciate any feedback on this

 Deepika

 -Original Message-
 From: Deepika Khera [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 06, 2008 5:39 PM
 To: core-user@hadoop.apache.org
 Subject: Distributed Lucene - from hadoop contrib

 Hi,



 I am planning to use distributed lucene from hadoop.contrib.index for
 indexing. Has anyone used this or tested it? Any issues or comments?



 I see that the design described is different from HDFS (the Namenode is
 stateless and stores no information regarding blocks for files, etc.). Does
 anyone know how hard it will be to set up this kind of system, or is there
 something that can be reused?



 A reference link -



 http://wiki.apache.org/hadoop/DistributedLucene



 Thanks,
 Deepika




Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ning Li
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is that still not what you are
looking for?

Regards,
Ning

On 4/4/08, Aayush Garg [EMAIL PROTECTED] wrote:
 No, currently my requirement is to solve this problem with Apache Hadoop. I am
 trying to build this type of inverted index and then measure performance
 criteria with respect to other approaches.

 Thanks,


 On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning [EMAIL PROTECTED] wrote:

 
  Are you implementing this for instruction or production?
 
  If production, why not use Lucene?
 
 
  On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote:
 
    Hi Amar, Theodore, Arun,

    Thanks for your reply. Actually I am new to Hadoop so I can't figure out much.
    I have written the following code for an inverted index. This code maps each
    word from the document to its document id,
    e.g.: apple -> file1, file123
    The main functions of the code are:
  
    public class HadoopProgram extends Configured implements Tool {
      public static class MapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Text doc = new Text();
        private long numRecords = 0;
        private String inputFile;

        public void configure(JobConf job) {
          System.out.println("Configure function is called");
          inputFile = job.get("map.input.file");
          System.out.println("In conf the input file is " + inputFile);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          doc.set(inputFile);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, doc);
          }
          if (++numRecords % 4 == 0) {
            System.out.println("Finished processing of input file " + inputFile);
          }
        }
      }

      /**
       * A reducer class that collects the list of doc ids for each word.
       */
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, Text, Text, DocIDs> {

        // This works as K2, V2, K3, V3
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, DocIDs> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          Text dummy = new Text();
          ArrayList<String> IDs = new ArrayList<String>();
          String str;

          while (values.hasNext()) {
            dummy = values.next();
            str = dummy.toString();
            IDs.add(str);
          }
          DocIDs dc = new DocIDs();
          dc.setListdocs(IDs);
          output.collect(key, dc);
        }
      }

      public int run(String[] args) throws Exception {
        System.out.println("Run function is called");
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
      }
    }
  
  
    Now I am getting output from the reducer as:
    word -> \root\test\test123, \root\test12

    In the next stage I want to remove stop words, scrub words, etc., and also
    record the position of each word in the document. How would I apply multiple
    maps or multilevel map/reduce jobs programmatically? I guess I need to make
    another class or add some functions to it? I am not able to figure it out.
    Any pointers for this type of problem?
  
   Thanks,
   Aayush
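
On the question of applying multiple map/reduce jobs programmatically, here is
a minimal sketch of chaining two jobs with the old mapred API, where the second
job reads the first job's output; IdentityMapper/IdentityReducer are
placeholders for your own classes (e.g. the MapClass/Reduce above for job 1 and
a stop-word filter for job 2), and the paths are made up:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class ChainedJobs {
    public static void main(String[] args) throws Exception {
      // Job 1: e.g. raw documents -> (word, doc) postings.
      JobConf first = new JobConf(ChainedJobs.class);
      first.setJobName("build-postings");
      first.setMapperClass(IdentityMapper.class);    // swap in your own mapper
      first.setReducerClass(IdentityReducer.class);  // swap in your own reducer
      FileInputFormat.setInputPaths(first, new Path("/input/docs"));     // illustrative
      FileOutputFormat.setOutputPath(first, new Path("/tmp/postings"));  // illustrative
      JobClient.runJob(first);   // blocks until job 1 completes

      // Job 2: reads job 1's output, e.g. stop-word removal / scrubbing.
      JobConf second = new JobConf(ChainedJobs.class);
      second.setJobName("filter-postings");
      second.setMapperClass(IdentityMapper.class);   // replace with a stop-word filter
      second.setReducerClass(IdentityReducer.class);
      FileInputFormat.setInputPaths(second, new Path("/tmp/postings"));
      FileOutputFormat.setOutputPath(second, new Path("/output/filtered"));
      JobClient.runJob(second);
    }
  }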
  
  
   On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat [EMAIL PROTECTED]
  wrote:
  
    On Wed, 26 Mar 2008, Aayush Garg wrote:

     Hi,
     I am developing a simple inverted index program with Hadoop. My map
     function has the output:
     word, doc
     and the reducer has:
     word, list(docs)

     Now I want to use one more mapreduce to remove stop and scrub words from
     this output. Also in the next stage I would like to have a short summary
     associated with every word. How should I design my program from this
     stage? I mean how would I apply multiple mapreduce to this? What would be
     the better way to perform this?

    Use distributed cache as Arun mentioned.

    Whether to use a separate MR job depends on what exactly you mean by
    summary. If it's like a window around the current word then you can
    possibly do it in one go.
    Amar
  
   Thanks,
  
   Regards,
   -
  
  
  
 
 


 --
 Aayush Garg,
 Phone: +41 76 482 240



Re: Nutch and Distributed Lucene

2008-04-01 Thread Ning Li
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is
web search application software that crawls the web, inverts links and
builds indexes. Each step is one or more Map/Reduce jobs. You can find
more information at http://lucene.apache.org/nutch/

The Map/Reduce job to build Lucene indexes in Nutch is customized to
the data schema/structures used in Nutch. The index contrib package in
Hadoop provides a general/configurable process to build Lucene indexes
in parallel using a Map/Reduce job. That's the main difference. There
is also the difference that the index build job in Nutch builds
indexes in reduce tasks, while the index contrib package builds
indexes in both map and reduce tasks and there are advantages in doing
that...

Regards,
Ning


On 4/1/08, Naama Kraus [EMAIL PROTECTED] wrote:
 Hi,

 I'd like to know if Nutch runs on top of Lucene, or whether it is unrelated
 to Lucene, i.e. indexing, parsing, crawling, internal data structures ... -
 all written from scratch using MapReduce (my impression)?

 What is the relation between Nutch and the distributed Lucene patch that was
 inserted lately into Hadoop ?

 Thanks for any enlightening,
 Naama

 --
 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
 00 oo 00 oo
 If you want your children to be intelligent, read them fairy tales. If you
 want them to be more intelligent, read them more fairy tales. (Albert
 Einstein)



A contrib package to build/update a Lucene index

2008-03-10 Thread Ning Li
Hi,

Is there any interest in a contrib package to build/update a Lucene index?

I should have asked the question before creating the JIRA issue and
attaching the patch. In any case, more details can be found at

https://issues.apache.org/jira/browse/HADOOP-2951


Regards,
Ning


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Ning Li
We welcome your input. Discussions are mainly on
[EMAIL PROTECTED] now (a thread with the same title).

On 2/7/08, Dennis Kubes [EMAIL PROTECTED] wrote:
 This is actually something we were planning on building into Nutch.

 Dennis


Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
 a consistent view of the
shards in the index. The results of a search query include either all or none
of a recent update to the index. The details of the algorithm to accomplish
this are omitted here, but the basic flow is pretty simple.

After the Map/Reduce job to update the shards completes, the master will tell
each shard server to prepare the new version of the index. After all the
shard servers have responded affirmatively to the prepare message, the new
index is ready to be queried. An index client will then lazily learn about
the new index when it makes its next getShardLocations() call to the master.
In essence, a lazy two-phase commit protocol is used, with prepare and
commit messages piggybacked on heartbeats. After a shard has switched to
the new index, the Lucene files in the old index that are no longer needed
can safely be deleted.
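
A schematic sketch of that flow on the master side. Only the
prepare/commit/getShardLocations names come from the description above; every
other name, type and signature here is invented for illustration:

  import java.util.List;

  public class TwoPhaseSwitchSketch {

    interface ShardServer {
      boolean prepare(long newVersion);  // open the new index files, don't serve them yet
      void commit(long newVersion);      // switch queries over, clean up old Lucene files
    }

    private volatile long currentVersion;         // version handed out to index clients
    private volatile List<String> currentShards;  // shard locations for that version

    /** Called after the Map/Reduce job that updates the shards has completed. */
    public void publish(long newVersion, List<String> newShards, List<ShardServer> servers) {
      // Phase 1: every shard server must acknowledge the prepare message
      // (in the real design these messages ride on heartbeats).
      for (ShardServer s : servers) {
        if (!s.prepare(newVersion)) {
          return;   // abort the switch; clients keep seeing the old version
        }
      }
      // Phase 2: flip the version the master hands out, then let servers commit.
      currentVersion = newVersion;
      currentShards = newShards;
      for (ShardServer s : servers) {
        s.commit(newVersion);   // old index files can now be deleted safely
      }
    }

    /** Index clients call this lazily and see either all of an update or none of it. */
    public List<String> getShardLocations() {
      return currentShards;
    }
  }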

ACHIEVING FAULT-TOLERANCE
We rely on the fault-tolerance of Map/Reduce to guarantee that an index update
will eventually succeed. All shards are stored in HDFS and can be read by any
shard server in a cluster. For a given shard, if one of its shard servers dies,
new search requests are handled by its surviving shard servers. To ensure that
there is always enough coverage for a shard, the master will instruct other
shard servers to take over the shards of a dead shard server.

PERFORMANCE ISSUES
Currently, each shard server reads a shard directly from HDFS. Experiments
have shown that this approach does not perform very well, with HDFS causing
Lucene to slow down fairly dramatically (by well over 5x when data blocks are
accessed over the network). Consequently, we are exploring different ways to
leverage the fault tolerance of HDFS and, at the same time, work around its
performance problems. One simple alternative is to add a local file system
cache on each shard server. Another alternative is to modify HDFS so that an
application has more control over where to store the primary and replicas of
an HDFS block. This feature may be useful for other HDFS applications (e.g.,
HBase). We would like to collaborate with other people who are interested in
adding this feature to HDFS.


Regards,
Ning Li


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
On 2/6/08, Ted Dunning [EMAIL PROTECTED] wrote:
 Our best work-around is to simply take a shard out of service during delivery
 of an updated index.  This is obviously not a good solution.

How many shard servers are serving each shard? If it's more than one,
you can have the rest of the shard servers sharing the query workload
while one shard server loads a new version of a shard.

We'd like to start an open source project for this (or join if one
already exists) if there is enough interest.