Re: Creating Lucene index in Hadoop
Lucene on a local disk benefits significantly from the local filesystem's RAM cache (i.e. the kernel's buffer cache). HDFS has no such local RAM cache outside of the stream's buffer.

> The cache would need to be no larger than the kernel's buffer cache to get an equivalent hit ratio.

If the two cache sizes are the same, then yes. The difference is that the local FS cache size is adjusted (more?) dynamically.

Cheers,
Ning
Re: Creating Lucene index in Hadoop
> I'm missing why you would ever want the Lucene index in HDFS for reading.

The Lucene indexes are written to HDFS, but that does not mean you conduct search on the indexes stored in HDFS directly. HDFS is not designed for random access. Usually the indexes are copied to the nodes where search will be served. With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly.

Cheers,
Ning

On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

Does anyone have stats on how multiple readers on an optimized Lucene index in HDFS compare with a ParallelMultiReader (or whatever it's called) over RPC on a local filesystem? I'm missing why you would ever want the Lucene index in HDFS for reading.

Ian

Ning Li ning.li...@gmail.com writes:

I should have pointed out that the Nutch index build and contrib/index target different applications. The latter is for applications that simply want to build a Lucene index from a set of documents - e.g., no link analysis. As for writing Lucene indexes, both work the same way - write the final results to the local file system and then copy them to HDFS. In contrib/index, the intermediate results are in memory and not written to HDFS. Hope that clarifies things.

Cheers,
Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

I understand why you would index in the reduce phase, because the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later? The main point to the OP is that HDFS is a bad FS for writing Lucene indexes because of how Lucene works. The simple approach is to write your index outside of HDFS in the reduce phase, and then merge the indexes from each reducer manually.

Ian

Ning Li ning.li...@gmail.com writes:

Or you can check out the index contrib. The difference between the two is that:
- In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into a smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards.

Cheers,
Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:

you can see the nutch code.

2009/3/13 Mark Kerzner markkerz...@gmail.com

Hi, how do I allow multiple nodes to write to the same index file in HDFS?

Thank you,
Mark
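As the thread notes, indexes built in HDFS are usually pulled down to the search nodes before serving, since HDFS random reads are slow. In practice that is `hadoop fs -copyToLocal <shard> <localDir>` (or `FileSystem.copyToLocalFile`); the sketch below uses plain java.nio to stand in for the HDFS client so it is self-contained, and all names and paths are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the "copy the shard to local disk before serving" step.
// A real deployment would use the HDFS client for the copy; plain
// java.nio stands in here so the sketch runs anywhere.
public class ShardFetcher {
    public static void copyDir(Path src, Path dst) throws IOException {
        Files.createDirectories(dst);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                // Lucene index directories are flat, so a shallow copy suffices.
                Files.copy(f, dst.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```

After the copy, the searcher opens the local directory rather than the HDFS path, which restores the benefit of the kernel's buffer cache discussed above.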
Re: Creating Lucene index in Hadoop
1 is good. But for 2:
- Won't it have a security concern as well? Or is this not a general local cache?
- You are referring to caching in RAM, not caching in the local FS, right? In general, a Lucene index could be quite large. We may have to cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning

On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting cutt...@apache.org wrote:

Ning Li wrote:
> With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required. It would help, certainly, but it's so fraught with security and other issues that I doubt it will be committed anytime soon. What would probably help HDFS random access performance for Lucene significantly would be:

1. A cache of connections to datanodes, so that each seek() does not require an open(). If we move HDFS data transfer to be RPC-based (see, e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come for free, since RPC already caches connections. We hope to do this for Hadoop 1.0, so that we use a single transport for all of Hadoop's core operations, to simplify security.

2. A local cache of read-only HDFS data, equivalent to the kernel's buffer cache. This might be implemented as a Lucene Directory that keeps an LRU cache of buffers from a wrapped filesystem, perhaps a subclass of RAMDirectory.

With these, performance would still be slower than a local drive, but perhaps not so dramatically.

Doug
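Doug's second suggestion, an LRU cache of read-only buffers, can be sketched without any Lucene or HDFS dependency. The class below is a hypothetical illustration, not the Directory wrapper he describes: a real version would sit inside a Lucene Directory and fall through to the wrapped filesystem on a miss. `LinkedHashMap` in access order gives the LRU eviction for free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU block cache: a bounded map from (file, blockIndex)
// to the block's bytes, evicting the least-recently-used block when
// full. Names and the keying scheme are assumptions for this sketch.
public class BlockCache {
    private final int maxBlocks;
    private final LinkedHashMap<String, byte[]> blocks;

    public BlockCache(final int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true makes iteration order LRU, so the "eldest"
        // entry is always the least recently used one.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    public byte[] get(String file, long blockIndex) {
        return blocks.get(file + "@" + blockIndex);
    }

    public void put(String file, long blockIndex, byte[] data) {
        blocks.put(file + "@" + blockIndex, data);
    }

    public int size() { return blocks.size(); }
}
```

This also makes Ning's sizing concern concrete: the hit ratio depends entirely on `maxBlocks` relative to the index's working set, which the kernel's buffer cache tunes dynamically but an application-level cache must size by hand.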
Re: Creating Lucene index in Hadoop
Or you can check out the index contrib. The difference between the two is that:
- In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into a smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards.

Cheers,
Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:

you can see the nutch code.

2009/3/13 Mark Kerzner markkerz...@gmail.com

Hi, how do I allow multiple nodes to write to the same index file in HDFS?

Thank you,
Mark
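The contrib/index approach above hinges on routing each map-side mini-index to one of a fixed number of shards: using the shard id as the shuffle key is what delivers an index to the right reducer for merging. The hashing scheme below is illustrative (it mirrors Hadoop's default HashPartitioner behavior) and is not claimed to be the package's actual partitioner.

```java
// Illustrative doc-to-shard routing for a fixed shard count. In
// contrib/index the shard id computed in the map phase becomes the
// shuffle key, so all indexes destined for one shard meet in one
// reduce task, where they are merged.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) { this.numShards = numShards; }

    public int shardFor(String docId) {
        // Mask off the sign bit so negative hash codes still map to a
        // valid shard index, as Hadoop's HashPartitioner does.
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}
```

Because the routing is deterministic, re-running an update job sends a document's postings to the same shard, which is what allows merging into existing shards.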
Re: Distributed Lucene - from hadoop contrib
On 8/12/08, Deepika Khera [EMAIL PROTECTED] wrote:
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index documents ii) providing search in a distributed fashion, to be all in one box.

Ideally, yes. However, while it's good to use map/reduce when batch-building an index, there is no consensus on whether it's a good idea to serve an index from HDFS. This is because of the poor performance of random reads in HDFS.

On 8/14/08, Anoop Bhatti [EMAIL PROTECTED] wrote:
> I'd like to know if I'm heading down the right path, so my questions are:
> * Has anyone tried searching a distributed Lucene index using a method like this before? It seems too easy. Are there any gotchas that I should look out for as I scale up to more nodes and a larger index?
> * Do you think that going ahead with this approach, which consists of 1) creating a Lucene index using the hadoop.contrib.index code (thanks, Ning!) and 2) leaving that index in-place on hdfs and searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance concern described above. More importantly, as your index scales out, there will be multiple shards, and there are the challenges of load balancing, fault tolerance, etc.

> * What is the status of the bailey project? It seems to be working on the same type of problem. Should I wait until that project comes out with code?

There is no timeline for Bailey right now.

Ning
Re: Distributed Lucene - from hadoop contrib
> 1) Katta and Distributed Lucene are different projects though, right? Both being based on kind of the same paradigm (distributed index)?

The designs of Katta and Distributed Lucene were quite different last time I checked. I pointed out the Katta project because you can find the code for Distributed Lucene there.

> 2) So, I should be able to use hadoop.contrib.index with HDFS. Though, it would be much better if it were integrated with Distributed Lucene or the Katta project, as these are designed keeping the structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce to build Lucene instances. It does not contain a component that serves queries. If that's not sufficient for you, you can check out the designs of Katta and Distributed Lucene and see which one suits your use better.

Ning
Re: Distributed Lucene - from hadoop contrib
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index are two different things. For information on hadoop.contrib.index, see the README file in the package. I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene at http://katta.wiki.sourceforge.net/.

Ning

On 8/7/08, Deepika Khera [EMAIL PROTECTED] wrote:

Hey guys, I would appreciate any feedback on this.

Deepika

-----Original Message-----
From: Deepika Khera [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 06, 2008 5:39 PM
To: core-user@hadoop.apache.org
Subject: Distributed Lucene - from hadoop contrib

Hi,

I am planning to use distributed lucene from hadoop.contrib.index for indexing. Has anyone used this or tested it? Any issues or comments?

I see that the design described is different from HDFS (the Namenode is stateless, stores no information regarding blocks for files, etc.). Does anyone know how hard it will be to set up this kind of system, or is there something that can be reused?

A reference link: http://wiki.apache.org/hadoop/DistributedLucene

Thanks,
Deepika
Re: Hadoop: Multiple map reduce or some better way
You can build Lucene indexes using Hadoop Map/Reduce. See the index contrib package in the trunk. Or is it still not something you are looking for?

Regards,
Ning

On 4/4/08, Aayush Garg [EMAIL PROTECTED] wrote:

No, currently my requirement is to solve this problem with Apache Hadoop. I am trying to build up this type of inverted index and then measure performance criteria with respect to others.

Thanks,

On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning [EMAIL PROTECTED] wrote:

Are you implementing this for instruction or production? If production, why not use Lucene?

On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote:

Hi Amar, Theodore, Arun,

Thanks for your reply. Actually I am new to Hadoop so I can't figure out much. I have written the following code for an inverted index. This code maps each word from the document to its document id, e.g.: apple -> file1, file123.

The main functions of the code are:

  public class HadoopProgram extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      private Text doc = new Text();
      private long numRecords = 0;
      private String inputFile;

      public void configure(JobConf job) {
        System.out.println("Configure function is called");
        inputFile = job.get("map.input.file");
        System.out.println("In conf the input file is " + inputFile);
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        doc.set(inputFile);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, doc);
        }
        if (++numRecords % 4 == 0) {
          System.out.println("Finished processing of input file " + inputFile);
        }
      }
    }

    /**
     * A reducer class that collects the doc ids for each word.
     */
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, DocIDs> {  // works as K2, V2, K3, V3

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, DocIDs> output, Reporter reporter)
          throws IOException {
        ArrayList<String> ids = new ArrayList<String>();
        while (values.hasNext()) {
          ids.add(values.next().toString());
        }
        DocIDs dc = new DocIDs();
        dc.setListdocs(ids);
        output.collect(key, dc);
      }
    }

    public int run(String[] args) throws Exception {
      System.out.println("Run function is called");
      JobConf conf = new JobConf(getConf(), WordCount.class);
      conf.setJobName("wordcount");

      // the keys are words (strings)
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(MapClass.class);
      conf.setReducerClass(Reduce.class);
    }
  }

Now I am getting output from the reducer like:

  word \root\test\test123, \root\test12

In the next stage I want to remove stop words, scrub words, etc., and also record the position of each word in the document. How would I apply multiple maps or multilevel map/reduce jobs programmatically? I guess I need to make another class or add some functions to it? I am not able to figure it out. Any pointers for these types of problems?

Thanks,
Aayush

On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat [EMAIL PROTECTED] wrote:

On Wed, 26 Mar 2008, Aayush Garg wrote:

> Hi, I am developing a simple inverted index program with Hadoop. My map function has the output (word, doc) and the reducer has (word, list(docs)). Now I want to use one more map/reduce to remove stop and scrub words from this output.

Use the distributed cache as Arun mentioned.

> Also in the next stage I would like to have a short summary associated with every word.

Whether to use a separate MR job depends on what exactly you mean by summary. If it's like a window around the current word then you can possibly do it in one go.

Amar

> How should I design my program from this stage? I mean, how would I apply multiple map/reduce jobs to this? What would be the better way to perform this?

Thanks,
Regards,
Aayush Garg, Phone: +41 76 482 240
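The follow-up pass Aayush asks about is usually just a second job whose mapper reads the first job's output and drops unwanted postings; chaining amounts to running two JobConfs sequentially with the first job's output directory as the second's input. The sketch below models only the filtering logic of that second map step in plain Java so it is self-contained; the class name, the pair representation, and the stop word list are all assumptions of this sketch, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of the second-pass map logic: keep only (word, docId) pairs
// whose word is not a stop word. In a real chain this logic would live
// in another Mapper's map() method, with the stop word list shipped to
// the tasks via the distributed cache, as Amar suggests.
public class StopWordFilter {
    private final Set<String> stopWords;

    public StopWordFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Each posting is a {word, docId} pair; returns the pairs to re-emit.
    public List<String[]> filter(List<String[]> postings) {
        List<String[]> kept = new ArrayList<String[]>();
        for (String[] p : postings) {
            if (!stopWords.contains(p[0].toLowerCase(Locale.ROOT))) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

Driving two jobs is then just calling JobClient.runJob twice with an intermediate path between them; no extra framework machinery is needed.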
Re: Nutch and Distributed Lucene
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is web search application software that crawls the web, inverts links and builds indexes. Each step is one or more Map/Reduce jobs. You can find more information at http://lucene.apache.org/nutch/

The Map/Reduce job to build Lucene indexes in Nutch is customized to the data schema/structures used in Nutch. The index contrib package in Hadoop provides a general/configurable process to build Lucene indexes in parallel using a Map/Reduce job. That's the main difference. There is also the difference that the index build job in Nutch builds indexes in reduce tasks, while the index contrib package builds indexes in both map and reduce tasks, and there are advantages in doing that...

Regards,
Ning

On 4/1/08, Naama Kraus [EMAIL PROTECTED] wrote:

Hi,

I'd like to know if Nutch runs on top of Lucene, or is it unrelated to Lucene? I.e., are indexing, parsing, crawling, internal data structures ... all written from scratch using MapReduce (my impression)? What is the relation between Nutch and the distributed Lucene patch that was added lately to Hadoop?

Thanks for any enlightening,
Naama
A contrib package to build/update a Lucene index
Hi,

Is there any interest in a contrib package to build/update a Lucene index? I should have asked the question before creating the JIRA issue and attaching the patch. In any case, more details can be found at https://issues.apache.org/jira/browse/HADOOP-2951

Regards,
Ning
Re: Lucene-based Distributed Index Leveraging Hadoop
We welcome your input. Discussions are mainly on [EMAIL PROTECTED] now (a thread with the same title).

On 2/7/08, Dennis Kubes [EMAIL PROTECTED] wrote:

This is actually something we were planning on building into Nutch.

Dennis
Lucene-based Distributed Index Leveraging Hadoop
a consistent view of the shards in the index. The results of a search query include either all or none of a recent update to the index. The details of the algorithm to accomplish this are omitted here, but the basic flow is pretty simple. After the Map/Reduce job to update the shards completes, the master tells each shard server to prepare the new version of the index. After all the shard servers have responded affirmatively to the prepare message, the new index is ready to be queried. An index client then learns about the new index lazily, when it makes its next getShardLocations() call to the master. In essence, a lazy two-phase commit protocol is used, with prepare and commit messages piggybacked on heartbeats. After a shard has switched to the new index, the Lucene files in the old index that are no longer needed can safely be deleted.

ACHIEVING FAULT-TOLERANCE

We rely on the fault tolerance of Map/Reduce to guarantee that an index update will eventually succeed. All shards are stored in HDFS and can be read by any shard server in a cluster. For a given shard, if one of its shard servers dies, new search requests are handled by its surviving shard servers. To ensure that there is always enough coverage for a shard, the master instructs other shard servers to take over the shards of a dead shard server.

PERFORMANCE ISSUES

Currently, each shard server reads a shard directly from HDFS. Experiments have shown that this approach does not perform very well, with HDFS causing Lucene to slow down fairly dramatically (by well over 5x when data blocks are accessed over the network). Consequently, we are exploring different ways to leverage the fault tolerance of HDFS and, at the same time, work around its performance problems. One simple alternative is to add a local file system cache on each shard server. Another alternative is to modify HDFS so that an application has more control over where to store the primary and replicas of an HDFS block. This feature may be useful for other HDFS applications (e.g., HBase).

We would like to collaborate with other people who are interested in adding this feature to HDFS.

Regards,
Ning Li
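The lazy two-phase commit described above can be modeled in a few lines: the master records prepare acknowledgements (piggybacked on heartbeats), and only once every shard server has acked does the version handed out by getShardLocations() advance. The class and method names below are hypothetical; this models only the version switch, not the messaging.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the master's side of the lazy two-phase commit: clients keep
// seeing the old index version until every shard server has acknowledged
// the prepare message for the new one. All names here are assumptions.
public class ShardCommit {
    private final int numServers;
    private final Set<Integer> prepared = new HashSet<Integer>();
    private long servedVersion;   // version returned to index clients
    private long pendingVersion = -1;

    public ShardCommit(int numServers, long initialVersion) {
        this.numServers = numServers;
        this.servedVersion = initialVersion;
    }

    // Master starts a commit round after the index-update job completes.
    public void prepare(long newVersion) {
        pendingVersion = newVersion;
        prepared.clear();
    }

    // Called when a shard server acks the prepare (on its heartbeat).
    // The switch is atomic from the clients' point of view: it happens
    // only when the last server has acked.
    public void ack(int serverId) {
        prepared.add(serverId);
        if (prepared.size() == numServers) {
            servedVersion = pendingVersion;
        }
    }

    public long getServedVersion() { return servedVersion; }
}
```

Since a client only learns the new version on its next getShardLocations() call, queries in flight during the switch still see a consistent, all-or-none view of the update.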
Re: Lucene-based Distributed Index Leveraging Hadoop
On 2/6/08, Ted Dunning [EMAIL PROTECTED] wrote:
> Our best work-around is to simply take a shard out of service during delivery of an updated index. This is obviously not a good solution.

How many shard servers are serving each shard? If it's more than one, you can have the rest of the shard servers share the query workload while one shard server loads a new version of the shard.

We'd like to start an open source project for this (or join one if it already exists) if there is enough interest.