Re: Creating Lucene index in Hadoop
> Lucene on a local disk benefits significantly from the local filesystem's
> RAM cache (aka the kernel's buffer cache). HDFS has no such local RAM cache
> outside of the stream's buffer. The cache would need to be no larger than
> the kernel's buffer cache to get an equivalent hit ratio. And if you're

If the two cache sizes are the same, then yes. Just that the local FS cache
size is adjusted (more?) dynamically.

Cheers,
Ning
Re: Creating Lucene index in Hadoop
1 is good. But for 2:
- Won't it have a security concern as well? Or is this not a general local cache?
- You are referring to caching in RAM, not caching in the local FS, right?
In general, a Lucene index could be quite large. We may have to cache a lot
of data to reach a reasonable hit ratio...

Cheers,
Ning

On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting wrote:
> Ning Li wrote:
>>
>> With
>> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
>> become feasible to search on HDFS directly.
>
> I don't think HADOOP-4801 is required. It would help, certainly, but it's
> so fraught with security and other issues that I doubt it will be committed
> anytime soon.
>
> What would probably help HDFS random access performance for Lucene
> significantly would be:
> 1. A cache of connections to datanodes, so that each seek() does not
> require an open(). If we move HDFS data transfer to be RPC-based (see,
> e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
> for free, since RPC already caches connections. We hope to do this for
> Hadoop 1.0, so that we use a single transport for all of Hadoop's core
> operations, to simplify security.
> 2. A local cache of read-only HDFS data, equivalent to the kernel's buffer
> cache. This might be implemented as a Lucene Directory that keeps an LRU
> cache of buffers from a wrapped filesystem, perhaps a subclass of
> RAMDirectory.
>
> With these, performance would still be slower than a local drive, but
> perhaps not so dramatically.
>
> Doug
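[Editor's note: Doug's item 2 is easy to picture in code. Below is a minimal
sketch - not HADOOP-4801 nor any existing patch - of an IndexInput wrapper
that keeps an LRU cache of fixed-size blocks read from a wrapped Directory.
The class name, block size, and eviction policy are illustrative assumptions,
written against the Lucene 2.x store API.]

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical sketch: serve reads from an LRU block cache, falling
    // back to the wrapped (e.g. HDFS-backed) input only on a miss.
    public class CachingIndexInput extends IndexInput {
      private static final int BLOCK_SIZE = 4096;

      private final IndexInput delegate;    // input from the wrapped Directory
      private final Map<Long, byte[]> lru;  // block number -> block bytes
      private long pos = 0;

      public CachingIndexInput(IndexInput delegate, final int maxBlocks) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap gives a simple LRU policy.
        this.lru = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
          protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
            return size() > maxBlocks;
          }
        };
      }

      public byte readByte() throws IOException {
        byte[] block = blockFor(pos / BLOCK_SIZE);
        return block[(int) (pos++ % BLOCK_SIZE)];
      }

      public void readBytes(byte[] b, int offset, int len) throws IOException {
        for (int i = 0; i < len; i++) b[offset + i] = readByte(); // simple, not fast
      }

      private byte[] blockFor(long blockNo) throws IOException {
        byte[] block = lru.get(blockNo);
        if (block == null) {               // miss: read through the delegate
          long start = blockNo * BLOCK_SIZE;
          int size = (int) Math.min(BLOCK_SIZE, delegate.length() - start);
          block = new byte[BLOCK_SIZE];    // no EOF handling in this sketch
          delegate.seek(start);
          delegate.readBytes(block, 0, size);
          lru.put(blockNo, block);
        }
        return block;
      }

      public long getFilePointer() { return pos; }
      public void seek(long p) { pos = p; } // cheap: no datanode round trip
      public long length() { return delegate.length(); }
      public void close() throws IOException { delegate.close(); }
    }

A full solution would wrap this in a Directory implementation so Lucene picks
it up transparently, which is where Doug's RAMDirectory-subclass idea comes in.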
Re: Creating Lucene index in Hadoop
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.

The Lucene indexes are written to HDFS, but that does not mean you conduct
search on the indexes stored in HDFS directly. HDFS is not designed for
random access. Usually the indexes are copied to the nodes where search
will be served.

With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.

Cheers,
Ning

On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff wrote:
>
> Does anyone have stats on how multiple readers on an optimized Lucene
> index in HDFS compare with a ParallelMultiReader (or whatever it's
> called) over RPC on a local filesystem?
>
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.
>
> Ian
>
> Ning Li writes:
>
>> I should have pointed out that the Nutch index build and contrib/index
>> target different applications. The latter is for applications that
>> simply want to build a Lucene index from a set of documents - e.g., no
>> link analysis.
>>
>> As to writing Lucene indexes, both work the same way - write the final
>> results to the local file system and then copy to HDFS. In contrib/index,
>> the intermediate results are in memory and not written to HDFS.
>>
>> Hope this clarifies things.
>>
>> Cheers,
>> Ning
>>
>>
>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff wrote:
>>>
>>> I understand why you would index in the reduce phase, because the anchor
>>> text gets shuffled to be next to the document. However, when you index
>>> in the map phase, don't you just have to reindex later?
>>>
>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>>> indexes because of how Lucene works. The simple approach is to write
>>> your index outside of HDFS in the reduce phase, and then merge the
>>> indexes from each reducer manually.
>>>
>>> Ian
>>>
>>> Ning Li writes:
>>>
>>>> Or you can check out the index contrib. The difference between the two is:
>>>> - In Nutch's indexing map/reduce job, indexes are built in the
>>>> reduce phase. Afterwards, they are merged into a smaller number of
>>>> shards if necessary. The last time I checked, the merge process does
>>>> not use map/reduce.
>>>> - In contrib/index, small indexes are built in the map phase. They
>>>> are merged into the desired number of shards in the reduce phase. In
>>>> addition, they can be merged into existing shards.
>>>>
>>>> Cheers,
>>>> Ning
>>>>
>>>>
>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
>>>>> you can see the nutch code.
>>>>>
>>>>> 2009/3/13 Mark Kerzner
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>
>>>
>
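[Editor's note: the copy-then-search pattern Ning describes is only a few
lines. A minimal sketch, assuming the Hadoop 0.19-era FileSystem API and
Lucene 2.x; the paths and the "content" field name are hypothetical.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class CopyAndSearch {
      public static void main(String[] args) throws Exception {
        // Pull the shard out of HDFS onto the search node's local disk.
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path shardInHdfs = new Path("/indexes/shard-00000");    // hypothetical
        Path localCopy = new Path("/data/search/shard-00000");  // hypothetical
        hdfs.copyToLocalFile(shardInHdfs, localCopy);

        // Search the local copy; random reads now hit the local filesystem
        // (and the kernel's buffer cache) instead of HDFS.
        IndexSearcher searcher = new IndexSearcher(localCopy.toString());
        Query q = new QueryParser("content", new StandardAnalyzer()).parse("hadoop");
        Hits hits = searcher.search(q);
        System.out.println("hits: " + hits.length());
        searcher.close();
      }
    }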
Re: Creating Lucene index in Hadoop
I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that
simply want to build a Lucene index from a set of documents - e.g., no
link analysis.

As to writing Lucene indexes, both work the same way - write the final
results to the local file system and then copy to HDFS. In contrib/index,
the intermediate results are in memory and not written to HDFS.

Hope this clarifies things.

Cheers,
Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document. However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works. The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li writes:
>
>> Or you can check out the index contrib. The difference between the two is:
>> - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into a smaller number of
>> shards if necessary. The last time I checked, the merge process does
>> not use map/reduce.
>> - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
>>> you can see the nutch code.
>>>
>>> 2009/3/13 Mark Kerzner
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
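[Editor's note: the "write locally, then copy to HDFS" pattern can be
sketched as a reduce task. This is an illustration, not contrib/index's
actual code: the scratch-directory layout, field names, and output path are
assumptions, written against the old org.apache.hadoop.mapred API and
Lucene 2.x.]

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private IndexWriter writer;
      private String localDir;
      private JobConf job;

      public void configure(JobConf job) {
        this.job = job;
        try {
          // Build the shard on the task's local scratch space, not on HDFS.
          localDir = job.get("mapred.local.dir").split(",")[0] + "/lucene-shard";
          writer = new IndexWriter(localDir, new StandardAnalyzer(), true);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          Document doc = new Document();
          doc.add(new Field("id", key.toString(), Field.Store.YES,
                            Field.Index.UN_TOKENIZED));
          doc.add(new Field("content", values.next().toString(), Field.Store.NO,
                            Field.Index.TOKENIZED));
          writer.addDocument(doc);
        }
      }

      public void close() throws IOException {
        // Finish the local index, then copy the whole directory into HDFS.
        writer.optimize();
        writer.close();
        FileSystem hdfs = FileSystem.get(job);
        String shard = "shard-" + job.get("mapred.task.partition", "0");
        hdfs.copyFromLocalFile(new Path(localDir), new Path("/indexes/" + shard));
      }
    }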
Re: Creating Lucene index in Hadoop
Or you can check out the index contrib. The difference between the two is:
- In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process does
not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards.

Cheers,
Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
> you can see the nutch code.
>
> 2009/3/13 Mark Kerzner
>
>> Hi,
>>
>> How do I allow multiple nodes to write to the same index file in HDFS?
>>
>> Thank you,
>> Mark
>>
>
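[Editor's note: the reduce-phase merge described above boils down to
Lucene's addIndexes(). A minimal standalone sketch, assuming Lucene 2.x;
the local paths are hypothetical and error handling is omitted.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIntoShard {
      public static void main(String[] args) throws Exception {
        // Open the shard this reduce task owns. Passing 'false' instead of
        // 'true' would merge into an existing shard rather than a new one.
        IndexWriter shard =
            new IndexWriter("/data/shards/shard-0", new StandardAnalyzer(), true);

        // The small indexes produced by the map tasks for this shard.
        Directory[] smallIndexes = {
          FSDirectory.getDirectory("/tmp/map-index-0"),
          FSDirectory.getDirectory("/tmp/map-index-1"),
        };
        shard.addIndexes(smallIndexes);  // merges segments from all inputs
        shard.optimize();
        shard.close();
      }
    }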
Re: how to optimize mapreduce procedure??
I would agree with Enis. MapReduce is good for batch-building large
indexes, but not for search, which requires real-time response.

Cheers,
Ning

On Fri, Mar 13, 2009 at 10:58 AM, Enis Soztutar wrote:
> ZhiHong Fu wrote:
>>
>> Hello,
>>
>> I'm writing a program which does Lucene searching over
>> about 12 index directories, all of them stored in HDFS. It works
>> like this:
>> 1. We get about 12 index directories through the Lucene index
>> functionality, each of which is about 100M in size.
>> 2. We store these 12 index directories on Hadoop HDFS, and this Hadoop
>> cluster is made up of one namenode and five datanodes, 6 computers in
>> total.
>> 3. Then I do Lucene searching over these 12 index directories.
>> The map/reduce methods are as follows:
>> Map procedure: the 12 index directories are split across
>> numOfMapTasks; for example, if numOfMapTasks=3, then each map gets
>> 4 indexDirs and stores them in an intermediate result.
>> Combine procedure: for an intermediate result locally, we do the
>> actual Lucene search in its containing index directory, and then store
>> the hit results in the intermediate result.
>> Reduce procedure: reduce the intermediate results' hit results, and
>> get the search result.
>>
>> But when I implement it like this, I have a performance problem. I set
>> numOfMapTasks and numOfReduceTasks to various values, such as
>> numOfMapTasks=12, numOfReduceTasks=5, but a simple search
>> spends about 28 seconds, which is obviously unacceptable.
>> So I'm confused whether I got the map-reduce procedure wrong or set the
>> wrong number of map or reduce tasks, and generally where the overhead of
>> the map/reduce procedure takes place. Any suggestion will be
>> appreciated.
>> Thanks.
>>
> Keeping the indexes in HDFS is not the best choice. Moreover, map/reduce
> does not fit the problem of distributed search over several nodes. The
> overhead of starting a new job for every search is not acceptable.
> You can use Nutch distributed search or Katta (not sure about the name)
> for this.
>
Re: Hello, world for Hadoop + Lucene
Sorry for the late reply. You can refer to the test case
TestIndexUpdater.java as an example. It uses the index contrib to build a
Lucene index and verifies it by querying the index built.

Cheers,
Ning

On Wed, Jan 14, 2009 at 12:05 PM, John Howland wrote:
> Howdy!
>
> Is there any sort of "Hello, world!" example for building a Lucene
> index with Hadoop? I am looking through the source in contrib/index
> and it is a bit beyond me at the moment. Alternatively, is there more
> documentation related to the contrib/index example code?
>
> There seems to be a lot of information out on the tubes for how to
> distribute indices and query them (e.g. Katta). Nutch obviously also
> comes up, but it is not clear to me how to come to grips with Nutch,
> and I'm not interested in web crawling. What I'm looking for is a
> simple example for the hadoop/lucene newbie where you can take a
> String or a Text object and index it as a document. If,
> understandably, such an example does not exist, any pointers from the
> experts would be appreciated. I don't care as much about real-world
> usage/performance as I do about pedagogical code which can serve as a
> base for learning, just to give me a toehold.
>
> Many thanks,
>
> John
Re: Distributed Lucene - from hadoop contrib
On 8/12/08, Deepika Khera <[EMAIL PROTECTED]> wrote:
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index
> documents ii) providing search in a distributed fashion, to be all in
> one box.

Ideally, yes. However, while it's good to use map/reduce for batch-building
an index, there is no consensus on whether it's a good idea to serve an
index from HDFS. This is because of the poor performance of random reads
in HDFS.

On 8/14/08, Anoop Bhatti <[EMAIL PROTECTED]> wrote:
> I'd like to know if I'm heading down the right path, so my questions are:
> * Has anyone tried searching a distributed Lucene index using a method
> like this before? It seems too easy. Are there any "gotchas" that I
> should look out for as I scale up to more nodes and a larger index?
> * Do you think that going ahead with this approach, which consists of
> 1) creating a Lucene index using the hadoop.contrib.index code
> (thanks, Ning!) and 2) leaving that index "in-place" on hdfs and
> searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance
concern described above. More importantly, as your index scales out, there
will be multiple shards, and with them come the challenges of load
balancing and fault tolerance, etc.

> * What is the status of the bailey project? It seems to be working on
> the same type of problem. Should I wait until that project comes out
> with code?

There is no timeline for Bailey right now.

Ning
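[Editor's note: Anoop's client code is not reproduced in this thread. For
a rough picture of what "in-place" search over a single shard looks like,
here is a minimal sketch. It assumes contrib/index's FileSystemDirectory
with the constructor signature as in the 0.19-era contrib source, plus a
hypothetical shard path and field name.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.contrib.index.lucene.FileSystemDirectory;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SearchShardOnHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // 'false' = open an existing shard, do not create one.
        FileSystemDirectory dir = new FileSystemDirectory(
            fs, new Path("/indexes/shard-00000"), false, conf);
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("content", "hadoop")));
        System.out.println("hits: " + hits.length());  // every read goes to HDFS
        searcher.close();
      }
    }

Every read here crosses into HDFS, which is exactly where the performance
and, at scale, load-balancing concerns above come from.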
Re: Distributed Lucene - from hadoop contrib
> 1) Katta and Distributed Lucene are different projects though, right? Both
> being based on kind of the same paradigm (Distributed Index)?

The designs of Katta and Distributed Lucene were quite different the last
time I checked. I pointed out the Katta project because you can find the
code for Distributed Lucene there.

> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
> Though, it would be much better if it is integrated with "Distributed
> Lucene" or the "Katta project" as these are designed keeping the
> structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce to
build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the designs
of Katta and Distributed Lucene and see which one suits your use better.

Ning
Re: Distributed Lucene - from hadoop contrib
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index
are two different things. For information on hadoop.contrib.index, see the
README file in the package. I believe you can find the code for
http://wiki.apache.org/hadoop/DistributedLucene at
http://katta.wiki.sourceforge.net/.

Ning

On 8/7/08, Deepika Khera <[EMAIL PROTECTED]> wrote:
> Hey guys,
>
> I would appreciate any feedback on this.
>
> Deepika
>
> -----Original Message-----
> From: Deepika Khera [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 06, 2008 5:39 PM
> To: core-user@hadoop.apache.org
> Subject: Distributed Lucene - from hadoop contrib
>
> Hi,
>
> I am planning to use "distributed lucene" from hadoop.contrib.index for
> indexing. Has anyone used this or tested it? Any issues or comments?
>
> I see that the design described is different from HDFS (the Namenode is
> stateless, stores no information regarding blocks for files, etc.). Does
> anyone know how hard it will be to set up this kind of system, or is
> there something that can be reused?
>
> A reference link:
>
> http://wiki.apache.org/hadoop/DistributedLucene
>
> Thanks,
> Deepika
>
Re: Hadoop: Multiple map reduce or some better way
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not what you are looking for?

Regards,
Ning

On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop.
> I am trying to build up this type of inverted index and then measure
> performance criteria with respect to others.
>
> Thanks,
>
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Amar, Theodore, Arun,
> > >
> > > Thanks for your reply. Actually I am new to Hadoop so I can't figure
> > > out much. I have written the following code for an inverted index.
> > > This code maps each word from the document to its document id.
> > > ex: apple file1 file123
> > > The main functions of the code are:
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > >   public static class MapClass extends MapReduceBase
> > >       implements Mapper<LongWritable, Text, Text, Text> {
> > >
> > >     private final static IntWritable one = new IntWritable(1);
> > >     private Text word = new Text();
> > >     private Text doc = new Text();
> > >     private long numRecords = 0;
> > >     private String inputFile;
> > >
> > >     public void configure(JobConf job) {
> > >       System.out.println("Configure function is called");
> > >       inputFile = job.get("map.input.file");
> > >       System.out.println("In conf the input file is " + inputFile);
> > >     }
> > >
> > >     public void map(LongWritable key, Text value,
> > >                     OutputCollector<Text, Text> output,
> > >                     Reporter reporter) throws IOException {
> > >       String line = value.toString();
> > >       StringTokenizer itr = new StringTokenizer(line);
> > >       doc.set(inputFile);
> > >       while (itr.hasMoreTokens()) {
> > >         word.set(itr.nextToken());
> > >         output.collect(word, doc);
> > >       }
> > >       if (++numRecords % 4 == 0) {
> > >         System.out.println("Finished processing of input file " + inputFile);
> > >       }
> > >     }
> > >   }
> > >
> > >   /**
> > >    * A reducer class that just emits the sum of the input values.
> > >    */
> > >   public static class Reduce extends MapReduceBase
> > >       implements Reducer<Text, Text, Text, DocIDs> {
> > >
> > >     // This works as K2, V2, K3, V3
> > >     public void reduce(Text key, Iterator<Text> values,
> > >                        OutputCollector<Text, DocIDs> output,
> > >                        Reporter reporter) throws IOException {
> > >       int sum = 0;
> > >       Text dummy = new Text();
> > >       ArrayList<String> IDs = new ArrayList<String>();
> > >       String str;
> > >
> > >       while (values.hasNext()) {
> > >         dummy = values.next();
> > >         str = dummy.toString();
> > >         IDs.add(str);
> > >       }
> > >       DocIDs dc = new DocIDs();
> > >       dc.setListdocs(IDs);
> > >       output.collect(key, dc);
> > >     }
> > >   }
> > >
> > >   public int run(String[] args) throws Exception {
> > >     System.out.println("Run function is called");
> > >     JobConf conf = new JobConf(getConf(), WordCount.class);
> > >     conf.setJobName("wordcount");
> > >
> > >     // the keys are words (strings)
> > >     conf.setOutputKeyClass(Text.class);
> > >     conf.setOutputValueClass(Text.class);
> > >
> > >     conf.setMapperClass(MapClass.class);
> > >     conf.setReducerClass(Reduce.class);
> > >   }
> > >
> > > Now I am getting output from the reducer as:
> > > word \root\test\test123, \root\test12
> > >
> > > In the next stage I want to remove 'stop words', scrub words, etc., and
> > > handle the position of the word in the document. How would I apply
> > > multiple maps or multilevel map reduce jobs programmatically? I guess I
> > > need to make another class or add some functions in it? I am not able
> > > to figure it out. Any pointers for these types of problems?
> > >
> > > Thanks,
> > > Aayush
> > >
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> HI,
> > >>> I am developing the simple inverted index program from hadoop. My map
> > >>> function has the output: <word, docID>
> > >>> and the reducer has: <word, list of docIDs>
> > >>>
> > >>> Now I want to use one more mapreduce to remove stop and scrub words from
> > >> Use distributed cache as Arun mentioned.
> > >>> this output. Also in the next stage I would like to have short summary
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If it's like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>> associated with every word. How should I design my program from this
> > >>> stage?
> > >>> I mean how would I apply multiple map
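[Editor's note: on the "multiple map/reduce jobs" question above, the usual
pattern is to run one JobConf to completion and feed its output directory
to the next job. A minimal sketch, assuming the old mapred API; all mapper
and reducer classes and paths here are hypothetical placeholders.]

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        // Job 1: build the raw inverted index (word -> list of doc IDs).
        JobConf job1 = new JobConf(ChainedJobs.class);
        job1.setJobName("inverted-index");
        job1.setMapperClass(InvertedIndexMapper.class);    // hypothetical
        job1.setReducerClass(InvertedIndexReducer.class);  // hypothetical
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, new Path("docs"));
        FileOutputFormat.setOutputPath(job1, new Path("index-raw"));
        JobClient.runJob(job1);  // blocks until job 1 finishes

        // Job 2: read job 1's output and scrub stop words; a distributed
        // cache holding the stop-word list (as Arun and Amar suggest)
        // would plug in here.
        JobConf job2 = new JobConf(ChainedJobs.class);
        job2.setJobName("scrub-stop-words");
        job2.setMapperClass(StopWordMapper.class);         // hypothetical
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, new Path("index-raw"));
        FileOutputFormat.setOutputPath(job2, new Path("index-clean"));
        JobClient.runJob(job2);
      }
    }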
Re: Nutch and Distributed Lucene
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is web
search application software that crawls the web, inverts links, and builds
indexes. Each step is one or more Map/Reduce jobs. You can find more
information at http://lucene.apache.org/nutch/

The Map/Reduce job that builds Lucene indexes in Nutch is customized to the
data schema/structures used in Nutch. The index contrib package in Hadoop
provides a general/configurable process to build Lucene indexes in parallel
using a Map/Reduce job. That's the main difference.

There is also the difference that the index build job in Nutch builds
indexes in reduce tasks, while the index contrib package builds indexes in
both map and reduce tasks, and there are advantages in doing that...

Regards,
Ning

On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'd like to know if Nutch is running on top of Lucene, or is it unrelated
> to Lucene. I.e., indexing, parsing, crawling, internal data structures ... -
> all written from scratch using MapReduce (my impression)?
>
> What is the relation between Nutch and the distributed Lucene patch that
> was added to Hadoop lately?
>
> Thanks for any enlightening,
> Naama
>
> --
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
A contrib package to build/update a Lucene index
Hi,

Is there any interest in a contrib package to build/update a Lucene index?
I should have asked the question before creating the JIRA issue and
attaching the patch. In any case, more details can be found at
https://issues.apache.org/jira/browse/HADOOP-2951

Regards,
Ning
Re: Lucene-based Distributed Index Leveraging Hadoop
We welcome your input. Discussions are mainly on [EMAIL PROTECTED] now (a
thread with the same title).

On 2/7/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> This is actually something we were planning on building into Nutch.
>
> Dennis
Re: Lucene-based Distributed Index Leveraging Hadoop
On 2/6/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Our best work-around is to simply take a shard out of service during
> delivery of an updated index. This is obviously not a good solution.

How many shard servers are serving each shard? If it's more than one, you
can have the rest of the shard servers share the query workload while one
shard server loads a new version of a shard.

We'd like to start an open source project for this (or join one if it
already exists) if there is enough interest.
Lucene-based Distributed Index Leveraging Hadoop
new version of each Lucene instance will share as many files as possible
with the previous version.

ENSURING INDEX CONSISTENCY

At any point in time, an index client always has a consistent view of the
shards in the index. The results of a search query include either all or
none of a recent update to the index. The details of the algorithm to
accomplish this are omitted here, but the basic flow is pretty simple.
After the Map/Reduce job to update the shards completes, the master tells
each shard server to "prepare" the new version of the index. After all the
shard servers have responded affirmatively to the "prepare" message, the
new index is ready to be queried. An index client then learns about the new
index lazily, when it makes its next getShardLocations() call to the
master. In essence, a lazy two-phase commit protocol is used, with
"prepare" and "commit" messages piggybacked on heartbeats. After a shard
has switched to the new index, the Lucene files in the old index that are
no longer needed can safely be deleted.

ACHIEVING FAULT-TOLERANCE

We rely on the fault tolerance of Map/Reduce to guarantee that an index
update will eventually succeed. All shards are stored in HDFS and can be
read by any shard server in a cluster. For a given shard, if one of its
shard servers dies, new search requests are handled by its surviving shard
servers. To ensure that there is always enough coverage for a shard, the
master instructs other shard servers to take over the shards of a dead
shard server.

PERFORMANCE ISSUES

Currently, each shard server reads a shard directly from HDFS. Experiments
have shown that this approach does not perform very well, with HDFS causing
Lucene to slow down fairly dramatically (by well over 5x when data blocks
are accessed over the network). Consequently, we are exploring different
ways to leverage the fault tolerance of HDFS and, at the same time, work
around its performance problems. One simple alternative is to add a local
file system cache on each shard server. Another alternative is to modify
HDFS so that an application has more control over where to store the
primary and replicas of an HDFS block. This feature may be useful for other
HDFS applications (e.g., HBase). We would like to collaborate with other
people who are interested in adding this feature to HDFS.

Regards,
Ning Li
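[Editor's note: for concreteness, the lazy two-phase commit described above
can be reduced to a skeleton. Everything here - the interface, method
names, and flow - is a hypothetical illustration of the prose, not code
from the project.]

    import java.util.List;

    interface ShardServer {
      // Load the new Lucene files while keeping the old version live.
      boolean prepare(long newVersion);
      // Switch incoming queries to the new version.
      void commit(long newVersion);
    }

    class Master {
      void publishNewIndexVersion(List<ShardServer> servers, long newVersion) {
        // Phase 1: every shard server must stage the new version.
        for (ShardServer s : servers) {
          if (!s.prepare(newVersion)) {
            return;  // a failure here leaves clients on the old, consistent index
          }
        }
        // Phase 2: all prepared, so the switch is atomic from a client's
        // point of view; clients learn of it lazily via their next
        // getShardLocations() call to the master.
        for (ShardServer s : servers) {
          s.commit(newVersion);
        }
        // Old Lucene files not shared by the new version can now be deleted.
      }
    }

In the actual design the "prepare" and "commit" messages ride on heartbeats
rather than direct calls, which is what makes the protocol lazy.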