Re: Creating Lucene index in Hadoop

2009-03-17 Thread Ning Li
> Lucene on a local disk benefits significantly from the local filesystem's
> RAM cache (aka the kernel's buffer cache).  HDFS has no such local RAM cache
> outside of the stream's buffer.  The cache would need to be no larger than
> the kernel's buffer cache to get an equivalent hit ratio.  And if you're

If the two cache sizes are the same, then yes. The difference is that the
local FS cache size is adjusted (perhaps more) dynamically.


Cheers,
Ning


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
1 is good. But for 2:
  - Won't it have a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in the local FS,
right? In general, a Lucene index can be quite large. We may have to
cache a lot of data to reach a reasonable hit ratio (see the sketch
below)...
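
For what it's worth, a very rough sketch of the kind of read-only LRU
buffer cache described in item 2 of the message quoted below (the class,
interface, and method names here are invented for illustration; a real
version would sit behind a Lucene Directory/IndexInput and delegate
cache misses to an HDFS input stream):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a fixed-capacity, read-only LRU cache of file blocks.
public class LruBlockCache {

  // Hypothetical hook into the wrapped filesystem (e.g. an HDFS stream).
  public interface BlockReader {
    byte[] readBlock(String file, long blockIndex) throws IOException;
  }

  private final BlockReader reader;
  // Access-ordered map: iteration order is least-recently-used first.
  private final LinkedHashMap<String, byte[]> cache;

  public LruBlockCache(final int maxBlocks, BlockReader reader) {
    this.reader = reader;
    this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;   // evict the LRU block when over capacity
      }
    };
  }

  // Return the requested block, reading through the cache on a miss.
  public synchronized byte[] getBlock(String file, long blockIndex)
      throws IOException {
    String key = file + "@" + blockIndex;
    byte[] block = cache.get(key);
    if (block == null) {
      block = reader.readBlock(file, blockIndex);  // miss: go to the wrapped FS
      cache.put(key, block);
    }
    return block;
  }
}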

Cheers,
Ning


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting  wrote:
> Ning Li wrote:
>>
>> With
>> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
>> become feasible to search on HDFS directly.
>
> I don't think HADOOP-4801 is required.  It would help, certainly, but it's
> so fraught with security and other issues that I doubt it will be committed
> anytime soon.
>
> What would probably help HDFS random access performance for Lucene
> significantly would be:
>  1. A cache of connections to datanodes, so that each seek() does not
> require an open().  If we move HDFS data transfer to be RPC-based (see,
> e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
> for free, since RPC already caches connections.  We hope to do this for
> Hadoop 1.0, so that we use a single transport for all Hadoop's core
> operations, to simplify security.
>  2. A local cache of read-only HDFS data, equivalent to kernel's buffer
> cache.  This might be implemented as a Lucene Directory that keeps an LRU
> cache of buffers from a wrapped filesystem, perhaps a subclass of
> RAMDirectory.
>
> With these, performance would still be slower than a local drive, but
> perhaps not so dramatically.
>
> Doug
>


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.

The Lucene indexes are written to HDFS, but that does not mean you
search the indexes stored in HDFS directly. HDFS is not designed for
random access. Usually the indexes are copied to the nodes where
search will be served. With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.
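
For example, copying a finished shard out of HDFS onto a search node's
local disk is straightforward with the FileSystem API. A minimal sketch
(the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copy an index shard from HDFS to the local disk of a search node.
public class FetchShard {
  public static void main(String[] args) throws Exception {
    FileSystem hdfs = FileSystem.get(new Configuration());
    Path src = new Path("/indexes/shard-00000");     // shard directory in HDFS
    Path dst = new Path("/var/search/shard-00000");  // local destination
    hdfs.copyToLocalFile(src, dst);                  // recursive copy
  }
}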

Cheers,
Ning


On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff  wrote:
>
> Does anyone have stats on how multiple readers on an optimized Lucene
> index in HDFS compare with a ParallelMultiReader (or whatever it's
> called) over RPC on a local filesystem?
>
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.
>
> Ian
>
> Ning Li  writes:
>
>> I should have pointed out that Nutch index build and contrib/index
>> targets different applications. The latter is for applications who
>> simply want to build Lucene index from a set of documents - e.g. no
>> link analysis.
>>
>> As to writing Lucene indexes, both work the same way - write the final
>> results to local file system and then copy to HDFS. In contrib/index,
>> the intermediate results are in memory and not written to HDFS.
>>
>> Hope it clarifies things.
>>
>> Cheers,
>> Ning
>>
>>
>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff  wrote:
>>>
>>> I understand why you would index in the reduce phase, because the anchor
>>> text gets shuffled to be next to the document.  However, when you index
>>> in the map phase, don't you just have to reindex later?
>>>
>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>>> indexes because of how Lucene works.  The simple approach is to write
>>> your index outside of HDFS in the reduce phase, and then merge the
>>> indexes from each reducer manually.
>>>
>>> Ian
>>>
>>> Ning Li  writes:
>>>
>>>> Or you can check out the index contrib. The difference of the two is that:
>>>>   - In Nutch's indexing map/reduce job, indexes are built in the
>>>> reduce phase. Afterwards, they are merged into smaller number of
>>>> shards if necessary. The last time I checked, the merge process does
>>>> not use map/reduce.
>>>>   - In contrib/index, small indexes are built in the map phase. They
>>>> are merged into the desired number of shards in the reduce phase. In
>>>> addition, they can be merged into existing shards.
>>>>
>>>> Cheers,
>>>> Ning
>>>>
>>>>
>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝  wrote:
>>>>> you can see the nutch code.
>>>>>
>>>>> 2009/3/13 Mark Kerzner 
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
>>>>>>
>>>>>
>>>
>>>
>
>


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that
simply want to build a Lucene index from a set of documents - e.g.,
with no link analysis.

As to writing Lucene indexes, both work the same way - they write the
final results to the local file system and then copy them to HDFS. In
contrib/index, the intermediate results are kept in memory and not
written to HDFS.

Hope it clarifies things.

Cheers,
Ning


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff  wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document.  However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works.  The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li  writes:
>
>> Or you can check out the index contrib. The difference of the two is that:
>>   - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into smaller number of
>> shards if necessary. The last time I checked, the merge process does
>> not use map/reduce.
>>   - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝  wrote:
>>> you can see the nutch code.
>>>
>>> 2009/3/13 Mark Kerzner 
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
>


Re: Creating Lucene index in Hadoop

2009-03-13 Thread Ning Li
Or you can check out the index contrib package. The difference between
the two is that:
  - In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process did
not use map/reduce.
  - In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards (see the sketch
below).
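
To make that flow concrete, here is a condensed sketch of the two steps
using the Lucene 2.x-era API. This is not the actual contrib/index code:
class, method, and field names are placeholders, and the Map/Reduce
plumbing that ships the intermediate in-memory indexes from map tasks to
reduce tasks is omitted.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class IndexBuildSketch {

  // Map side: index one task's (id, text) pairs into a small RAM-only index.
  static Directory buildSmallIndex(String[] ids, String[] texts) throws Exception {
    RAMDirectory ram = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
    for (int i = 0; i < ids.length; i++) {
      Document doc = new Document();
      doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("content", texts[i], Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();
    return ram;
  }

  // Reduce side: merge all small indexes routed to one shard into its index.
  // create=false assumes the shard already exists (merging into an existing shard).
  static void mergeIntoShard(String shardPath, Directory[] smallIndexes) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(shardPath),
                                         new StandardAnalyzer(), false);
    writer.addIndexes(smallIndexes);
    writer.optimize();
    writer.close();
  }
}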

Cheers,
Ning


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝  wrote:
> you can see the nutch code.
>
> 2009/3/13 Mark Kerzner 
>
>> Hi,
>>
>> How do I allow multiple nodes to write to the same index file in HDFS?
>>
>> Thank you,
>> Mark
>>
>


Re: how to optimize mapreduce procedure??

2009-03-13 Thread Ning Li
I would agree with Enis. MapReduce is good for batch-building large
indexes, but not for search, which requires real-time response.

Cheers,
Ning


On Fri, Mar 13, 2009 at 10:58 AM, Enis Soztutar  wrote:
> ZhiHong Fu wrote:
>>
>> Hello,
>>
>>           I'm writing a program that performs Lucene searching over
>> about 12 index directories, all of which are stored in HDFS. It works
>> like this:
>> 1. We build about 12 index directories using Lucene's indexing
>> functionality, each of which is about 100 MB in size.
>> 2. We store these 12 index directories on HDFS; the Hadoop cluster
>> is made up of one namenode and five datanodes, six machines in total.
>> 3. Then I run Lucene searches over these 12 index directories.
>> The map/reduce steps are as follows:
>>    Map: the 12 index directories are split among numOfMapTasks; for
>> example, if numOfMapTasks=3, each map task gets 4 index directories
>> and stores them in an intermediate result.
>>    Combine: for each intermediate result, locally, we run the actual
>> Lucene search over its index directories and store the hits in the
>> intermediate result.
>>    Reduce: the intermediate results' hits are reduced into the final
>> search result.
>>
>> But with this implementation I have a performance problem. Whatever
>> values I set numOfMapTasks and numOfReduceTasks to, such as
>> numOfMapTasks=12 and numOfReduceTasks=5, a simple search takes about
>> 28 seconds, which is obviously unacceptable.
>> So I'm not sure whether I designed the map-reduce procedure wrongly or
>> set the wrong number of map or reduce tasks, and, in general, where
>> the overhead of the map/reduce procedure comes from. Any suggestion
>> will be appreciated.
>> Thanks.
>>
>
> Keeping the indexes in HDFS is not the best choice. Moreover, map/reduce
> does not fit the problem of distributed search over several nodes. The
> overhead of starting a new job for every search is not acceptable.
> You can use Nutch distributed search or Katta (not sure about the name)
> for this.
>


Re: Hello, world for Hadoop + Lucene

2009-03-13 Thread Ning Li
Sorry for the late reply. You can refer to the test case
TestIndexUpdater.java as an example. It uses the index contrib package to
build a Lucene index and verifies the result by querying the index it built.
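
Until a proper example is written up, the smallest "hello world" for the
Lucene half is roughly the following (plain Lucene 2.x-era API, no
Hadoop; the field name and text are arbitrary). It indexes one string as
a Document and then verifies it by querying, which is the same idea the
test case uses:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class HelloLucene {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();

    // Turn a value (e.g. the contents of a Hadoop Text) into a Document.
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("content", "hello hadoop lucene",
                      Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Verify by querying the index just built.
    IndexSearcher searcher = new IndexSearcher(dir);
    Hits hits = searcher.search(new TermQuery(new Term("content", "lucene")));
    System.out.println("hits: " + hits.length());
    searcher.close();
  }
}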

Cheers,
Ning


On Wed, Jan 14, 2009 at 12:05 PM, John Howland  wrote:
> Howdy!
>
> Is there any sort of "Hello, world!" example for building a Lucene
> index with Hadoop? I am looking through the source in contrib/index
> and it is a bit beyond me at the moment. Alternatively, is there more
> documentation related to the contrib/index example code?
>
> There seems to be a lot of information out on the tubes for how to
> distribute indices and query them (e.g. Katta). Nutch obviously also
> comes up, but it is not clear to me how to come to grips with Nutch
> and I'm not interested in web crawling. What I'm looking for is a
> simple example for the hadoop/lucene newbie where you can take a
> String or a Text object and index it as a document. If,
> understandably, such an example does not exist, any pointers from the
> experts would be appreciated. I don't care as much about real world
> usage/performance, as I do about pedagogical code which can serve as a
> base for learning, just to give me a toehold.
>
> Many thanks,
>
> John
>


Re: Distributed Lucene - from hadoop contrib

2008-08-18 Thread Ning Li
On 8/12/08, Deepika Khera <[EMAIL PROTECTED]> wrote:
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index
> documents and ii) providing search in a distributed fashion, to be all in
> one box.

Ideally, yes. However, while it's good to use map/reduce when
batch-building an index, there is no consensus on whether it's a good
idea to serve an index from HDFS. This is because of the poor
performance of random reads in HDFS.


On 8/14/08, Anoop Bhatti <[EMAIL PROTECTED]> wrote:
> I'd like to know if I'm heading down the right path, so my questions are:
> * Has anyone tried searching a distributed Lucene index using a method
> like this before?  It seems too easy.  Are there any "gotchas" that I
> should look out for as I scale up to more nodes and a larger index?
> * Do you think that going ahead with this approach, which consists of
> 1) creating a Lucene index using the  hadoop.contrib.index code
> (thanks, Ning!) and 2) leaving that index "in-place" on hdfs and
> searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance
concern described above. More importantly, as your index scales out,
there will be multiple shards, and there are the challenges of load
balancing, fault tolerance, etc.


> * What is the status of the bailey project?  It seems to be working on
> the same type of problem. Should I wait until that project comes out
> with code?

There is no timeline for Bailey right now.

Ning


Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
> 1) Katta and Distributed Lucene are different projects though, right? Both
> are based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
> Though, it would be much better if it is integrated with "Distributed
> Lucene" or the "Katta project" as these are designed keeping the
> structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Index and see which one suits your
use better.

Ning


Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene
and hadoop.contrib.index are two different things.

For information on hadoop.contrib.index, see the README file in the package.

I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene
at http://katta.wiki.sourceforge.net/.

Ning


On 8/7/08, Deepika Khera <[EMAIL PROTECTED]> wrote:
> Hey guys,
>
> I would appreciate any feedback on this
>
> Deepika
>
> -Original Message-
> From: Deepika Khera [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 06, 2008 5:39 PM
> To: core-user@hadoop.apache.org
> Subject: Distributed Lucene - from hadoop contrib
>
> Hi,
>
>
>
> I am planning to use "distributed lucene" from hadoop.contrib.index for
> indexing. Has anyone used this or tested it? Any issues or comments?
>
>
>
> I see that the design described is different from HDFS (the Namenode is
> stateless, stores no information regarding blocks for files, etc.). Does
> anyone know how hard it will be to set up this kind of system, or is there
> something that can be reused?
>
>
>
> A reference link -
>
>
>
> http://wiki.apache.org/hadoop/DistributedLucene
>
>
>
> Thanks,
> Deepika
>
>


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ning Li
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?
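
On the "multiple map/reduce jobs" part of the question quoted further
below: with the old mapred API you simply run the jobs one after
another, feeding the output path of one job to the input of the next.
A rough sketch (class names, mappers/reducers, and paths are
placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // Job 1: build the raw inverted index.
    JobConf job1 = new JobConf(ChainedJobs.class);
    job1.setJobName("build-inverted-index");
    // job1.setMapperClass(...); job1.setReducerClass(...);
    FileInputFormat.setInputPaths(job1, new Path("/input/docs"));
    FileOutputFormat.setOutputPath(job1, new Path("/tmp/inverted"));
    JobClient.runJob(job1);                 // blocks until job 1 finishes

    // Job 2: consume job 1's output to drop stop words, add positions, etc.
    JobConf job2 = new JobConf(ChainedJobs.class);
    job2.setJobName("filter-and-position");
    // job2.setMapperClass(...); job2.setReducerClass(...);
    FileInputFormat.setInputPaths(job2, new Path("/tmp/inverted"));
    FileOutputFormat.setOutputPath(job2, new Path("/output/final"));
    JobClient.runJob(job2);
  }
}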

Regards,
Ning

On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop. I am
> trying to build up this type of inverted index and then measure performance
> criteria with respect to others.
>
> Thanks,
>
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Amar, Theodore, Arun,
> > >
> > > Thanks for your reply. Actually I am new to Hadoop so I can't figure
> > > out much.
> > > I have written the following code for an inverted index. This code maps
> > > each word from the document to its document id,
> > > e.g.: apple file1 file123
> > > Main functions of the code are:
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > > public static class MapClass extends MapReduceBase
> > > implements Mapper<LongWritable, Text, Text, Text> {
> > >
> > > private final static IntWritable one = new IntWritable(1);
> > > private Text word = new Text();
> > > private Text doc = new Text();
> > > private long numRecords=0;
> > > private String inputFile;
> > >
> > >public void configure(JobConf job){
> > > System.out.println("Configure function is called");
> > > inputFile = job.get("map.input.file");
> > > System.out.println("In conf the input file is"+inputFile);
> > > }
> > >
> > >
> > > public void map(LongWritable key, Text value,
> > > OutputCollector<Text, Text> output,
> > > Reporter reporter) throws IOException {
> > >   String line = value.toString();
> > >   StringTokenizer itr = new StringTokenizer(line);
> > >   doc.set(inputFile);
> > >   while (itr.hasMoreTokens()) {
> > > word.set(itr.nextToken());
> > > output.collect(word,doc);
> > >   }
> > >   if(++numRecords%4==0){
> > >    System.out.println("Finished processing of input file"+inputFile);
> > >  }
> > > }
> > >   }
> > >
> > >   /**
> > >* A reducer class that just emits the sum of the input values.
> > >*/
> > >   public static class Reduce extends MapReduceBase
> > > implements Reducer<Text, Text, Text, DocIDs> {
> > >
> > >   // This works as K2, V2, K3, V3
> > > public void reduce(Text key, Iterator<Text> values,
> > >    OutputCollector<Text, DocIDs> output,
> > >    Reporter reporter) throws IOException {
> > >   int sum = 0;
> > >   Text dummy = new Text();
> > >   ArrayList<String> IDs = new ArrayList<String>();
> > >   String str;
> > >
> > >   while (values.hasNext()) {
> > >  dummy = values.next();
> > >  str = dummy.toString();
> > >  IDs.add(str);
> > >}
> > >DocIDs dc = new DocIDs();
> > >dc.setListdocs(IDs);
> > >   output.collect(key,dc);
> > > }
> > >   }
> > >
> > >  public int run(String[] args) throws Exception {
> > >   System.out.println("Run function is called");
> > > JobConf conf = new JobConf(getConf(), WordCount.class);
> > > conf.setJobName("wordcount");
> > >
> > > // the keys are words (strings)
> > > conf.setOutputKeyClass(Text.class);
> > >
> > > conf.setOutputValueClass(Text.class);
> > >
> > >
> > > conf.setMapperClass(MapClass.class);
> > >
> > > conf.setReducerClass(Reduce.class);
> > > }
> > >
> > >
> > > Now I am getting the output from the reducer as:
> > > word \root\test\test123, \root\test12
> > >
> > > In the next stage I want to remove stop words, scrub words etc., and
> > > I would also like the position of the word in the document. How would I
> > > apply multiple maps or multilevel map/reduce jobs programmatically? I
> > > guess I need to make another class or add some functions to it? I am
> > > not able to figure it out. Any pointers for these types of problems?
> > >
> > > Thanks,
> > > Aayush
> > >
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
> > wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> Hi,
> > >>> I am developing a simple inverted index program on Hadoop. My map
> > >>> function has the output:
> > >>> <word, docID>
> > >>> and the reducer has:
> > >>> <word, list of docIDs>
> > >>>
> > >>> Now I want to use one more mapreduce to remove stop and scrub words
> > >>> from this output.
> > >> Use distributed cache as Arun mentioned.
> > >>> Also in the next stage I would like to have a short summary
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If it's like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>> associated with every word. How should I design my program from this
> > >>> stage?
> > >>> I mean how would I apply multiple map

Re: Nutch and Distributed Lucene

2008-04-01 Thread Ning Li
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is
web search application software that crawls the web, inverts links and
builds indexes. Each step is one or more Map/Reduce jobs. You can find
more information at http://lucene.apache.org/nutch/

The Map/Reduce job to build Lucene indexes in Nutch is customized to
the data schema/structures used in Nutch. The index contrib package in
Hadoop provides a general/configurable process to build Lucene indexes
in parallel using a Map/Reduce job. That's the main difference. There
is also the difference that the index build job in Nutch builds
indexes in reduce tasks, while the index contrib package builds
indexes in both map and reduce tasks and there are advantages in doing
that...

Regards,
Ning


On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'd like to know if Nutch runs on top of Lucene, or if it is unrelated
> to Lucene, i.e. indexing, parsing, crawling, internal data structures ... -
> all written from scratch using MapReduce (my impression)?
>
> What is the relation between Nutch and the distributed Lucene patch that was
> recently added to Hadoop?
>
> Thanks for any enlightening,
> Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
>


A contrib package to build/update a Lucene index

2008-03-10 Thread Ning Li
Hi,

Is there any interest in a contrib package to build/update a Lucene index?

I should have asked the question before creating the JIRA issue and
attaching the patch. In any case, more details can be found at

https://issues.apache.org/jira/browse/HADOOP-2951


Regards,
Ning


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Ning Li
We welcome your input. Discussions are mainly on
[EMAIL PROTECTED] now (a thread with the same title).

On 2/7/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> This is actually something we were planning on building into Nutch.
>
> Dennis


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
On 2/6/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Our best work-around is to simply take a shard out of service during delivery
> of an updated index.  This is obviously not a good solution.

How many shard servers are serving each shard? If it's more than one,
you can have the rest of the shard servers sharing the query workload
while one shard server loads a new version of a shard.

We'd like to start an open source project for this (or join if one
already exists) if there is enough interest.


Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
new version
of each Lucene instance will share as many files as possible with the previous
version.

ENSURING INDEX CONSISTENCY
At any point in time, an index client always has a consistent view of the
shards in the index. The results of a search query include either all or none
of a recent update to the index. The details of the algorithm to accomplish
this are omitted here, but the basic flow is pretty simple.

After the Map/Reduce job to update the shards completes, the master will tell
each shard server to "prepare" the new version of the index. After all the
shard servers have responded affirmatively to the "prepare" message, the new
index is ready to be queried. An index client will then lazily learn about
the new index when it makes its next getShardLocations() call to the master.
In essence, a lazy two-phase commit protocol is used, with "prepare" and
"commit" messages piggybacked on heartbeats. After a shard has switched to
the new index, the Lucene files in the old index that are no longer needed
can safely be deleted.
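
To make the flow concrete, the master's side of this lazy two-phase
commit might look roughly like the sketch below. All interfaces and
method names are invented for illustration; in the actual design the
prepare/commit messages are piggybacked on heartbeats rather than made
as direct calls, and failures need timeouts and retries.

import java.util.List;

public class IndexMasterSketch {

  // Hypothetical view of one shard server as seen from the master.
  public interface ShardServer {
    boolean prepare(long newVersion);  // load new files, keep serving the old version
    void commit(long newVersion);      // switch queries to the new version
  }

  // Called after the Map/Reduce job that updates the shards completes.
  public long publishNewVersion(List<ShardServer> servers, long newVersion) {
    // Phase 1: every shard server must have the new version ready.
    for (ShardServer s : servers) {
      if (!s.prepare(newVersion)) {
        return -1;  // abort: clients keep seeing the old, consistent version
      }
    }
    // Phase 2: switch. Clients learn of the new version lazily on their
    // next getShardLocations() call, so they see all of the update or none.
    for (ShardServer s : servers) {
      s.commit(newVersion);
    }
    return newVersion;  // old files can be deleted once no longer in use
  }
}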

ACHIEVING FAULT-TOLERANCE
We rely on the fault-tolerance of Map/Reduce to guarantee that an index update
will eventually succeed. All shards are stored in HDFS and can be read by any
shard server in a cluster. For a given shard, if one of its shard servers dies,
new search requests are handled by its surviving shard servers. To ensure that
there is always enough coverage for a shard, the master will instruct other
shard servers to take over the shards of a dead shard server.

PERFORMANCE ISSUES
Currently, each shard server reads a shard directly from HDFS. Experiments
have shown that this approach does not perform very well, with HDFS causing
Lucene to slow down fairly dramatically (by well over 5x when data blocks are
accessed over the network). Consequently, we are exploring different ways to
leverage the fault tolerance of HDFS and, at the same time, work around its
performance problems. One simple alternative is to add a local file system
cache on each shard server. Another alternative is to modify HDFS so that an
application has more control over where to store the primary and replicas of
an HDFS block. This feature may be useful for other HDFS applications (e.g.,
HBase). We would like to collaborate with other people who are interested in
adding this feature to HDFS.


Regards,
Ning Li