Re: how to combine two run's result for search

Dennis Kubes Tue, 05 Sep 2006 18:54:59 -0700

Are those like the shuttle boards?  Smaller 1/4 size boxes?


Dennis

Zaheed Haque wrote:

Renaud:

Yes and no. I did some testing as Dennis Kubes suggested and got
results similar to his. In short: 4 Nutch search servers in one box,
each on a different disk, with in my case 0.75 million docs per disk.
I had about 4 GB of memory and 1 AMD 64 processor, and it worked out
rather OK. I need to do more testing to fine-tune this because it
really raises the issue of cost. I have also thought about doing some
testing with VIA EPIA boards. Maybe in the future :-)

The problem I ran into is more this one:

http://issues.apache.org/jira/browse/NUTCH-92

but this will be solved sooner or later, it is just a matter of time.

Cheers


On 9/5/06, Renaud Richardet <[EMAIL PROTECTED]> wrote:

Zaheed,

Thank you, that works well. Do you know if there is a big performance
overhead with starting 2 servers? As an alternative, could we use
Lucene's MultiSearcher?

-- Renaud


Zaheed Haque wrote:
> Hi:
>
> Assuming you have
>
> index 1 at /data/crawl1
> index 2 at /data/crawl2
>
> In nutch-site.xml
> searcher.dir = /data
>
> Under /data you have a text file called search-servers.txt (I think;
> please check the searcher.dir description in nutch-site)
>
> In the text file you will have the following
>
> hostname1 portnumber
> hostname2 portnumber
>
> example
> localhost 1234
> localhost 5678
>
> Then you need to start
>
> bin/nutch server 1234 /data/crawl1 &
>
> and
>
> bin/nutch server 5678 /data/crawl2 &
>
> now try
>
> bin/nutch org.apache.nutch.searcher.NutchBean www
>
> you should see results :-)
>
> Cheers
>
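The setup quoted above could be sketched as a small shell helper. This is only a rough sketch under assumptions, not a tested recipe: it assumes the server list file is named search-servers.txt (please check the searcher.dir description), that both servers run on localhost, and that the indexes sit in crawl1 and crawl2 as in the example. The function names write_server_list and print_server_commands are made up for illustration and are not part of Nutch.

```shell
#!/bin/sh
# Sketch: prepare a distributed-search setup for two local indexes.
# write_server_list and print_server_commands are illustrative names,
# not Nutch commands.

# Write one "hostname port" pair per line into the searcher.dir directory.
# The file name search-servers.txt is an assumption; verify it against the
# searcher.dir description in your Nutch configuration.
write_server_list() {
    dir="$1"; shift
    : > "$dir/search-servers.txt"
    for port in "$@"; do
        echo "localhost $port" >> "$dir/search-servers.txt"
    done
}

# Print (do not run) the server start commands from the thread,
# one search server per index directory.
print_server_commands() {
    echo "bin/nutch server 1234 $1/crawl1 &"
    echo "bin/nutch server 5678 $1/crawl2 &"
}
```

For the layout in this thread one would call `write_server_list /data 1234 5678`, then run the two commands that `print_server_commands /data` prints from the Nutch install directory.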
> On 9/5/06, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>> @Dennis,
>> Can you explain how to set up distributed search while storing the 2
>> indexes on the same local machine (if possible)?
>>
>> @Feng,
>> We created a shell script to merge 2 runs, let us know if that works for
>> you.
>> http://wiki.apache.org/nutch/MergeCrawl
>>
>> Renaud
>>
>>
>> Dennis Kubes wrote:
>> > You can keep the indexes separate and use the distributed search
>> > server, one per index or you can use the mergedb and mergesegs
>> > commands to merge the two runs into a single crawldb and a single
>> > segments then re-run the invertlinks and index to create a single
>> > index file which can then be searched.
>> >
>> > Dennis
>> >
>> > Feng Ji wrote:
>> >> Hi there,
>> >>
>> >> In Nutch 0.8, I have crawled from two webDBs independently.
>> >>
>> >> For each run, I did invertlinks and index, so each one is searchable.
>> >>
>> >> Now I want to combine them together for search. I tried the "merge"
>> >> command to merge the two indexes, but the search on the resulting
>> >> index output dir is dull.
>> >> Do I need to put the output dir in the same directory as the two
>> >> crawl/ above?
>> >>
>> >> I wonder what the proper steps are to combine two separate runs into
>> >> one search result. Do I need to combine the two webdbs, merge the two
>> >> segments, and then do invertlinks and index?
>> >>
>> >> Thanks for your time,
>> >>
>> >> Michael,
>> >>
>> >
>>
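The merge route Dennis describes above (mergedb, mergesegs, then invertlinks and index) might look roughly like the following. This is a hedged sketch: the argument order is written from memory of the Nutch 0.8 usage messages, so run each `bin/nutch <command>` without arguments to confirm it first. The function name merge_plan and the crawldb/segments/linkdb/indexes layout under each directory are assumptions for illustration. The script only prints the commands, it does not run them.

```shell
#!/bin/sh
# Sketch: print the merge-then-reindex steps for two crawl directories.
# merge_plan is an illustrative name, not a Nutch command; the directory
# layout (crawldb, segments, linkdb, indexes) is assumed.
merge_plan() {
    out="$1"; a="$2"; b="$3"
    # 1. Merge the two crawldbs into a single crawldb.
    echo "bin/nutch mergedb $out/crawldb $a/crawldb $b/crawldb"
    # 2. Merge the segments of both runs into one segments directory.
    echo "bin/nutch mergesegs $out/segments $a/segments/* $b/segments/*"
    # 3. Re-run invertlinks over the merged segments.
    echo "bin/nutch invertlinks $out/linkdb -dir $out/segments"
    # 4. Build a single index from the merged crawldb, linkdb and segments.
    echo "bin/nutch index $out/indexes $out/crawldb $out/linkdb $out/segments/*"
}

merge_plan /data/merged /data/crawl1 /data/crawl2
```

Printing the plan first makes it easy to check the paths before anything is overwritten; once they look right, the lines can be pasted into a shell in the Nutch install directory.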
>> --
>> Renaud Richardet
>> COO America
>> Wyona    -   Open Source Content Management   -   Apache Lenya
>> office +1 857 776-3195                  mobile +1 617 230 9112
>> renaud.richardet <at> wyona.com           http://www.wyona.com
>>
>>
>

--
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com
