It takes me about 6 days to crawl, parse, and index 5 million documents. I did not create an incremental index, so I now have 2000 crawls in different directories. What is the best way to merge them all into one index? I started using mergecrawls.sh, but it has been running for two weeks already and still isn't done. I've been monitoring the disk and there are no I/O waits; CPU usage is at about 14%.
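In case it helps to see what I mean, below is a rough sketch of the batched approach I'm considering instead of one giant merge. The directory layout and batch size are just examples, and mergedb/mergesegs are the stock Nutch merge tools, which as far as I can tell are what mergecrawls.sh calls under the hood:

    #!/bin/bash
    # Sketch: fold 2000 crawl dirs into one in batches of 50, so each
    # merge job (and its single sort) works on a bounded amount of data.
    # Assumed layout: crawls/crawl-NNNN/{crawldb,segments}; adjust to taste.

    NUTCH=bin/nutch
    BATCH=50
    batch_no=0
    dbs=()   # crawldbs collected for the current batch
    segs=()  # segment dirs collected for the current batch

    flush_batch() {
      [ ${#dbs[@]} -eq 0 ] && return
      out=merged/batch-$batch_no
      "$NUTCH" mergedb "$out/crawldb" "${dbs[@]}"
      "$NUTCH" mergesegs "$out/segments" "${segs[@]}"
      batch_no=$((batch_no + 1))
      dbs=()
      segs=()
    }

    for crawl in crawls/*; do
      dbs+=("$crawl/crawldb")
      for s in "$crawl"/segments/*; do
        segs+=("$s")
      done
      [ ${#dbs[@]} -ge $BATCH ] && flush_batch
    done
    flush_batch

    # Second pass: run the same merge over the ~40 batch outputs to get
    # one crawldb and one segment set, then invert links and rebuild the
    # index once with 'bin/nutch index' rather than merging 2000 small
    # Lucene indexes.

The idea is that every sort stays small, and the one unavoidable big merge happens only once at the end.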
The server I'm doing the merges on has 16 cores, 32 GB of RAM, and 30 TB of ZFS storage on a SAN. Ideas, anyone?

Thanks in advance as always,
Alex

--- On Fri, 6/5/09, Raymond Balmès <[email protected]> wrote:

> From: Raymond Balmès <[email protected]>
> Subject: Re: Merge taking forever
> To: [email protected]
> Date: Friday, June 5, 2009, 2:38 AM
>
> How long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know, because I'm about to shoot in this area but I have no idea how long it will take.
>
> -Ray-
>
> 2009/6/5 John Martyniak <[email protected]>
>
> > Arkadi,
> >
> > I think that is the beauty of Nutch: I have built an index of a little more than 6 million URLs with "out of the box" Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines.
> >
> > -John
> >
> > On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:
> >
> >> Hi Andrzej,
> >>
> >>> -----Original Message-----
> >>> From: Andrzej Bialecki [mailto:[email protected]]
> >>> Sent: Thursday, June 04, 2009 9:47 PM
> >>> To: [email protected]
> >>> Subject: Re: Merge taking forever
> >>>
> >>> Bartosz Gadzimski wrote:
> >>>
> >>>> As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have the same problem and am now thinking of using more boxes or very fast drives (SAS 15k).
> >>>>
> >>>> Raymond Balmès writes:
> >>>>
> >>>>> Well, I suspect the sort function is mono-threaded, as they usually are, so only one core is used - 25% is the max you will get. I have a dual core and it only goes to 50% CPU in many of the steps... I assumed that some phases are mono-threaded.
> >>>
> >>> Folks,
> >>>
> >>> From your conversation I suspect that you are running Hadoop with LocalJobTracker, i.e. in a single JVM - correct?
> >>>
> >>> While this works OK for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty for the overheads). As your dataset grows, you will quickly reach the scalability limits - in this case, the limit of the IO throughput of a single drive during the sort phase of a large dataset. The excessive IO demands can be solved by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.
> >>>
> >>> Hadoop tasks are usually single-threaded, and additionally LocalJobTracker implements only a primitive non-parallel model of task execution - i.e. each task is scheduled to run sequentially, in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.
> >>>
> >>> So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
> >>
> >> Thank you for these recommendations.
> >>
> >> I think that there is a large group of users (perhaps limited by budget, or by the time they are willing to spend) who will give up on trying to use Nutch unless they can run it on a single box with a simple configuration.
> >>
> >> Regards,
> >>
> >> Arkadi
> >>
> >>> --
> >>> Best regards,
> >>> Andrzej Bialecki   <><
> >>>  ___. ___ ___ ___ _ _   __________________________________
> >>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >>> ___|||__||  \|  || |   Embedded Unix, System Integration
> >>> http://www.sigram.com  Contact: info at sigram dot com
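For the archives, here is a minimal sketch of the single-node distributed setup Andrzej describes: a real JobTracker plus HDFS on one box, so tasks run as separate JVMs in parallel. The property names are from the Hadoop 0.19.x line that ships with Nutch 1.0; the hosts, ports, paths, and slot counts are only examples:

    #!/bin/bash
    # Pseudo-distributed Hadoop on one box, instead of LocalJobTracker.
    # Run from the Nutch/Hadoop home dir; start-all.sh expects
    # passwordless ssh to localhost.

    cat > conf/hadoop-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <!-- Use HDFS and a real JobTracker instead of local mode. -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
      <!-- Spread block storage and sort spills over several disks if you can. -->
      <property>
        <name>dfs.data.dir</name>
        <value>/disk1/dfs/data,/disk2/dfs/data</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/disk1/mapred/local,/disk2/mapred/local</value>
      </property>
      <!-- Single node, so no replication. -->
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <!-- Let a 16-core box actually run tasks in parallel. -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
    </configuration>
    EOF

    bin/hadoop namenode -format      # one-time HDFS format
    bin/start-all.sh                 # namenode, datanode, jobtracker, tasktracker
    bin/hadoop fs -put crawls crawls # copy the crawl dirs into HDFS

    # Nutch jobs submitted from here on run on the cluster, e.g.:
    #   bin/nutch mergedb merged/crawldb crawls/crawl-0001/crawldb ...

Even on one machine this lets several map and reduce JVMs run at once, which is exactly the parallelism a local-mode job stuck at 14% CPU never gets.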
