Byron said:

> I don't think you have to refetch if you fix your
> segments. I believe it just chops off what it can't
> read and dumps a clean segment (of course missing some
> data).
> 
> Have you tried merging your indexes into a single
> index and search from that?

Yes, and it now works like a charm: queries are returned
in a second or two. When I merged segments, the output
didn't go where I thought it did, so I wasn't running
queries against the merged index. Also, as you suggested,
after fixing the bad segment I didn't need to refetch.

> I personally use Resin, but even Tomcat shouldn't be
> this slow. You say 300k sites indexed; was this done
> with crawl and a depth, or how many URLs make up your
> db and such?

Interesting -- I'll have to look into Resin. The web
site does look interesting.

Thanks greatly -- it is nice to have Nutch working
as I hoped it would. The bad segment was indeed
slowing down queries by a factor of 5 or maybe more.
There is a big smile on my face for having this solved.
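
As a rough check on that factor, here is a minimal Python sketch
using only the timings mentioned in this thread (about 11 seconds
per query before the fix, a second or two after). The function name
is purely illustrative and has nothing to do with Nutch itself.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster a query is after the fix."""
    if after_s <= 0:
        raise ValueError("after_s must be positive")
    return before_s / after_s


# Timings reported in this thread: ~11 s before, 1 to 2 s after.
before = 11.0
low = speedup(before, 2.0)   # slower end of "a second or two"
high = speedup(before, 1.0)  # faster end of "a second or two"
print(f"speedup between {low:.1f}x and {high:.1f}x")
```

So "a factor of 5 or maybe more" is consistent with the numbers above.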

      - Bill



> -byron
> 
> --- Bill Goffe <[EMAIL PROTECTED]> wrote:
> 
> > Byron said:
> > 
> > > Since you're running Debian, can you confirm your
> > > java_home points to 1.4.2 and not Kaffe for both
> > > Nutch & Tomcat?
> > 
> > Yes, I'm sure of this (neither locate nor which finds
> > kaffe, and the environment variable seems correct).
> >  
> > > If you have corruption, you may want to start over.
> > > My laptop runs quicker queries on 300k pages than
> > > this server yields results.
> > 
> > Wow. FWIW I first tried bin/nutch segread -fix but that
> > didn't fix the corrupted segment (there has been another
> > report of that here, even when I deleted all index files
> > as was also reported here). I then tried
> > bin/nutch segslice -fix and that indeed worked (it created
> > two segments, one of which had zero size and the other was
> > fine). (Oh -- I figured out that it was corrupted with
> > bin/nutch segread -list.)
> > 
> > But, even with bin/nutch segslice -fix it would seem that
> > at least I would need to refetch -- is this correct?
> > 
> > > Was your crawl/fetch performing terribly as well or
> > > just queries?
> > 
> > Hm. Not sure how the crawl fetch should go; took about 24
> > hours for 300,000 sites (doubtless depends on the files in
> > conf -- I can live with that speed). But, queries take up to
> > 10-15 seconds.
> > 
> >     - Bill
> > 
> > > 
> > > -byron
> > > 
> > > --- Bill Goffe <[EMAIL PROTECTED]> wrote:
> > > 
> > > > Hello -
> > > > 
> > > > I'm experiencing slow searches. Here are the specifics:
> > > >   - Search example:
> > > >     http://rfe.org/search.jsp?query=wealth+of+nations
> > > >     reliably takes 11 seconds
> > > >   - ~300K pages in the database (used mergesegs w/ indexing
> > > >     on my three segments; one was found partially corrupted)
> > > >   - Dual 2.80GHz Xeon machine with 3 gig RAM and SCSI disks
> > > >     (hardware RAID?)
> > > >   - Nutch 0.7.1
> > > >   - JAVA_OPTS="-Xmx1024m -Xms512m" (doesn't seem to matter)
> > > >   - Tomcat 5.5.9 (minProcessors="5" maxProcessors="75" in my
> > > >     connector for proxying in server.xml)
> > > >   - Java(TM) 2 SDK, Standard Edition Version 1.4.2
> > > >   - Linux (Debian) with 2.4.27-2-686-smp kernel
> > > > 
> > > > When I monitor the search with htop (a _nice_ replacement for
> > > > top -- much easier to kill or renice jobs in it than top, and
> > > > can easily view parent and child processes and sort views
> > > > different ways) I see 41 processes (seems like a lot?) started
> > > > by Tomcat. Memory usage for each goes to ~200M after a search
> > > > of the above from about 64K at Tomcat startup (even on a
> > > > single word search it goes to ~150M).
> > > > 
> > > > I didn't see anything obvious in nutch-default.xml to fiddle
> > > > with nor anything that really seemed apropos in the list
> > > > archive (other than others seem to get much faster searches).
> > > > Any suggestions?
> > > > 
> > > >          - Bill

-- 
         *------------------------------------------------------*
         | Bill Goffe                 [EMAIL PROTECTED]          |
         | Department of Economics    voice: (315) 312-3444     |
         | SUNY Oswego                fax:   (315) 312-5444     |
         | 416 Mahar Hall             <http://cook.rfe.org>     |          
         | Oswego, NY  13126                                    |
*--------*------------------------------------------------------*-----------*
| "Civilization has been built on genetically modified plants."             |
|   -- Nina V. Fedoroff, the Evan Pugh Professor of Biology and Willaman    |
|      Professor of Life Science at Penn State University. She was          |
|      commenting on the finding, reported in "Science," that ancient       |
|      Mexicans were breeding corn for desirable traits 5,500 years ago.    |
|      She went on to say that the whole world eats genetically modified    |
|      foods -- starting thousands of years ago, "that rice in China, wheat |
|      in the Middle East and corn in Mexico were all genetically altered   |
|      through selective cultivation." (quotes from the article, not her).  |
|      "Study: Ancients Manipulated Corn Genes," Associated Press           |
|      www.nytimes.com/aponline/science/AP-Genetically-Modified-Corn.html   |
*---------------------------------------------------------------------------*



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
