Byron said:
> I don't think you have to refetch if you fix your
> segments. I believe it just chops off what it can't
> read and dumps a clean segment (of course, missing some
> data).
>
> Have you tried merging your indexes into a single
> index and searching from that?
Yes, and it now works like a charm: queries are returned
in a second or two. When I merged the segments, the
output didn't go where I thought it did, so I wasn't
running queries against the merged one. Also, as you
suggest, after fixing the bad segment I didn't need to
refetch.
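
For anyone who wants to check the same before/after difference on their
own setup, here is a minimal sketch for timing a search request. It is
not part of Nutch; the function name time_query and its opener parameter
are made up for illustration, and the URL is whatever your own search
page happens to be.

```python
import time
import urllib.request


def time_query(url, repeats=3, opener=urllib.request.urlopen):
    """Return the wall-clock seconds taken by each of `repeats` fetches of `url`.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen) so the
    timing loop can be exercised without a live server.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with opener(url) as response:
            # Read the whole body so the time to render results is included.
            response.read()
        timings.append(time.perf_counter() - start)
    return timings
```

Calling time_query() with the search URL a few times before and after
switching to the merged index should show whether queries really dropped
from 10-15 seconds to a second or two.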
> I personally use Resin, but even tomcat shouldn't be
> this slow. You say 300k sites indexed, was this done
> with crawl and a depth or how many urls make up your
> db and such?
I'll have to look into Resin -- the web site does look
interesting.
Thanks greatly -- it is nice to have Nutch working
as I hoped it would. The bad segment was indeed
slowing down queries by a factor of 5 or maybe more.
There is a big smile on my face for having this solved.
- Bill
> -byron
>
> --- Bill Goffe <[EMAIL PROTECTED]> wrote:
>
> > Byron said:
> >
> > > Since you're running Debian, can you confirm your
> > > java_home points to 1.4.2 and not Kaffe for both
> > > Nutch & Tomcat?
> >
> > Yes, sure of this (nothing with locate or which to kaffe
> > and the environment variable seems correct).
> >
> > > If you have corruption, you may want to start over.
> > > My laptop runs quicker queries on 300k pages than
> > > this server yields results.
> >
> > Wow. FWIW I first tried bin/nutch segread -fix but that
> > didn't fix the corrupted segment (there has been another
> > report of that here, even when I deleted all index files
> > as was also reported here). I then tried bin/nutch
> > segslice -fix and that indeed worked (it created two
> > segments, one of which had zero size and the other was
> > fine). (Oh -- I figured out it was corrupted with
> > bin/nutch segread -list.)
> >
> > But, even with bin/nutch segslice -fix it would seem that
> > at least I would need to refetch -- is this correct?
> >
> > > Was your crawl/fetch performing terribly as well or
> > > just queries?
> >
> > Hm. Not sure how the crawl fetch should go; took about 24
> > hours for 300,000 sites (doubtless depends on the files in
> > conf -- I can live with that speed). But, queries take up
> > to 10-15 seconds.
> >
> > - Bill
> >
> > >
> > > -byron
> > >
> > > --- Bill Goffe <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello -
> > > >
> > > > I'm experiencing slow searches. Here are the specifics:
> > > > - Search example:
> > > >   http://rfe.org/search.jsp?query=wealth+of+nations
> > > >   reliably takes 11 seconds
> > > > - ~300K pages in the database (used mergesegs w/ indexing
> > > >   on my three segments; one was found partially corrupted)
> > > > - Dual 2.80GHz Xeon machine with 3 gig RAM and SCSI disks
> > > >   (hardware RAID?)
> > > > - Nutch 0.7.1
> > > > - JAVA_OPTS="-Xmx1024m -Xms512m" (doesn't seem to matter)
> > > > - Tomcat 5.5.9 (minProcessors="5" maxProcessors="75" in my
> > > >   connector for proxying in server.xml)
> > > > - Java(TM) 2 SDK, Standard Edition Version 1.4.2
> > > > - Linux (Debian) with 2.4.27-2-686-smp kernel
> > > >
> > > > When I monitor the search with htop (a _nice_ replacement
> > > > for top -- much easier to kill or renice jobs in it than
> > > > top, and can easily view parent and child processes and
> > > > sort views different ways) I see 41 processes (seems like
> > > > a lot?) started by Tomcat. Memory usage for each goes to
> > > > ~200M after a search of the above from about 64K at Tomcat
> > > > startup (even on a single word search it goes to ~150M).
> > > >
> > > > I didn't see anything obvious in nutch-default.xml to
> > > > fiddle with nor anything that really seemed apropos in the
> > > > list archive (other than others seem to get much faster
> > > > searches). Any suggestions?
> > > >
> > > > - Bill
> > > >
--
*------------------------------------------------------*
| Bill Goffe [EMAIL PROTECTED] |
| Department of Economics voice: (315) 312-3444 |
| SUNY Oswego fax: (315) 312-5444 |
| 416 Mahar Hall <http://cook.rfe.org> |
| Oswego, NY 13126 |
*--------*------------------------------------------------------*-----------*
| "Civilization has been built on genetically modified plants." |
| -- Nina V. Fedoroff, the Evan Pugh Professor of Biology and Willaman |
| Professor of Life Science at Penn State University. She was |
| commenting on the finding, reported in "Science," that ancient |
| Mexicans were breeding corn for desirable traits 5,500 years ago. |
| She went on to say that the whole world eats genetically modified |
| foods -- starting thousands of years ago, "that rice in China, wheat |
| in the Middle East and corn in Mexico were all genetically altered |
| through selective cultivation." (quotes from the article, not her). |
| "Study: Ancients Manipulated Corn Genes," Associated Press |
| www.nytimes.com/aponline/science/AP-Genetically-Modified-Corn.html |
*---------------------------------------------------------------------------*