Re: Best performance approach for single MP machine?

2006-07-21 Thread Doug Cook
Thanks, Håvard (and Doug, in the original email). Those pointers, plus a few other tips from elsewhere, did the trick. I'm now up and running with all CPUs. One thing I found along the way was that if I did not set mapred.child.heap.size, I would run out of heap space in initialization of

Nutch with Domino web server

2006-07-21 Thread Deepa Devanathan
Hi guys, I tried crawling my site, which runs on a Domino web server talking to Tomcat, using the crawl command (with all the config for URLs, file types, etc.), but the crawl log doesn't show any URLs being fetched. Is there something different I need to do to run a crawl for a site

RE: Nutch with Domino web server

2006-07-21 Thread Luke Lim
Are you crawling JSPs? Put this in your regex-normalize.xml: <regex> <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern> <substitution>$1$3</substitution> </regex> *** And change this setting in your nutch-default.xml: <property> <name>urlnormalizer.class</name>
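For anyone wanting to see what that rule does before touching the config, here is a minimal sketch in Python (not Nutch code) that applies the same pattern and substitution to a URL. The function name and the example URL are hypothetical; only the pattern itself comes from the message above.

```python
import re

# Pattern quoted in the message above; \1\3 is the Python spelling of $1$3.
JSESSIONID = re.compile(r"(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)")


def strip_jsessionid(url: str) -> str:
    """Drop a 32-character ;jsessionid=... segment, keeping the rest of the URL."""
    return JSESSIONID.sub(r"\1\3", url)


# Hypothetical URL with a made-up 32-character session id.
url = "http://example.com/page.jsp;jsessionid=" + "A1B2C3D4" * 4 + "?id=7"
print(strip_jsessionid(url))
```

URLs without a session id pass through unchanged, so the rule should be safe to apply to every fetched link. Note that the pattern only matches ids of exactly 32 alphanumeric characters.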

Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Thanks for putting up with all the messages to the list... Here is the recrawl script for 0.8.0 if anyone is interested. Matt --- #!/bin/bash # Nutch recrawl script. # Based on 0.7.2 script at

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you said the re-crawl script has a problem with updating the live search index. In my tests with Nutch version 0.7.2, when I run the script the index could not be updated because the Tomcat

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Renaud Richardet
Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here's the part of the script that reloads Tomcat; not the cleanest, but it should work: # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml HTH, Renaud Lourival Júnior wrote: Hi

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Renaud Richardet wrote: Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here's the part of the script that reloads Tomcat; not the cleanest, but it should work: # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml HTH, Renaud

PLease help... this has to be simple (re: mergesegs)

2006-07-21 Thread Honda Search Administrator
I'm running a few commands every week to keep my Nutch clean, but I'm not sure whether I'm doing it right. I merge the segments using the following command: bin/nutch mergesegs -dir crawl/segments/ -i -ds This should index the new segment and delete the old ones, which it does. After this

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
Hi Renaud! I'm a newbie with shell scripts, and I know stopping the Tomcat service is not the best way to do this. The problem is, when I run the re-crawl script with Tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in thread "main"

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Lourival Júnior wrote: Hi Renaud! I'm a newbie with shell scripts, and I know stopping the Tomcat service is not the best way to do this. The problem is, when I run the re-crawl script with Tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Lourival Júnior wrote: I think it won't work for me because I'm using Nutch version 0.7.2. Actually I use this script (some comments are in Portuguese): #!/bin/bash # A simple script to run a Nutch re-crawl # Fonte do script:

Help associating domain name and ip address

2006-07-21 Thread Sudhi Seshachala
Hello Nutchians, I am sure many of you have experienced the same problem I have right now. I have a domain name, http://www.myopensourcejobs.com, and my app is hosted on a server (virtual dedicated server) 68.x.x.x at GoDaddy. I want to configure and associate the IP address and domain

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
OK. However, a few minutes ago I ran the script exactly as you said and I still get this error: Exception in thread "main" java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at

Why would a record be in the database but not show up in the results?

2006-07-21 Thread Matt Timion
Does anyone have an idea why a record would be in the database but not show up in the results? I have 400+ pages from a certain domain in my database (checked using bin/nutch admin ) yet when I search for the domain, titles to certain pages from the domain, or unique URLs from the domain no

Re: Hadoop and Recrawl

2006-07-21 Thread Renaud Richardet
Hi Roberto, Did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to Matthew Holt)? HTH, Renaud Info wrote: Hi List, I tried to use this script with Hadoop but it doesn't work. I tried replacing ls with bin/hadoop dfs -ls, but the script still doesn't work because it uses ls -d, not just ls.

Null pointer error when perform search

2006-07-21 Thread Eric Wu
Hi, I am new to Nutch and I got a null pointer exception when I tried to submit a search through the demo app. Please see the error message below. I have modified the demo app to run in its own webapp context rather than in the ROOT context. The first page was shown; I put in a keyword to search and got the