Null pointer error when performing search

2006-07-21 Thread Eric Wu
Hi, I am new to Nutch and I got a null pointer exception when I tried to submit a search through the demo app. Please see the error message below. I have modified the demo app to run in its own webapp context rather than in the ROOT context. The first page was shown, and I put in the keyword to search and got the e

Re: Hadoop and Recrawl

2006-07-21 Thread Renaud Richardet
Hi Roberto, did you try http://wiki.apache.org/nutch/IntranetRecrawl (thanks to Matthew Holt)? HTH, Renaud. Info wrote: Hi list, I tried to use this script with Hadoop but it doesn't work. I tried replacing ls with bin/hadoop dfs -ls, but the script still doesn't work because it uses ls -d and not plain ls. Someon

Hadoop and Recrawl

2006-07-21 Thread Info
Hi list, I tried to use this script with Hadoop but it doesn't work. I tried replacing ls with bin/hadoop dfs -ls, but the script still doesn't work because it uses ls -d and not plain ls. Can someone help me? Best regards, Roberto Navoni -Original message- From: Matthew Holt [mailto:[EMAIL PROTECTED] Sent

Why would a record be in the database but not show up in the results?

2006-07-21 Thread Matt Timion
Does anyone have an idea why a record would be in the database but not show up in the results? I have 400+ pages from a certain domain in my database (checked using bin/nutch admin), yet when I search for the domain, titles of certain pages from the domain, or unique URLs from the domain, no res

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
OK. However, a few minutes ago I ran the script exactly as you said and I still get this error: Exception in thread "main" java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at org.apache.lucene.store.FSDirectory.init(FSDirectory

Help associating domain name and ip address

2006-07-21 Thread Sudhi Seshachala
Hello Nutchians, I am sure many of you have run into the same problem I am having right now. I have a domain name, http://www.myopensourcejobs.com, and my app is hosted on a server (virtual dedicated server) 68.x.x.x at GoDaddy. I want to configure and associate the IP address and domain n

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Lourival Júnior wrote: I think it won't work for me because I'm using Nutch version 0.7.2. Actually I use this script (some comments are in Portuguese): #!/bin/bash # A simple script to run a Nutch re-crawl # Fonte do script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutc

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
I think it won't work for me because I'm using Nutch version 0.7.2. Actually I use this script (some comments are in Portuguese): #!/bin/bash # A simple script to run a Nutch re-crawl # Fonte do script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html #{ if [ -n "$

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Lourival Júnior wrote: Hi Renaud! I'm a newbie with shell scripts, and I know stopping the Tomcat service is not the best way to do this. The problem is, when I run the re-crawl script with Tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
Hi Renaud! I'm a newbie with shell scripts, and I know stopping the Tomcat service is not the best way to do this. The problem is, when I run the re-crawl script with Tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in thread "main" java.io.IOE

Please help... this has to be simple (re: mergesegs)

2006-07-21 Thread Honda Search Administrator
I'm running a few commands every week to keep my Nutch clean, but I'm not sure I'm doing it right. I merge the segments using the following command: bin/nutch mergesegs -dir crawl/segments/ -i -ds This should index the new segment and delete the old ones, which it does. After this wh

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Renaud Richardet wrote: Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here is the part of the script that "reloads Tomcat"; it's not the cleanest, but it should work: # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml HTH, Renaud Louri

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Renaud Richardet
Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here is the part of the script that "reloads Tomcat"; it's not the cleanest, but it should work: # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml HTH, Renaud Lourival Júnior wrote: Hi Mat
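
For anyone who wants the same "touch web.xml" trick outside of bash, here is a minimal sketch in Python. The function name touch_web_xml and the webapp_dir argument are made up for illustration (webapp_dir plays the role of $nutch_dir in the script above). It only updates the file's modification time; whether Tomcat actually reloads the webapp when web.xml changes depends on how the context is configured, so treat that part as an assumption.

```python
from pathlib import Path


def touch_web_xml(webapp_dir):
    """Update the modification time of WEB-INF/web.xml under webapp_dir.

    Hypothetical helper: mirrors `touch $nutch_dir/WEB-INF/web.xml`.
    Refuses to create the file if it is missing, since an empty web.xml
    would not be a valid deployment descriptor.
    """
    web_xml = Path(webapp_dir) / "WEB-INF" / "web.xml"
    if not web_xml.is_file():
        raise FileNotFoundError(f"no web.xml found at {web_xml}")
    # For an existing file, touch() sets the mtime to the current time.
    web_xml.touch(exist_ok=True)
    return web_xml
```

Calling touch_web_xml with the deployed Nutch webapp directory at the end of a recrawl would stand in for the last line of the shell script.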

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you said the re-crawl script has a problem with updating the live search index. In my tests with Nutch version 0.7.2, when I run the script the index could not be updated because the tomcat lo

Recrawl script for 0.8.0 completed...

2006-07-21 Thread Matthew Holt
Thanks for putting up with all the messages to the list... Here is the recrawl script for 0.8.0 if anyone is interested. Matt --- #!/bin/bash # Nutch recrawl script. # Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch

RE: Nutch with Domino web server

2006-07-21 Thread Luke Lim
Are you crawling JSPs? Put this in your regex-normalize.xml: pattern (.*)(;jsessionid=[a-zA-Z0-9]{32})(.*) with substitution $1$3. Then change this setting in your nutch-default.xml: urlnormalizer.class = org.apache.nutch.net.RegexUrlNormalizer (description: Name of the class used to normalize URLs.) -Original
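
To see what that rule does to a URL, here is a small Python sketch that applies the same pattern and substitution. It is only an illustration: Nutch applies the rule with Java regexes through its URL normalizer, and the function name strip_jsessionid and the example URLs are made up here. Python writes the $1$3 substitution as \1\3.

```python
import re

# Same pattern as in the message above: group 2 is the ";jsessionid=..."
# segment with exactly 32 alphanumeric characters.
JSESSIONID = re.compile(r"(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)")


def strip_jsessionid(url):
    """Drop the session id segment, keeping what comes before and after it.

    Hypothetical helper; equivalent to substituting $1$3 for the match.
    URLs without a matching segment are returned unchanged.
    """
    return JSESSIONID.sub(r"\1\3", url)


if __name__ == "__main__":
    session_id = "0123456789ABCDEF0123456789abcdef"  # 32 characters
    print(strip_jsessionid(f"http://example.com/page.jsp;jsessionid={session_id}?x=1"))
```

Without normalization, every session would give the same page a different URL, so the crawler would treat each one as a new page.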

Nutch with Domino web server

2006-07-21 Thread Deepa Devanathan
Hi guys, I tried crawling my site, which runs on a Domino web server talking to a Tomcat, using the crawl command (with all the config for URLs, file types, etc.), but the crawl log doesn't show any URLs being fetched. Is there something different I need to do to run a crawl for a site run

Re: Best performance approach for single MP machine?

2006-07-21 Thread Doug Cook
Thanks, Håvard (and Doug, in the original email). Those pointers, plus a few other tips from elsewhere, did the trick. I'm now up and running with all CPUs. One thing I found along the way was that if I did not set mapred.child.heap.size, I would run out of heap space in initialization of inject