Tomcat query

2008-01-28 Thread Jaya Ghosh
Hello, I have a query. I have created an index of our online documentation files (HTML). It is therefore more like an intranet search, that is, the search will be performed on static documents only. Now I need to test it. My machine does not have Tomcat installed. The IT department has in

Nutch and Hadoop

2008-01-28 Thread payo
Hi, I am working with nutch-0.8.1 and I am trying to configure Hadoop, but my questions are: - In the bin directory there are the files: hadoop, hadoop-daemon, hadoop-daemons, nutch, rcc, slaves, start-all, start-dfs, start-mapred, stop-all, stop-dfs, stop-mapred. These files are necessary to run nutch

Re: Nutch and Hadoop

2008-01-28 Thread John Mendenhall
> i am working with nutch-0.8.1 and i am trying configure hadoop but my > questions are: > > -in the directory bin exist the files: > > hadoop, hadoop-daemon, hadoop-daemons, nutch, rcc, slaves, start-all, > start-dfs, start-mapred, stop-all, stop-dfs, stop-mapred > > this files are necesary f

Simple crawl fails to find any URLs

2008-01-28 Thread Barry Haddow
Hi, I'm trying to get the nutch/hadoop example from http://wiki.apache.org/nutch/NutchHadoopTutorial running. I've set up the urllist.txt and the crawl-urlfilter.xml as instructed in the tutorial, but whenever I run the crawl it either reports Generator: 0 records selected for fetching, exiting

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

2008-01-28 Thread Per Andreas Buer
Excellent. Thank you. Per. Marcin Okraszewski wrote: There is regex-normalize.xml in the conf dir, which allows you to manipulate URLs (e.g. remove the string after '#'). Remember to have urlnormalizer-regex in the plugins.include option (nutch-site.xml). Marcin On 26 January 2008 9:36 Prafulla <[EMAI

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

2008-01-28 Thread Siddhartha Reddy
Adding this to your conf/regex-normalize.xml should remove the anchor from the URLs: \#(.*) Regards, Siddhartha On Jan 26, 2008 1:41 PM, Per Andreas Buer <[EMAIL PROTECTED]> wrote: > Hi. > > I'm indexing an intranet and I see some pages are fetched twenty times. > There are a lot of anch
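
To illustrate what the pattern \#(.*) does when it is paired with an empty substitution, here is a minimal Python sketch. It is not Nutch code and does not read regex-normalize.xml; the function name strip_fragment is made up for this example, and it only assumes that the normalizer replaces every match of the pattern with nothing.

```python
import re

# Pattern quoted in the message above; the empty replacement is an assumption
# about how the rule is meant to be used (drop the anchor entirely).
FRAGMENT = re.compile(r"\#(.*)")


def strip_fragment(url):
    """Return the URL with '#' and everything after it removed."""
    return FRAGMENT.sub("", url)


if __name__ == "__main__":
    for url in ("http://foo/bar#quux", "http://foo/bar#zoo", "http://foo/bar"):
        print(url, "->", strip_fragment(url))
```

With this rule both http://foo/bar#quux and http://foo/bar#zoo normalize to the same string, so a crawler that compares normalized URLs would treat them as one page.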

common-terms.utf8 not found in class path when using Nutch from WAR file

2008-01-28 Thread Björn Wilmsmann
Hello everybody, I have run into a rather weird problem that occurs when deploying a Grails (http://grails.codehaus.org/) app as a WAR file in Tomcat. My app instantiates a NutchDocumentAnalyzer during startup as a Spring resource. The Nutch classes and config files are loaded from a JAR

Re: Simple crawl fails to find any URLs

2008-01-28 Thread Susam Pal
If the crawl stops at depth=0, it means there is nothing to fetch in the first fetch cycle itself. Therefore there is no data to extract. Also, you mention crawl-urlfilter.xml in your message. I hope this was a typo because there is no such file. The actual filter is crawl-urlfilter.txt. Fo
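
One way a filter file can leave the generator with 0 records is that no accept rule matches the seed URLs. The Python sketch below imitates the usual style of crawl-urlfilter.txt, where each rule is a regex prefixed with + (accept) or - (reject), the first matching rule decides, and a URL matching no rule is dropped. It is only an illustration under those assumptions, not the Nutch implementation; the rules, the host example.org and the function names are hypothetical.

```python
import re

# Hypothetical rules in the +/- regex style; example.org is a placeholder host.
EXAMPLE_RULES = r"""
# skip URLs containing characters that usually mark queries
-[?*!@=]
# accept anything under the placeholder domain
+^http://([a-z0-9]*\.)*example\.org/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_RULES)
    for url in ("http://www.example.org/index.html", "http://other.test/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

If every seed URL is rejected this way, there is nothing to select for the first fetch cycle, which matches the symptom described in the thread.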