Hello,
I have a query.
I have created an index of our online documentation files (HTML pages), so it
is more like an intranet search; that is, the search will be performed on
static documents only. Now I need to test it. My machine does not have
Tomcat installed. The IT department has in
Hi,
I am working with nutch-0.8.1 and I am trying to configure Hadoop, but my
questions are:
- In the bin directory there are these files:
hadoop, hadoop-daemon, hadoop-daemons, nutch, rcc, slaves, start-all,
start-dfs, start-mapred, stop-all, stop-dfs, stop-mapred
Are these files necessary to run Nutch?
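For context, those are the standard Hadoop launcher scripts plus the nutch command itself; a rough usage sketch, assuming conf/hadoop-site.xml, conf/slaves and a seed urls directory are already set up as in the NutchHadoopTutorial (paths and numbers are only illustrative):

  # start the DFS and MapReduce daemons on the hosts listed in conf/slaves
  # (the start-*/stop-* scripts above normally carry a .sh suffix)
  bin/start-all.sh

  # run a crawl from the seed directory on the cluster
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

  # stop the daemons again
  bin/stop-all.sh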
Hi
I'm trying to get the nutch/hadoop example from
http://wiki.apache.org/nutch/NutchHadoopTutorial
running.
I've set up the urllist.txt and the crawl-urlfilter.xml as instructed in the
tutorial, but whenever I run the crawl it either reports
Generator: 0 records selected for fetching, exiting
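(For reference, the urllist.txt seed file mentioned above is simply one URL per line; the hostname here is a placeholder:

  http://intranet.example.com/

The seed host also has to be allowed by the filter rules, otherwise the generator has nothing to select.)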
Excellent.
Thank you.
Per.
Marcin Okraszewski wrote:
There is a regex-normalize.xml in the conf dir, which allows you to manipulate
URLs (e.g. remove the string after '#'). Remember to have urlnormalizer-regex
in the plugin.includes property (nutch-site.xml).
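A rough idea of what that looks like in nutch-site.xml, inside the <configuration> element (the plugin list shown here is only illustrative; keep whatever your install already has and just add urlnormalizer-regex):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|urlnormalizer-regex</value>
  </property>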
Marcin
On 26 January 2008 at 9:36, Prafulla <[EMAI
Adding this to your conf/regex-normalize.xml should remove the anchor from
the URLs:
\#(.*)
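A sketch of the complete rule, using the pattern/substitution layout of the entries already in that file (the rule goes inside the file's root element; an empty substitution simply strips the matched anchor):

  <regex>
    <pattern>\#(.*)</pattern>
    <substitution></substitution>
  </regex>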
Regards,
Siddhartha
On Jan 26, 2008 1:41 PM, Per Andreas Buer <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anch
Hello everybody,
I have run into a rather weird problem that occurs when deploying a
Grails (http://grails.codehaus.org/) app as a WAR file in Tomcat. My
app instantiates a NutchDocumentAnalyzer during startup as a Spring
resource. The Nutch classes and config files are loaded from a JAR
If the crawl stops at depth=0, it means there is nothing to fetch in
the first fetch cycle itself. Therefore there is no data to extract.
Also, you mention crawl-urlfilter.xml in your message. I hope
this was a typo because there is no such file. The actual filter is
crawl-urlfilter.txt.
Fo
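For reference, crawl-urlfilter.txt holds one regular expression per line, prefixed with + (accept) or - (reject); a minimal sketch that restricts the crawl to a single domain (MY.DOMAIN.NAME is a placeholder):

  # accept URLs on our own host
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

  # skip everything else
  -.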