Re: Error running intranet crawl with 0.8.0-dev

2006-07-13 Thread manish kothari
Daniel Varela Santoalla dvarela at ecmwf.int writes: Hello Daniel, I saw your problem. Please check that your environment variables are set properly, in particular NUTCH_HOME, and before that please read the nutch shell script (bin/nutch). I think that will solve your problem. If it's not working you can send me

Re: nutch 0.7.2 does not work

2006-07-13 Thread manish_sanju
Hello, I read your exception. I think that when you copied the nutch 0.7.2.war file from the nutch/build directory after running the ant command, you missed something, as the exception makes clear. Copy org.apache.nutch.searcher.NutchBean from nutch../build. I think it will then search properly. Bye

nutch suitable for blogs?

2006-07-13 Thread Chris Newton
Hi all. First off, I'm using Nutch 0.7.2. I've been playing with Nutch for a couple of weeks now, and have some questions relating to indexing blog sites. Many blog platforms have a changes.xml file posted on some schedule (blogger.com/changes10.xml is every 10 minutes), that lists the blogs

Common words

2006-07-13 Thread Marco Pereira
Hi, Is there a way to make Nutch ignore common words while searching? For example, while searching for "the boy and the girl" it would only look for "boy girl". Thanks, Marco

RE: Common words

2006-07-13 Thread Bogdan Kecman
Is there a way to make Nutch ignore common words while searching? For example, while searching for "the boy and the girl" it would only look for "boy girl". Yes. In the Nutch conf dir there is a file, common-terms.utf8. Copy that file into your Java container as well. Hope this helps, Bogdan
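The behaviour asked about in this thread can be illustrated with a small standalone sketch. It is not Nutch code and it does not read common-terms.utf8; the word list COMMON_WORDS and the function strip_common_words are hypothetical names chosen only for this example.

```python
# Hypothetical stop-word list for illustration only;
# these are NOT the contents of Nutch's common-terms.utf8.
COMMON_WORDS = {"the", "and", "a", "an", "of", "to", "in"}


def strip_common_words(query: str) -> str:
    """Return the query with common words removed, keeping term order."""
    terms = query.lower().split()
    kept = [term for term in terms if term not in COMMON_WORDS]
    return " ".join(kept)


print(strip_common_words("the boy and the girl"))  # boy girl
```

The sketch only shows the query-side effect described in the question; how Nutch itself treats the terms listed in its configuration may differ.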

RE: Common words

2006-07-13 Thread Bogdan Kecman
Is there a way to make Nutch ignore common words while searching? For example, while searching for "the boy and the girl" it would only look for "boy girl". Small addition from wiki: http://wiki.apache.org/nutch/FAQ#head-12f4fd64f03fc3cd0a3063b9283ed829963ed488 You can tweak your

Nutch and the Law

2006-07-13 Thread Marco Pereira
Hi, What if you start indexing video and audio files and, without knowing it, you index some mp3 or video that is illegal or protected by rights? To index video and audio files, should there be a human looking at each indexed file? What do you think? Marco

nutch-0.8.0-dev search error

2006-07-13 Thread Matthew Holt
I successfully ran the intranet crawl and my nutch/crawl dir was generated. I then deployed the war file and stopped/started tomcat from within the crawl directory. However, when I attempt to actually run a search, a page with the following error is returned. Any ideas? Matt *type*

Re: nutch-0.8.0-dev search error

2006-07-13 Thread Timo Scheuer
On Thursday, 13 July 2006 18:47, Matthew Holt wrote: I successfully ran the intranet crawl and my nutch/crawl dir was generated. I then deployed the war file and stopped/started tomcat from within the crawl directory. However, when I attempt to actually run a search, a page with the

Re: nutch-0.8.0-dev search error

2006-07-13 Thread Matthew Holt
Timo Scheuer wrote: On Thursday, 13 July 2006 18:47, Matthew Holt wrote: I successfully ran the intranet crawl and my nutch/crawl dir was generated. I then deployed the war file and stopped/started tomcat from within the crawl directory. However, when I attempt to actually run a search,

0.8.0 stable enough to use?

2006-07-13 Thread Matthew Holt
Just wondering what the general consensus is on using 0.8.0 in production. Do you think it's stable enough to use?? I would ideally want to use 0.7.2, but it is missing the parse-oo plugin that 0.8.0 has. I attempted to port the parse-oo plugin to 0.7.2, but ran into some complications due to

Re: 0.8.0 stable enough to use?

2006-07-13 Thread Jayant Kumar Gandhi
I remember reading in one of the threads a few weeks back that most people agreed that 0.8-dev is stable enough for a release. I don't know what happened after that. I expect something might be out in a couple of weeks or by mid-Aug. Cheers, Jayant On 7/13/06, Matthew Holt [EMAIL PROTECTED] wrote: Just

Takes a long time for the reduce to go from 95% to 100%

2006-07-13 Thread Shekhar, Jayant
Hi, One thing that lots of us have noticed is that it takes a very long time for the reduce to go from 95% to 100% in many cases. I am running a crawl with 250 urls in the CrawlDB using 50 machines. UpdateDB and ReadDB take a long time to go from 95% to 100%. Here is the major problem that I

Re: 0.8.0 stable enough to use?

2006-07-13 Thread Andrzej Bialecki
Jayant Kumar Gandhi wrote: I remember reading in one of the threads a few weeks back that most people agreed that 0.8-dev is stable enough for a release. I don't know what happened after that. I expect something might be out in a couple of weeks or by mid-Aug. ...what happened is that most people went on

Added 0 pages

2006-07-13 Thread Julius Schorzman
I'm having trouble figuring out why I keep getting "Added 0 pages" when running the crawl with Nutch. I've searched the site and can't find an answer as to what might be going wrong. I'm running this on Windows using Eclipse because I may have to change the code slightly. I've already made a few

Recrawl a specific web Page

2006-07-13 Thread Lourival Júnior
How can I recrawl a specific web page? For example, I have an HTML page that is constantly updated. Is there a command for that? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: nutch suitable for blogs?

2006-07-13 Thread Ken Krugler
Hi Chris, Hi all. First off, I'm using Nutch 0.7.2. I've been playing with Nutch for a couple of weeks now, and have some questions relating to indexing blog sites. [snip] Third... just in general... it seems I've had to goof with Nutch's config enough to make this work in this way, that

Re: Added 0 pages

2006-07-13 Thread Karsten Dello
Hi, in my opinion the URL Julius Schorzman wrote, http://www.apache.com, is not matched by the regex +^http://([a-z0-9]*\.)*apache.com/ because it does not end with a trailing slash. Cheers, Karsten
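The point made in this reply can be checked with a short sketch. It uses Python's re module as a stand-in for the Java regex engine that Nutch runs on (the two should agree on a pattern this simple), and it assumes that the leading + is the URL filter's include marker rather than part of the expression. The helper name is_accepted is hypothetical.

```python
import re

# Pattern copied from the message, minus the leading "+"
# (assumed here to be the filter's include marker, not regex syntax).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*apache.com/")


def is_accepted(url: str) -> bool:
    """True if the include pattern matches at the start of the URL."""
    return PATTERN.search(url) is not None


print(is_accepted("http://www.apache.com"))   # False
print(is_accepted("http://www.apache.com/"))  # True
```

Under these assumptions, the only difference between the rejected and the accepted URL is the final slash that the pattern requires.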

Extending scoring plugin

2006-07-13 Thread Jacob Brunson
I'm only a moderately experienced Java programmer, so I was hoping I could get a few pointers about where to begin on a particular problem. I want to increase the score of a search result if the title contains the search query and the result is from a particular site. I thought that I could do

Re: Extending scoring plugin

2006-07-13 Thread Stefan Groschupf
I'm only a moderately experienced Java programmer, so I was hoping I could get a few pointers about where to begin on a particular problem. I want to increase the score of a search result if the title contains the search query and the result is from a particular site. Take a look at the

Re: Extending scoring plugin

2006-07-13 Thread Jacob Brunson
On 7/13/06, Stefan Groschupf [EMAIL PROTECTED] wrote: I'm only a moderately experienced java programmer, so I was hoping I could get a few pointers about where to begin on a particular problem. I want to increase the score of a search result if the title contains the search query and the

Re: Extending scoring plugin

2006-07-13 Thread Andrzej Bialecki
Jacob Brunson wrote: Sorry, maybe I should have made myself a little more clear. I know I can increase the boost generally on title matches, but what I want is to further increase the boost on title matches ONLY IF the URL is from domain XYZ.com. Depending on whether you need this change to