Re: Got it up and running

2006-03-30 Thread Raghavendra Prabhu
Hi da, can you uninstall JRE 1.5 on the system and then test out the new things? I want to confirm the problems you are seeing, but I have not been able to reproduce the problem of an irrelevant title coming up. Anyway, do it in your free time. How is your prep going for Nov? Rgds Pra

Re: Got it up and running

2006-03-30 Thread Lukas Vlcek
Hello, I tried this and I think I found an issue. I searched for the word "Hello" and reviewed the first hits. It can't return the third or fourth page, ... try yourself: http://71.35.163.79/search.jsp?query=hello&start=20&hitsPerPage=10&hitsPerSite=2&clustering= Interestingly, it works if you modi
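
One possible explanation for the paging failure, offered here as a guess rather than a diagnosis (the truncated message above does not confirm it): per-site deduplication can shrink the usable hit list below the requested window, so a request with start=20 has nothing left to return. The toy simulation below is plain Java with made-up numbers, not Nutch code:

    import java.util.*;

    // Toy simulation (NOT Nutch code) of per-site dedup interacting with paging.
    public class DedupPagingDemo {
        public static void main(String[] args) {
            // Pretend raw result list: 100 hits spread over only 5 sites.
            List<String> raw = new ArrayList<>();
            for (int i = 0; i < 100; i++) raw.add("site" + (i % 5) + ".example");

            int hitsPerSite = 2, start = 20, hitsPerPage = 10;

            // Keep at most hitsPerSite hits from each site.
            Map<String, Integer> perSite = new HashMap<>();
            List<String> deduped = new ArrayList<>();
            for (String site : raw) {
                if (perSite.merge(site, 1, Integer::sum) <= hitsPerSite) {
                    deduped.add(site);
                }
            }

            // Only 10 hits survive, so the window [20, 30) is empty: "page 3" fails.
            int end = Math.min(start + hitsPerPage, deduped.size());
            System.out.println("usable hits: " + deduped.size());
            System.out.println("hits on requested page: " + Math.max(0, end - start));
        }
    }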

Got it up and running

2006-03-30 Thread Dan Morrill
Thanks folks, I ran a field test today and just slotted in the new index on the test server. If anyone wants to take a look at it, the FE is not customized yet, but the help has been excellent, and I just want to show off the work. It's at http://71.35.163.79/index.jsp if you are interested. The

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Mehmet Tan wrote: Hi, I want to ask a question about redirections. Correct me if I'm wrong, but if a page is redirected to a page that is already in the webdb, then the next updatedb operation will overwrite all previous info about refetch, because it is a newly creat
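
Restating the concern in this thread: a redirect target comes out of the fetcher looking like a brand-new page, so a naive updatedb merge replaces the existing webdb record and with it the adaptively tuned refetch data. The sketch below illustrates the problem and one fix in plain Java; the names (PageRecord, fetchInterval, nextFetchTime) are illustrative only, not the actual Nutch webdb API:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical record type; NOT the real Nutch classes.
    class PageRecord {
        final String url;
        long nextFetchTime;   // when to refetch this page
        float fetchInterval;  // adaptively tuned refetch interval
        PageRecord(String url, long nextFetchTime, float fetchInterval) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    class WebDbUpdateSketch {
        Map<String, PageRecord> db = new HashMap<>();

        // The problem: the redirect target arrives from the fetcher as a
        // "new" page carrying default refetch settings, and a blind put()
        // wipes out the tuned interval already stored for that URL.
        void naiveUpdate(PageRecord fromFetcher) {
            db.put(fromFetcher.url, fromFetcher);
        }

        // One fix in the spirit of the thread: when the URL already exists,
        // carry the old adaptive refetch metadata over to the new record.
        void preservingUpdate(PageRecord fromFetcher) {
            PageRecord existing = db.get(fromFetcher.url);
            if (existing != null) {
                fromFetcher.fetchInterval = existing.fetchInterval;
                fromFetcher.nextFetchTime = existing.nextFetchTime;
            }
            db.put(fromFetcher.url, fromFetcher);
        }
    }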

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki
Mehmet Tan wrote: Hi, I want to ask a question about redirections. Correct me if I'm wrong, but if a page is redirected to a page that is already in the webdb, then the next updatedb operation will overwrite all previous info about refetch, because it is a newly created page in the fetcher wh

Re: Common Terms

2006-03-30 Thread Rajesh Munavalli
I guess you would need them in a phrase query. If you do not index them, you would never be able to retrieve something like "the americas". --Rajesh Munavalli Blog: http://mathsearch.blogspot.com On 3/30/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote: > >Thanks. That gives me the full l
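
To make the trade-off concrete: once the analyzer drops stop words, the indexed token stream no longer contains them, so an exact phrase that includes a stop word has nothing left to match. A small plain-Java illustration (not the actual NutchAnalysis code, and with a made-up stop list):

    import java.util.*;

    public class StopWordDemo {
        static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

        // Crude stand-in for an analyzer: lowercase, split, drop stop words.
        static List<String> analyze(String text) {
            List<String> terms = new ArrayList<>();
            for (String t : text.toLowerCase().split("\\s+")) {
                if (!STOP.contains(t)) terms.add(t);
            }
            return terms;
        }

        public static void main(String[] args) {
            // The indexed form loses "the", so the exact phrase
            // "the americas" no longer exists as an adjacent term pair.
            System.out.println(analyze("a history of the americas")); // [history, americas]
            System.out.println(analyze("the americas"));              // [americas]
        }
    }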

RE: Common Terms

2006-03-30 Thread Vanderdray, Jacob
Thanks. That gives me the full list. The odd thing to me is that none of those words will end up being effective in a search, so why not strip them all out during indexing? Thanks again, Jake. -Original Message- From: Rajesh Munavalli [mailto:[EMAIL PROTECTED] Sent: Thursday, M

Re: Common Terms

2006-03-30 Thread Rajesh Munavalli
There is a list of stop words in the NutchAnalysis class (org.apache.nutch.analysis). I guess that's where the common terms are removed during analysis. --Rajesh Munavalli Blog: http://mathsearch.blogspot.com Vanderdray, Jacob wrote: I've added some code to query-basic to log the query aft

Re: Common Terms

2006-03-30 Thread Rajesh Munavalli
There is a list of stop words in the NutchAnalysis class (org.apache.nutch.analysis). I guess that's where the common terms are removed during analysis. --Rajesh Munavalli Blog: http://mathsearch.blogspot.com On 3/30/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote: > >I've added some code to

Common Terms

2006-03-30 Thread Vanderdray, Jacob
I've added some code to query-basic to log the query after it has run both addTerms and addPhrases. This helps me to better understand what's going on. I've noticed that when my search contains words like "the" or "a", those don't appear in the actual query. It looks to me like t

Re: Multiple crawls how to get them to work together

2006-03-30 Thread Berlin Brown
Do you have that shell script? On 3/30/06, Dan Morrill <[EMAIL PROTECTED]> wrote: > Hi folks, > > It worked, it worked great, I made a shell script to do the work for me. > Thank you, thank you, and again, thank you. > > r/d > > -Original Message- > From: Dan Morrill [mailto:[EMAIL PROTECT

Re: html parser

2006-03-30 Thread Rajesh Munavalli
Oops... actually I meant to ask about the XHTML parser. Is it safe to use an HTML parser to parse XHTML? On 3/30/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > Rajesh Munavalli wrote: > > Does anyone know where I can get the source code for the html parser which > is in > > the plugins directory? > > > > Whic

Re: html parser

2006-03-30 Thread Andrzej Bialecki
Rajesh Munavalli wrote: Does anyone know where I can get the source code for the html parser which is in the plugins directory? Which one? parse-html uses two parsers: one is called CyberNeko, the other is called TagSoup. You can find their home pages and their sources easily through Google.
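
On the XHTML question raised above: well-formed XHTML generally goes through CyberNeko without trouble, since NekoHTML is built to tolerate anything from tag soup up to valid markup. A minimal standalone sketch, assuming the nekohtml and xerces jars are on the classpath (note that NekoHTML normalizes element names to upper case by default):

    import java.io.StringReader;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NekoDemo {
        public static void main(String[] args) throws Exception {
            String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                         + "<body><p>hello</p></body></html>";
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(new StringReader(xhtml)));
            Document doc = parser.getDocument();
            // Prints "HTML" (upper case) under Neko's default settings.
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }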

html parser

2006-03-30 Thread Rajesh Munavalli
Does anyone know where I can get the source code for the html parser which is in the plugins directory?

Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)

2006-03-30 Thread mike c
Hi Erik, Thanks for pointing this out, as I just got Ferret working with indexes created using Nutch. Any recommendations on how to address this issue? -Mike On 3/30/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: > There is one incompatibility between Ferret and Java Lucene of note. > It is the "U

Re: [Nutch-general] Re: Using Nutch with Ferret (ruby)

2006-03-30 Thread Erik Hatcher
There is one incompatibility between Ferret and Java Lucene of note. It is the "UTF-8" issue that has surfaced with regards to Java Lucene. All can be well between Java Lucene and Ferret, until characters in another range are indexed, and then Ferret will blow up trying to search the inde

Re: Using Nutch with Ferret (ruby)

2006-03-30 Thread mike c
Thanks. I'll try it out. In the meantime, if I get Ferret working I'll post an update. -Mike On 3/30/06, Steven Yelton <[EMAIL PROTECTED]> wrote: > I use WEBrick instead of tomcat to query and serve search results. I > used ruby's 'rjb' to bridge the gap. > > http://raa.ruby-lang.org/project/

Re: Legal issues

2006-03-30 Thread TDLN
The "official" reason one reads about, is that in case the server that the page resides on is down or unreachable, the user can stil access the search result. The Google Terms phrase it like this: "Google stores many web pages in its cache to retrieve for users as a back-up in case the page's serv

Re: Using Nutch with Ferret (ruby)

2006-03-30 Thread Steven Yelton
I use WEBrick instead of tomcat to query and serve search results. I used ruby's 'rjb' to bridge the gap. http://raa.ruby-lang.org/project/rjb/ There may be more direct ways (ruby<->lucene), but this was quick and easy and still has decent performance. Steven mike c wrote: Hi all, I was

Using Nutch with Ferret (ruby)

2006-03-30 Thread mike c
Hi all, I was wondering if anyone is using Nutch (for crawling) with Ferret (indexing / searching). Basically, my front-end is built using Ruby on Rails, which is why I'm asking. I have the Nutch crawler up and running fine, but can't seem to figure out how to integrate the two. Any help is appreci

RE: Multiple crawls how to get them to work together

2006-03-30 Thread Dan Morrill
Hi folks, It worked, it worked great, I made a shell script to do the work for me. Thank you, thank you, and again, thank you. r/d -Original Message- From: Dan Morrill [mailto:[EMAIL PROTECTED] Sent: Thursday, March 30, 2006 5:12 AM To: nutch-user@lucene.apache.org Subject: RE: Multipl

Re: Legal issues

2006-03-30 Thread Insurance Squared Inc.
I'm not trying to argue legalities, just pointing out that there's an undercurrent out there in the community, some backlash against SEs and crawlers because of the cache. Here's an example: this guy (http://incredibill.blogspot.com/) is scraper/bot/crawler crazy. And he actively

Re: Legal issues

2006-03-30 Thread Nutch Newbie
Hmmm... how about this: the photographer who takes a photo holds the copyright over the photo, not the subject of the picture, whether that's you, me or any other photographed object. So caching is nothing but taking a picture using another sort of camera, called a robot :-) Nothing more really. If a browser maker decide

Re: Legal issues

2006-03-30 Thread Insurance Squared Inc.
FWIW, I believe all of what's been stated is the case - and I'd also assume that since Google/MSN/Yahoo are all doing this, it's been tested and is OK. However, I know many people complain about the cache. Some people see it as a copyright violation - technically correct or not, the cache doe

RE: Legal issues

2006-03-30 Thread Dan Morrill
If I remember it correctly, Google has been sued on this issue, and won, a number of times; you can cache and you can search others' web sites. Groklaw has the data on this one, but I know you can search, and you can cache under fair use and the idea of public access, as long as you are not cracking passwor

RE: Multiple crawls how to get them to work together

2006-03-30 Thread Dan Morrill
Aled, I'll try that today, excellent, and thanks for the heads-up on the db directory. I'll let you know how it goes. r/d -Original Message- From: Aled Jones [mailto:[EMAIL PROTECTED] Sent: Thursday, March 30, 2006 12:24 AM To: nutch-user@lucene.apache.org Subject: ATB: Multiple crawl

Feeds / Nutch

2006-03-30 Thread Richard Rodrigues
Hello, I would like to know what would be the best way to crawl webpages from a feed like http://www.weblogs.com/changes.xml ? The goal would be, for example, to crawl every day each page that has changed and is monitored by weblogs.com. Best, Richard
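
A rough sketch of one approach (not a built-in Nutch feature): fetch changes.xml, extract the changed URLs, and write them out as a seed list that a normal crawl can start from. The weblogUpdates/weblog element and its url attribute follow the usual weblogs.com format and should be verified against the live feed; the seeds/urls.txt output path is just an example:

    import java.io.PrintWriter;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ChangesToSeeds {
        public static void main(String[] args) throws Exception {
            // Parse the ping feed; each <weblog url="..."/> is a changed page.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL("http://www.weblogs.com/changes.xml").openStream());
            NodeList weblogs = doc.getElementsByTagName("weblog");
            try (PrintWriter out = new PrintWriter("seeds/urls.txt")) {
                for (int i = 0; i < weblogs.getLength(); i++) {
                    String url = ((Element) weblogs.item(i)).getAttribute("url");
                    if (!url.isEmpty()) out.println(url); // one seed URL per line
                }
            }
        }
    }

Running something like this daily (e.g. from cron) and re-crawling the resulting seed list would approximate what Richard describes.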

Re: Legal issues

2006-03-30 Thread TDLN
Google's and Yahoo's Terms of Service provide interesting reading regarding such legal issues. http://www.google.com/terms_of_service.html http://docs.yahoo.com/info/terms/ Rgrds, Thomas On 3/30/06, gekkokid <[EMAIL PROTECTED]> wrote: > > Shouldn't be a problem if your honouring the robots.txt >

Crawler

2006-03-30 Thread David Webster
Can someone recommend a good crawler to work with CLucene?

Re: Legal issues

2006-03-30 Thread gekkokid
Shouldn't be a problem if you're honouring the robots.txt. A legal issue could be stealing copyrighted material? That's if you're reproducing it, but if you're analysing the content and links and keeping to the robots.txt rules, I doubt you'll have a problem, unless it's crawling every 10 minutes, wouldn't

Re: ATB: Multiple crawls how to get them to work together

2006-03-30 Thread Berlin Brown
So the 'db' is never used during searching. Interesting. 'segments' is more for run-time use. On 3/30/06, Aled Jones <[EMAIL PROTECTED]> wrote: > Hi Dan > > I'll presume you've done the crawls already.. > > Each resulting crawled folder should have 3 folders, db, index and > segments

ATB: Multiple crawls how to get them to work together

2006-03-30 Thread Aled Jones
Hi Dan I'll presume you've done the crawls already. Each resulting crawl folder should have three folders: db, index and segments. Create your search.dir folder and create a segments folder in that. Each segments folder in each crawl folder should contain folders with timestamps as the names. C
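
Dan's shell script is not shown in the digest, but the copy step Aled describes is mechanical enough to sketch. The Java program below does the equivalent; the crawl folder names and the search.dir path are examples, not values from the thread:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class MergeSegments {
        public static void main(String[] args) throws IOException {
            Path target = Paths.get("search.dir/segments");
            Files.createDirectories(target);
            // Example crawl folders, each containing db/, index/ and segments/.
            String[] crawls = { "crawl-site1", "crawl-site2" };
            for (String crawl : crawls) {
                try (Stream<Path> dirs = Files.list(Paths.get(crawl, "segments"))) {
                    // Each entry is a timestamped segment directory.
                    for (Path seg : (Iterable<Path>) dirs::iterator) {
                        copyTree(seg, target.resolve(seg.getFileName()));
                    }
                }
            }
        }

        // Recursively copy one segment directory into the merged location.
        static void copyTree(Path src, Path dst) throws IOException {
            try (Stream<Path> walk = Files.walk(src)) {
                for (Path p : (Iterable<Path>) walk::iterator) {
                    Path q = dst.resolve(src.relativize(p).toString());
                    if (Files.isDirectory(p)) {
                        Files.createDirectories(q);
                    } else {
                        Files.copy(p, q, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }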