Hi da
Can you uninstall JRE 1.5 on the system and then test with the new
things?
I want to confirm the problems you are encountering, but I have not
been able to reproduce the problem of the irrelevant title coming up.
Anyway, do it in your free time..
How is your prep going for November?
Rgds
Pra
Hello,
I tried this and I think I found an issue. I searched for the word
"Hello" and reviewed the first hits. It can't return the third or fourth
page of results; try it yourself:
http://71.35.163.79/search.jsp?query=hello&start=20&hitsPerPage=10&hitsPerSite=2&clustering=
Interestingly, it works if you modi
Thanks folks, I ran a field test today and slotted the new index into
the test server. If anyone wants to take a look at it, the FE is not
customized yet, but the help has been excellent, and I just want to show off
the work.
It's at http://71.35.163.79/index.jsp if you are interested. The
Mehmet Tan wrote:
Hi,
I want to ask a question about redirections. Correct me if I'm wrong,
but if a page is redirected to a page that is already in the webdb, then
the next updatedb operation will overwrite all previous info about refetch,
because it is a newly created page in the fetcher wh
I guess you would need them in phrase queries. If you do not index them, you
would never be able to retrieve something like "the americas".
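For illustration only (my own sketch, not from the thread, and assuming a
reasonably recent Lucene, 6+, rather than the 2006-era version discussed
here): a phrase query for "the americas" still contains the term "the", so
if the analyzer dropped that term at index time there is nothing at that
position for the query to match.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseQueryExample {
    public static void main(String[] args) {
        // Exact-phrase query against a hypothetical "content" field.
        // If "the" was stripped as a stop word during indexing, this phrase
        // can never match, because that term is simply absent from the index.
        PhraseQuery query = new PhraseQuery.Builder()
                .add(new Term("content", "the"))
                .add(new Term("content", "americas"))
                .build();
        System.out.println(query);   // prints: content:"the americas"
    }
}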
--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com
On 3/30/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
>
>Thanks. That gives me the full l
Thanks. That gives me the full list. The odd thing to me is
that none of those words will end up being effective in a search, so why
not strip them all out during indexing?
Thanks again,
Jake.
-Original Message-
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]
Sent: Thursday, M
There is a list of stop words in the NutchAnalysis class
(org.apache.nutch.analysis). I guess that's where the common terms are
removed during analysis.
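As a minimal illustration of that removal (my own sketch, not the
NutchAnalysis code; it assumes a recent Lucene, 5+, with the common
analyzers module on the classpath), the following runs a short string
through a stop-word-aware analyzer and prints the tokens that survive:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer drops common terms ("the", "and", "a", ...) and stems the rest.
        try (Analyzer analyzer = new EnglishAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("content", "the americas and a map");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);   // prints "america" and "map"; the stop words are gone
            }
            ts.end();
            ts.close();
        }
    }
}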
--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com
Vanderdray, Jacob wrote:
I've added some code to query-basic to log the query aft
I've added some code to query-basic to log the query after it
has run both addTerms and addPhrases. This helps me to better
understand what's going on. I've noticed that when my search contains
words like "the" or "a", those don't appear in the actual query.
It looks to me like t
Do you have that shell script?
On 3/30/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
> Hi folks,
>
> It worked, it worked great, I made a shell script to do the work for me.
> Thank you, thank you, and again, thank you.
>
> r/d
>
> -Original Message-
> From: Dan Morrill [mailto:[EMAIL PROTECT
Oops... actually I meant to ask about the XHTML parser. Is it safe to use the
HTML parser to parse XHTML?
On 3/30/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> Rajesh Munavalli wrote:
> > Does anyone know where I can get the source code for html parser which
> is in
> > the plugins directory?
> >
>
> Whic
Rajesh Munavalli wrote:
Does anyone know where I can get the source code for the HTML parser which is
in the plugins directory?
Which one? parse-html uses two parsers: one is called CyberNeko, the
other is called TagSoup. You can find their home pages and their sources
easily through Google.
Hi Erik,
Thanks for pointing this out, as I just got Ferret working with
indexes created using Nutch. Any recommendations on how to address
this issue?
-Mike
On 3/30/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> There is one incompatibility between Ferret and Java Lucene of note.
> It is the "U
There is one incompatibility between Ferret and Java Lucene of note.
It is the "UTF-8" issue that has surfaced with regard to Java
Lucene. All can be well between Java Lucene and Ferret until
characters in another range are indexed, and then Ferret will blow up
trying to search the inde
Thanks. I'll try it out. In the meantime, if I get Ferret working
I'll post an update.
-Mike
On 3/30/06, Steven Yelton <[EMAIL PROTECTED]> wrote:
> I use WEBrick instead of tomcat to query and serve search results. I
> used ruby's 'rjb' to bridge the gap.
>
> http://raa.ruby-lang.org/project/
The "official" reason one reads about, is that in case the server that the
page resides on is
down or unreachable, the user can stil access the search result. The Google
Terms phrase it like this: "Google stores many
web pages in its cache to retrieve for users as a back-up in case the page's
serv
I use WEBrick instead of Tomcat to query and serve search results. I
used Ruby's 'rjb' to bridge the gap.
http://raa.ruby-lang.org/project/rjb/
There may be more direct ways (ruby<->lucene), but this was quick and
easy and still has decent performance.
Steven
mike c wrote:
Hi all,
I was
Hi all,
I was wondering if anyone is using Nutch (for crawling) with Ferret
(indexing / searching). Basically, my front end is built using Ruby
on Rails, which is why I'm asking. I have the Nutch crawler up and
running fine, but can't seem to figure out how to integrate the two.
Any help is appreci
Hi folks,
It worked, it worked great, I made a shell script to do the work for me.
Thank you, thank you, and again, thank you.
r/d
-Original Message-
From: Dan Morrill [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 30, 2006 5:12 AM
To: nutch-user@lucene.apache.org
Subject: RE: Multipl
I'm not trying to argue legalities, just pointing out that there's an
undercurrent out there in the community, some backlash against search
engines and crawlers because of the cache. Here's an example: this
guy, http://incredibill.blogspot.com/, is scraper/bot/crawler crazy.
And he actively
Hmmm.. How about this... The photographer who takes a photo holds the
copyright to the photo, not the subject of the picture, whether that is
you, me or any other photographed object. So caching is nothing but taking
a picture using another sort of camera, called a robot :-) Nothing more,
really. If a browser maker decide
FWIW, I believe all of what's been stated is the case, and I'd also
assume that since Google/MSN/Yahoo are all doing this, it's been
tested and is OK.
However, I know many people complain about the cache. Some people see it
as a copyright violation; technically correct or not, the cache doe
If I remember it correctly, Google has been sued and won a number of times on
this issue. You can cache, and you can search other people's web sites; Groklaw
has the data on this one. I know you can search and you can cache under fair use
and the idea of public access, as long as you are not cracking passwor
Aled,
Excellent, I'll try that today, and thanks for the heads-up on the db
directory. I'll let you know how it goes.
r/d
-Original Message-
From: Aled Jones [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 30, 2006 12:24 AM
To: nutch-user@lucene.apache.org
Subject: ATB: Multiple crawl
Hello,
I would like to know what would be the best way to crawl web pages from a
feed like
http://www.weblogs.com/changes.xml ?
The goal would be, for example,
to crawl every day each page that has changed and is monitored by weblogs.com.
Best,
Richard
Google's and Yahoo's Terms of Service provide interesting reading regarding
such legal issues.
http://www.google.com/terms_of_service.html
http://docs.yahoo.com/info/terms/
Rgrds, Thomas
On 3/30/06, gekkokid <[EMAIL PROTECTED]> wrote:
>
> Shouldn't be a problem if your honouring the robots.txt
>
Can someone recommend a good crawler to work with CLucene?
Shouldn't be a problem if you're honouring the robots.txt.
The legal issue could be stealing copyrighted material, but that's only if
you're reproducing it. If you're analysing the content and links and keeping
to the robots.txt rules, I doubt you'll have a problem, unless it's crawling
every 10 minutes,
wouldn't
So the 'db' is never used during searching. Interesting.
'segments' is more for run-time use.
On 3/30/06, Aled Jones <[EMAIL PROTECTED]> wrote:
> Hi Dan
>
> I'll presume you've done the crawls already..
>
> Each resulting crawled folder should have 3 folders, db, index and
> segments
Hi Dan
I'll presume you've done the crawls already.
Each resulting crawl folder should have three folders: db, index and
segments.
Create your search.dir folder and create a segments folder in that.
Each segments folder in each crawl folder should contain folders with
timestamps as the names. C
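To make the copying concrete, here is a rough sketch of that step in plain
Java (java.nio.file, so a far newer JDK than the 2006 setup in this thread;
the crawl folder names are made up). A simple shell loop copying each
timestamped segment folder into search.dir/segments does exactly the same job.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class MergeCrawlSegments {
    public static void main(String[] args) throws IOException {
        // Target layout: search.dir/segments/<timestamped segment folders>
        Path searchSegments = Paths.get("search.dir", "segments");
        Files.createDirectories(searchSegments);

        // Hypothetical crawl output folders, each containing db, index and segments.
        String[] crawls = {"crawl-site1", "crawl-site2"};
        for (String crawl : crawls) {
            Path segments = Paths.get(crawl, "segments");
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(segments)) {
                for (Path segment : entries) {   // e.g. 20060330121501
                    copyTree(segment, searchSegments.resolve(segment.getFileName()));
                }
            }
        }
    }

    // Recursively copy one segment folder into the merged segments directory.
    private static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            paths.forEach(p -> {
                try {
                    Files.copy(p, target.resolve(source.relativize(p)),
                               StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}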