Has anybody noticed that the RSS (i.e., OpenSearch) links in the pages
returned directly by Tomcat are wrong if the webapp's URL is like
http://localhost:8080/nutch-0.9/ ? Those links lack the '/nutch-0.9' part.
On the other hand, they are correct if access goes through an Apache
connector.
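If so, I suppose the fix would be to build the link from the request's
context path instead of hard-coding the server root. A minimal JSP sketch of
what I mean (assuming the link is emitted by search.jsp; queryString is an
illustrative variable, not the actual one in the page):

<%
  // Build the OpenSearch link relative to the webapp's actual mount point,
  // so it works both at / and at /nutch-0.9/.
  String contextPath = request.getContextPath();  // "" or "/nutch-0.9"
  String rssLink = contextPath + "/opensearch?query="
                 + java.net.URLEncoder.encode(queryString, "UTF-8");
%>
<link rel="alternate" type="application/rss+xml" href="<%= rssLink %>"/>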
- Original Message -
From: Le Quoc Anh [EMAIL PROTECTED]
Sent: Sunday, July 29, 2007 5:14 PM
Hi everyone,
When I recrawl, I must delete the indexes and index files and re-create the
index from scratch. If I index only the segments that I have just fetched and
merge them with the existing index, I get an error
Is there a quick way of knowing how many pages are indexed (_not_ how many
are referenced in crawldb as fetched URLs)? I could use Luke to peek inside
the indexes and get the number of documents, but they are located on a
remote headless server with only SSH access... (OK, I actually did access
Take a look at the org.apache.lucene.index.IndexReader.numDocs() method. You can
write a simple utility to run it in the shell.
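For example, something along these lines (a minimal sketch against the
Lucene API bundled with Nutch; pass the index directory on the command line):

import org.apache.lucene.index.IndexReader;

// Prints the number of documents in a Lucene index.
// Usage: java -cp <nutch and lucene jars> NumDocs /path/to/crawl/index
public class NumDocs {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);
    try {
      System.out.println("documents: " + reader.numDocs());
    } finally {
      reader.close();
    }
  }
}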
On 7/28/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
Is there a quick way of knowing how many pages are indexed (_not_ how many
are referenced in crawldb as fetched URLs)? I could
Not sure if this is a bug or a misconfiguration on my side, but here we go:
I installed the Nutch searcher webapp just by dropping its WAR file in
Tomcat's webapps directory, so the main URL for it is (for nutch-0.9.war)
http://localhost:8080/nutch-0.9/ . After I perform a search, in the
Another way would be to rewrite search.jsp so that it returns XML or JSON
rather than HTML, and then have the PHP code issue a GET request to that page
and parse the results (the SOLR approach, so to speak). The JVM (and Tomcat)
would obviously still have to run, but that could be done on a different machine.
has supported this for a long time already, and many people make good use
of it.
-Roger
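The round trip looks roughly like this, sketched in Java for concreteness
(in PHP it would be the same GET plus XML parse; this assumes the stock
OpenSearch servlet mapped at /opensearch, and a webapp deployed as nutch-0.9):

import java.net.URL;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Query Nutch's OpenSearch servlet and print every <title> element
// (the first is the channel title, the rest are the hits).
public class OpenSearchClient {
  public static void main(String[] args) throws Exception {
    String q = URLEncoder.encode(args[0], "UTF-8");
    URL url = new URL("http://localhost:8080/nutch-0.9/opensearch?query=" + q);
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(url.openStream());
    NodeList titles = doc.getElementsByTagName("title");
    for (int i = 0; i < titles.getLength(); i++) {
      System.out.println(titles.item(i).getTextContent());
    }
  }
}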
- Original Message -
From: Enzo Michelangeli [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, June 30, 2007 11:08 AM
Subject: Re: integrate Nutch into my php front page
Another way
- Original Message -
From: Berlin Brown [EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 11:24 AM
Yeah, but how do you crawl the actual pages like you would in an intranet
crawl? For example, let's say that I have 20 URLs in my set from the
DmozParser. Let's also say that I want to go into the
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 5:48 PM
Enzo Michelangeli wrote:
- Original Message - From: Berlin Brown
[EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 11:24 AM
Yeah, but how do you crawl the actual pages like you would
As the size of my data keeps growing, and the indexing time grows even
faster, I'm trying to switch from a "reindex everything at every crawl" model
to an incremental indexing one. I intend to keep the segments separate, but I
want to index only the segment fetched during the last cycle, and then merge
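In plain Lucene terms, the merge step I have in mind is essentially this
(a sketch against the Lucene 2.x API; the paths are illustrative, and the
IndexMerger tool invoked by bin/nutch merge does something similar):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge the index built from the latest segment into the existing index,
// instead of reindexing everything from scratch.
public class MergeNewSegmentIndex {
  public static void main(String[] args) throws Exception {
    IndexWriter writer =
        new IndexWriter("crawl/index", new StandardAnalyzer(), false); // append
    Directory[] latest = { FSDirectory.getDirectory("crawl/index-latest") };
    writer.addIndexes(latest);  // fold the new index into the existing one
    writer.close();
  }
}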
I have written a custom URLFilter that resolves the hostname into an IP
address and checks the latter against a GeoIP database. Unfortunately the
source code was developed under a commercial contract, and is not freely
available.
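The general shape is easy to sketch, though. Something like the following,
assuming the MaxMind GeoIP Java API (the property names are illustrative,
and this is not the contracted code):

import java.net.InetAddress;
import java.net.URL;
import com.maxmind.geoip.LookupService;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Reject URLs whose host does not resolve to an allowed country.
public class GeoIPURLFilter implements URLFilter {
  private Configuration conf;
  private LookupService geoip;
  private String allowedCountry;

  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost();
      String ip = InetAddress.getByName(host).getHostAddress();
      String country = geoip.getCountry(ip).getCode();
      return allowedCountry.equals(country) ? urlString : null;
    } catch (Exception e) {
      return null; // unresolvable hosts are filtered out
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    try {
      geoip = new LookupService(conf.get("urlfilter.geoip.db"),
                                LookupService.GEOIP_MEMORY_CACHE);
    } catch (java.io.IOException e) {
      throw new RuntimeException(e);
    }
    allowedCountry = conf.get("urlfilter.geoip.country", "US");
  }

  public Configuration getConf() {
    return conf;
  }
}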
Enzo
- Original Message -
From: Cesar Voulgaris [EMAIL
- Original Message -
From: Phạm Hải Thanh [EMAIL PROTECTED]
Sent: Tuesday, June 12, 2007 9:29 AM
Hi all,
I have a problem with the cache: after crawling, searching works fine, but the
cached page is displayed with square question marks. Please take a look at
- Original Message -
From: Phạm Hải Thanh [EMAIL PROTECTED]
Sent: Tuesday, June 12, 2007 10:06 AM
Oops, I am sorry, here is the link:
http://203.162.71.66:8080/cached.jsp?idx=0&id=1
I also think this is an encoding issue :(
It looks fine to me, both with Firefox and MSIE 7
- Original Message -
From: Doğacan Güney [EMAIL PROTECTED]
Sent: Friday, June 08, 2007 3:49 PM
[...]
Any idea?
This will certainly help a lot. If it is not too much trouble, can you
add debug outputs for hashCodes of conf objects (both for the one in
the cache and for the parameter,
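Something like this at both lookup sites would do (a sketch; LOG and the
variable names are whatever is in scope there, and System.identityHashCode
tells distinct instances apart even when hashCode() collides):

LOG.info("cached conf: id=" + System.identityHashCode(cachedConf)
         + " hash=" + cachedConf.hashCode());
LOG.info("param conf:  id=" + System.identityHashCode(conf)
         + " hash=" + conf.hashCode());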
- Original Message -
From: Doğacan Güney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 08, 2007 8:27 PM
Subject: Re: Loading mechanism of plugin classes and singleton objects
[...]
This is strange, because, as you can see below, the strings that make up
the keys and values of conf
... that's all I have to say at the moment.
On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
It seems that plugin-loading code is somehow broken. There is some
discussion going on about this on
http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .
On 6/5/07, Enzo
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Monday, June 04, 2007 2:05 PM
Er... I saw it mentioned at http://wiki.apache.org/nutch/FetchOptions ,
so I thought it was for real...
Sorry, this page is wrong and should be corrected - some of the options
listed
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Tuesday, June 05, 2007 4:56 PM
[...]
You can achieve a somewhat similar effect by controlling the number of
fetcher threads. I realize this is not as accurate as a specific control
mechanism, but so far it was
I have a question about the loading mechanism of plugin classes. I'm working
with a custom URLFilter, and I need a singleton object loaded and
initialized by the first instance of the URLFilter and shared by the other
instances (e.g., those instantiated by other threads). I was assuming that the
[...] it in
my plugin's finalize() method, AND ensuring that the finalizers are called on
exit by placing the (deprecated) call System.runFinalizersOnExit(true)
somewhere in the initialization code.
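For what it's worth, the pattern I have in mind looks roughly like this
(a sketch; a JVM shutdown hook is the usual substitute for the deprecated
runFinalizersOnExit call):

// Lazily-initialized singleton shared by all URLFilter instances,
// with cleanup registered as a shutdown hook instead of a finalizer.
public class SharedResource {
  private static SharedResource instance;

  public static synchronized SharedResource getInstance() {
    if (instance == null) {
      instance = new SharedResource();
      Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
          instance.close();
        }
      });
    }
    return instance;
  }

  private SharedResource() {
    // expensive initialization here
  }

  public void close() {
    // release native handles, flush caches, etc.
  }
}

Note that "one per JVM" really means "one per classloader that loads the
class", which may matter given how Nutch's plugin classloaders behave.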
Enzo
- Original Message -
From: Enzo Michelangeli [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent
In my case (with Nutch 0.8), it seems not: I set it to 500, and the fetcher
still saturates the 1.5 Mbit/s link... Is it supposed to limit the total
bandwidth, or each thread's?
Enzo
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Monday, June 04, 2007 1:31 AM
Enzo Michelangeli wrote:
In my case (with Nutch 0.8), it seems not: I set it to 500, and the
fetcher still saturates the 1.5 Mbit/s link... Is it supposed to limit
the total bandwidth
- Original Message -
From: Dennis Kubes [EMAIL PROTECTED]
Sent: Friday, June 01, 2007 12:44 PM
[...]
We are also using BIND and our current index is 52,519,267 pages so you
should be fine with this. I think djbdns is just easier to use. Are you
using any big DNS caches as backups?
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 2:25 PM
Are you running jobs in the local mode? In distributed mode filtering is
naturally parallel, because you have as many concurrent lookups as there
are map tasks.
I'm just using the
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 11:39 PM
Caching seems to be the only solution. Even if you were able to fire DNS
requests more rapidly, remote servers wouldn't be able (or wouldn't like
to) respond that quickly ...
Then why
Is there a way of parallelizing URL filtering over multiple threads? After
all, the URLFilters themselves must already be thread-safe, or else they
would have problems during fetching.
The reason I'm asking is that I have a custom URLFilter that needs to make
calls to the DNS resolver, and
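To make the question concrete, the kind of driver I have in mind is sketched
below (the names are illustrative; note also that the JVM's own resolver
cache is tunable via the networkaddress.cache.ttl security property, which
may matter as much as the thread count):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import org.apache.nutch.net.URLFilter;

// Run a (thread-safe) URLFilter over a batch of URLs in parallel,
// so slow DNS lookups overlap instead of serializing.
public class ParallelFilter {
  public static List<Future<String>> filterAll(
      final URLFilter filter, List<String> urls, int threads)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<String>> results = new ArrayList<Future<String>>();
    for (final String url : urls) {
      results.add(pool.submit(new Callable<String>() {
        public String call() {
          return filter.filter(url);  // null means "rejected"
        }
      }));
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    return results;
  }
}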
The ssh client is provided by the OpenSSH package, which can be installed
through the Cygwin setup program (under the "Net" category).
Enzo
- Original Message -
From: Ilya Vishnevsky [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 30, 2007 7:56 PM
Subject: Nutch on Windows. ssh:
the whole crawldb? Can anyone please tell me where
it caches the whole crawldb? I don't think it is possible to cache
it in RAM. Is it cached in some location on the hard disk?
Please clarify this point.
On 5/27/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
- Original Message
- Original Message -
From: Manoharam Reddy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, May 26, 2007 6:23 PM
After I create the crawldb by running bin/nutch crawl, I start my
Tomcat server, and it gives proper search results.
What I am wondering is that even after I delete,
I have a similar problem, but only under Cygwin. Apparently, changes to the
value of plugin.includes are noticed only if made in nutch-default.xml, not
in nutch-site.xml .
The same happened to another user (see
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200609.mbox/[EMAIL
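For reference, the override in nutch-site.xml uses the standard Hadoop-style
property block, something like this (the value shown is illustrative: the
usual defaults plus a custom filter plugin named my-urlfilter):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|my-urlfilter</value>
  <description>Overrides the list in nutch-default.xml.</description>
</property>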
I understand that mergedb ... -filter can be used to remove links that do
not meet the filtering requirements of the active URLFilters. However,
mergedb operates on the whole crawldb, and can be very slow. Is there a way
of enforcing filtering at updatedb time, preventing the unfetchable links
At http://wiki.apache.org/nutch/GettingNutchRunningWithUtf8 it is suggested,
in order to handle UTF-8 characters in GET parameters, to change the
configuration of the application server. Why can't the webapp just switch
the request object to UTF-8 encoding, e.g. by placing in the head
section
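For what it's worth, the in-webapp switch would presumably be a servlet
filter like the sketch below. The catch, and probably why the wiki points at
the server configuration instead, is that in Tomcat setCharacterEncoding()
only affects the request body: GET parameters are decoded according to the
connector's URIEncoding attribute.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Force UTF-8 decoding of request parameters from inside the webapp.
// (In Tomcat this covers POST bodies; query strings still follow the
// connector's URIEncoding setting.)
public class EncodingFilter implements Filter {
  public void init(FilterConfig config) {}

  public void doFilter(ServletRequest req, ServletResponse res,
                       FilterChain chain) throws IOException, ServletException {
    req.setCharacterEncoding("UTF-8");
    chain.doFilter(req, res);
  }

  public void destroy() {}
}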