Re: [Nutch-general] Tomcat without Apache

2007-08-02 Thread Enzo Michelangeli
Has anybody noticed that the RSS (i.e., OpenSearch) links in the pages returned directly by Tomcat are wrong if the webapp's URL is like http://localhost:8080/nutch-0.9/ ? Those links lack the '/nutch-0.9' part. On the other hand, they are OK if the access passes through an Apache connector.
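A possible workaround, sketched on the assumption that the link is emitted by the search JSP and that the stock OpenSearch servlet is mapped at /opensearch: build the href from request.getContextPath() instead of the server root, so it also resolves correctly when the webapp lives under /nutch-0.9. The queryString variable below is only illustrative.

    <%
      // build the OpenSearch link relative to the webapp's context path,
      // not the server root, so it survives deployment under /nutch-0.9
      String rssLink = request.getContextPath() + "/opensearch?query="
                       + java.net.URLEncoder.encode(queryString, "UTF-8");
    %>
    <link rel="alternate" type="application/rss+xml"
          title="Nutch OpenSearch results" href="<%= rssLink %>"/>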

Re: [Nutch-general] error merger index

2007-07-29 Thread Enzo Michelangeli
- Original Message - From: Le Quoc Anh [EMAIL PROTECTED] Sent: Sunday, July 29, 2007 5:14 PM Hi everyone, when I recrawl I must delete the indexes and index files and re-create the index. If I only index the segments that I have just fetched and merge them with the existing index, an error

[Nutch-general] How to determine the number of pages in the index?

2007-07-28 Thread Enzo Michelangeli
Is there a quick way of knowing how many pages are indexed (_not_ how many are referenced in crawldb as fetched URL's)? I could use Luke to peek inside the indexes and get the Number of documents, but they are located on a remote headless server with only SSH access... (OK, I actually did access

Re: [Nutch-general] How to determine the number of pages in the index?

2007-07-28 Thread Enzo Michelangeli
Look at the org.apache.lucene.index.IndexReader.numDocs() method. You can write a simple utility to run it in the shell. On 7/28/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: Is there a quick way of knowing how many pages are indexed (_not_ how many are referenced in crawldb as fetched URL's)? I could
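A minimal sketch of such a utility, assuming the Lucene version bundled with Nutch 0.8/0.9 and an index directory such as crawl/index; compile it against the Lucene jar from Nutch's lib directory and run it over SSH:

    import org.apache.lucene.index.IndexReader;

    public class CountDocs {
        public static void main(String[] args) throws Exception {
            // args[0] is the index directory, e.g. crawl/index
            IndexReader reader = IndexReader.open(args[0]);
            System.out.println("Documents in index: " + reader.numDocs());
            reader.close();
        }
    }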

[Nutch-general] URL to RSS (i.e., opensearch) doesn't include the app name

2007-07-11 Thread Enzo Michelangeli
Not sure if this is a bug or a misconfiguration on my side, but here we go: I installed the Nutch searcher webapp just by dropping its WAR file in Tomcat's webapps directory, so the main URL for it is (for nutch-0.9.war) http://localhost:8080/nutch-0.9/ . After I perform a search, in the

Re: [Nutch-general] integrate Nutch into my php front page

2007-06-29 Thread Enzo Michelangeli
Another way would be to rewrite search.jsp so that it returns XML or JSON rather than HTML, and then have the PHP code issue a GET request to that page and parse the results (the SOLR approach, so to speak). The JVM (and Tomcat) would obviously still have to run, but that could be done on a different machine.
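Nutch already ships an OpenSearch servlet that returns results as XML (RSS), so the front end only has to issue the GET and parse the feed. A rough sketch of the client side in Java; the /opensearch path, the query parameter and the localhost URL are assumptions based on the stock webapp, and the PHP equivalent would be file_get_contents() plus SimpleXML.

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class OpenSearchClient {
        public static void main(String[] args) throws Exception {
            // assumes the webapp is deployed as /nutch-0.9 on localhost:8080
            String q = URLEncoder.encode(args[0], "UTF-8");
            URL feed = new URL("http://localhost:8080/nutch-0.9/opensearch?query=" + q);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(feed.openStream());
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // each <item> carries title, link and description for one hit
                System.out.println(item.getElementsByTagName("link")
                        .item(0).getTextContent());
            }
        }
    }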

Re: [Nutch-general] integrate Nutch into my php front page

2007-06-29 Thread Enzo Michelangeli
has supported this for a long time already, and many people make good use of it. -Roger - Original Message - From: Enzo Michelangeli [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, June 30, 2007 11:08 AM Subject: Re: integrate Nutch into my php front page Another way

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do you crawl the actual pages like you would in an intranet crawl? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say that I want to go into the

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 5:48 PM Enzo Michelangeli wrote: - Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do you crawl the actual pages like you would

[Nutch-general] Incremental indexing

2007-06-12 Thread Enzo Michelangeli
As the size of my data keeps growing, and the indexing time grows even faster, I'm trying to switch from a "re-index everything at every crawl" model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge
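For the merge step, a sketch at the plain Lucene level (Nutch's own bin/nutch merge / IndexMerger does essentially this with extra housekeeping, and a dedup pass is usually wanted afterwards); the Lucene version and the paths are assumptions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class MergeNewIndex {
        public static void main(String[] args) throws Exception {
            // args[0] = existing index, args[1..n] = indexes of newly fetched segments
            IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), false);
            IndexReader[] newOnes = new IndexReader[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                newOnes[i - 1] = IndexReader.open(args[i]);
            }
            writer.addIndexes(newOnes); // folds the new indexes into the existing one
            writer.optimize();
            writer.close();
            for (int i = 0; i < newOnes.length; i++) {
                newOnes[i].close();
            }
        }
    }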

Re: [Nutch-general] crawling by ip range

2007-06-12 Thread Enzo Michelangeli
I have written a custom URLFilter that resolves the hostname into an IP address and checks the latter against a GeoIP database. Unfortunately the source code was developed under a commercial contract, and is not freely available. Enzo - Original Message - From: Cesar Voulgaris [EMAIL
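For anyone wanting to roll their own, a rough outline of such a filter against Nutch 0.8/0.9's URLFilter interface; the GeoIP lookup itself is left as a hypothetical isAllowedCountry() stub, since the original implementation is not available.

    import java.net.InetAddress;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    public class GeoIPURLFilter implements URLFilter {
        private Configuration conf;

        public String filter(String urlString) {
            try {
                String host = new URL(urlString).getHost();
                String ip = InetAddress.getByName(host).getHostAddress();
                // returning null tells Nutch to drop the URL
                return isAllowedCountry(ip) ? urlString : null;
            } catch (Exception e) {
                return null; // malformed URLs and unresolvable hosts are dropped
            }
        }

        private boolean isAllowedCountry(String ip) {
            // hypothetical: consult a GeoIP database (e.g. MaxMind) here
            return true;
        }

        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }
    }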

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Phạm Hải Thanh [EMAIL PROTECTED] Sent: Tuesday, June 12, 2007 9:29 AM Hi all, I have a problem with the cache: after crawling, searching works successfully, but the cached page is displayed with square question marks. Please take a look at

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Phạm Hải Thanh [EMAIL PROTECTED] Sent: Tuesday, June 12, 2007 10:06 AM Oops, I am sorry, here is the link: http://203.162.71.66:8080/cached.jsp?idx=0&id=1 I think this is an encoding issue too :( It looks fine to me, both with Firefox and MSIE 7

Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-08 Thread Enzo Michelangeli
- Original Message - From: Doğacan Güney [EMAIL PROTECTED] Sent: Friday, June 08, 2007 3:49 PM [...] Any idea? This will certainly help a lot. If it is not too much trouble, can you add debug outputs for hashCodes of conf objects (both for the one in the cache and for the parameter,

Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-08 Thread Enzo Michelangeli
- Original Message - From: Doğacan Güney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, June 08, 2007 8:27 PM Subject: Re: Loading mechnism of plugin classes and singleton objects [...] This is strange, because, as you can see below, the strings that make keys and values of conf

Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-07 Thread Enzo Michelangeli
.. that's all I have to say at the moment. On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, It seems that plugin-loading code is somehow broken. There is some discussion going on about this on http://www.nabble.com/forum/ViewPost.jtp?post=10844164framed=y . On 6/5/07, Enzo

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-05 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Monday, June 04, 2007 2:05 PM Er... I saw it mentioned at http://wiki.apache.org/nutch/FetchOptions , so I thought it was for real... Sorry, this page is wrong and should be corrected - some of the options listed

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-05 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Tuesday, June 05, 2007 4:56 PM [...] You can achieve a somewhat similar effect by controlling the number of fetcher threads. I realize this is not as accurate as a specific control mechanism, but so far it was
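In practice that means tuning the standard fetcher properties in nutch-site.xml rather than the throttle option; fetcher.threads.fetch and fetcher.server.delay are the stock property names from nutch-default.xml, and the values below are only examples.

    <property>
      <name>fetcher.threads.fetch</name>
      <value>4</value>
      <description>Fewer concurrent fetcher threads = lower aggregate bandwidth.</description>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
      <description>Seconds between successive requests to the same server.</description>
    </property>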

[Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-04 Thread Enzo Michelangeli
I have a question about the loading mechanism of plugin classes. I'm working with a custom URLFilter, and I need a singleton object loaded and initialized by the first instance of the URLFilter, and shared by other instances (e.g., instantiated by other threads). I was assuming that the
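The kind of arrangement being assumed is sketched below: a lazily initialized, process-wide singleton guarded by double-checked locking (valid on Java 5+). Note that this only works if all URLFilter instances are loaded by the same plugin classloader, which is exactly what the rest of this thread turns out to hinge on; the class and field names are illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class SharedResource {
        private static volatile SharedResource instance;

        // first caller builds the singleton; later instances/threads reuse it
        public static SharedResource get(Configuration conf) {
            if (instance == null) {
                synchronized (SharedResource.class) {
                    if (instance == null) {
                        instance = new SharedResource(conf);
                    }
                }
            }
            return instance;
        }

        private SharedResource(Configuration conf) {
            // load the expensive shared data (e.g. a lookup database) once here
        }
    }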

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-04 Thread Enzo Michelangeli
it in my plugin's finalize() method, AND ensuring that the finalizers be called on exit by placing somewhere in the initialization code the (deprecated) call System.runFinalizersOnExit(true) . Enzo - Original Message - From: Enzo Michelangeli [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent
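An alternative to the deprecated System.runFinalizersOnExit(true), sketched here and not what was actually used in the plugin, is to register a JVM shutdown hook once at initialization time:

    // register cleanup once, instead of relying on finalize() being run on exit
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
            // release the shared resource here (close files, flush caches, ...)
        }
    });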

[Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-03 Thread Enzo Michelangeli
In my case (with Nutch 0.8), it seems not: I set it to 500, and the fetcher still saturates the 1.5 Mbit/s link... Is it supposed to work for the total bandwidth, or for each thread? Enzo

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-03 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Monday, June 04, 2007 1:31 AM Enzo Michelangeli wrote: In my case (with Nutch 0.8), it seems not: I set it to 500, and the fetcher still saturates the 1.5 Mbit/s link... Is it supposed to work for the total bandwidth

Re: [Nutch-general] Parallelizing URLFiltering

2007-06-01 Thread Enzo Michelangeli
- Original Message - From: Dennis Kubes [EMAIL PROTECTED] Sent: Friday, June 01, 2007 12:44 PM [...] We are also using BIND and our current index is 52,519,267 pages so you should be fine with this. I think djbdns is just easier to use. Are you using any big DNS caches as backups?

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 2:25 PM Are you running jobs in the local mode? In distributed mode filtering is naturally parallel, because you have as many concurrent lookups as there are map tasks. I'm just using the

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to) respond that quickly ... Then why

[Nutch-general] Parallelizing URLFiltering

2007-05-30 Thread Enzo Michelangeli
Is there a way of parallelizing URLFiltering over multiple threads? After all, the URLFilters themselves must already be thread-safe, or else they would have problems during fetching. The reason why I'm asking is I have a custom URLFilter that needs to make calls to the DNS resolver, and
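One way to take most of the DNS cost out of the filter itself, independent of threading, is to memoize lookups per host. A crude sketch (Java 5+, no TTL or negative caching; the class and method names are illustrative); on top of this, the JVM's own InetAddress cache and a local caching resolver (BIND or djbdns, as discussed later in the thread) reduce the latency of the lookups that do go out.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    public class CachingResolver {
        private final ConcurrentHashMap<String, String> cache =
                new ConcurrentHashMap<String, String>();

        // repeated URLs on the same host skip the DNS round-trip
        public String resolve(String host) throws UnknownHostException {
            String ip = cache.get(host);
            if (ip == null) {
                ip = InetAddress.getByName(host).getHostAddress();
                cache.putIfAbsent(host, ip);
            }
            return ip;
        }
    }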

Re: [Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Enzo Michelangeli
The ssh client is provided by the OpenSSH package, which can be installed through the Cygwin setup (under the net category). Enzo - Original Message - From: Ilya Vishnevsky [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, May 30, 2007 7:56 PM Subject: Nutch on Windows. ssh:

Re: [Nutch-general] Deleting crawl still gives proper results

2007-05-28 Thread Enzo Michelangeli
the whole crawldb? Can anyone please tell me where it caches the whole crawldb? I don't think it is possible to cache it in RAM. Is it cached in some location on the hard disk? Please clarify this point. On 5/27/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: - Original Message

Re: [Nutch-general] Deleting crawl still gives proper results

2007-05-26 Thread Enzo Michelangeli
- Original Message - From: Manoharam Reddy [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, May 26, 2007 6:23 PM After I create the crawldb after running bin/nutch crawl, I start my Tomcat server. It gives proper search results. What I am wondering is that even after I delete,

Re: [Nutch-general] nutch-site.xml vs. nutch-default.xml

2007-05-26 Thread Enzo Michelangeli
I have a similar problem, but only under Cygwin. Apparently, changes to the value of plugin.includes are noticed only if made in nutch-default.xml, not in nutch-site.xml . The same happened to another user (see http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200609.mbox/[EMAIL
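For reference, the override is supposed to look like the snippet below in conf/nutch-site.xml (values there normally take precedence over nutch-default.xml); the value shown mirrors the 0.9 default with nothing added, so check it against your own nutch-default.xml before copying.

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regexp naming the plugins to load; editing it here should
      override the copy in nutch-default.xml.</description>
    </property>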

[Nutch-general] Filtering links from crawldb

2007-05-24 Thread Enzo Michelangeli
I understand that mergedb ... -filter can be used to remove links that do not meet the filtering requirements of the active URLFilters. However, mergedb operates on the whole crawldb, and can be very slow. Is there a way of enforcing filtering at updatedb time, preventing the unfetchable links

[Nutch-general] Getting Nutch running with UTF-8

2007-05-03 Thread Enzo Michelangeli
At http://wiki.apache.org/nutch/GettingNutchRunningWithUtf8 it is suggested, in order to handle UTF-8 characters in GET parameters, to change the configuration of the application server. Why can't the webapp just switch the request object to UTF-8 encoding, e.g. by placing in the head section
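The in-page part is standard Servlet/JSP API, sketched below; the caveat, and probably the reason the wiki edits server.xml, is that Tomcat applies request.setCharacterEncoding() to POST bodies, while parameters in the GET query string are decoded by the connector according to its URIEncoding / useBodyEncodingForURI attributes.

    <%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
    <%
      // must run before the first call to request.getParameter()
      request.setCharacterEncoding("UTF-8");
    %>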