Re: [Nutch-general] Deleting crawl still gives proper results

Enzo Michelangeli Mon, 28 May 2007 08:17:52 -0700

Not the crawldb, and surely not entire files, but information about the indexes.
If you modify directory information while files are still held open by a process
(e.g. by renaming the directory that contains them and creating a new directory
with the old name), the process keeps accessing the original files on disk
until it closes and reopens them (hence my question about mergesegs and
mergedb).
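
To illustrate the point outside of Nutch, here is a minimal shell sketch of
the same effect on a POSIX filesystem. The directory and file names
("index", "part-0") are made up for the demonstration and are not Nutch's
real layout; the shell's file descriptor 3 stands in for the webapp's open
index files.

```shell
set -eu

# Scratch area; all names below are hypothetical, not Nutch paths.
work=$(mktemp -d)
mkdir "$work/index"
echo "old index" > "$work/index/part-0"

# Open the file and keep the descriptor, as a long-running process would.
exec 3< "$work/index/part-0"

# Swap the directory: rename the old one, create a new one with the old name.
mv "$work/index" "$work/index.old"
mkdir "$work/index"
echo "new index" > "$work/index/part-0"

# The descriptor opened before the swap still refers to the original file.
read -r held <&3
exec 3<&-

# Opening the same path again resolves to the new file.
read -r fresh < "$work/index/part-0"

echo "held descriptor sees: $held"
echo "fresh open sees:      $fresh"

rm -rf "$work"
```

The same holds if the old directory is deleted instead of renamed: removing
a file only removes its name, and the data stays readable through any
descriptor that was already open until that descriptor is closed.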


----- Original Message ----- 
From: "Manoharam Reddy" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, May 28, 2007 1:53 PM
Subject: Re: Deleting crawl still gives proper results


> The webapp caches the whole crawldb? Can anyone please tell me where
> does it cache the whole crawldb? I don't think it is possible to cache
> it in RAM. Is it cached in some location on the hard disk?
>
> Please clarify this point.
>
> On 5/27/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
>> ----- Original Message -----
>> From: "Manoharam Reddy" <[EMAIL PROTECTED]>
>> To: <[EMAIL PROTECTED]>
>> Sent: Saturday, May 26, 2007 6:23 PM
>>
>> > After I create the crawldb by running bin/nutch crawl, I start my
>> > Tomcat server. It gives proper search results.
>> >
>> > What I am wondering is that even after I delete the 'crawl' folder,
>> > the search page still gives proper search results. How is this
>> > possible? Only after I restart the Tomcat server does it stop giving
>> > results.
>>
>> The webapp seems to cache data. I have a related problem: updates to the
>> indexes are only noticed after restarting Tomcat (so I have scheduled a
>> nightly cron job to do that).
>>
>> Question for the Ones Who Know: in "bin/nutch mergesegs", can I use the same
>> directory for input and output?
>>
>> For example:
>>
>>  bin/nutch mergesegs crawl/segments -dir crawl/segments
>>
>> Same for mergedb: can I issue:
>>
>>   bin/nutch mergedb crawl/crawldb crawl/crawldb
>>
>> At present I pass through temporary directories, and then I switch them in
>> place of the old ones with a couple of "mv", but I don't know if that's
>> necessary, or may even be harmful (for example, leaving the webapp, unaware
>> of the "mv", pointing to the inode of the old directory). And I noticed that
>> "bin/nutch mergedb" does not create the output directory until it's done, so
>> I wonder if the explicit use of a temporary directory in my scripts is
>> redundant.
>>
>> Enzo
>>
>>
>>
> 


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
