I am trying to index a local web server with the following structure:

/
-news1
--20021110
---1.html
---2.html
...
---k.html
--20021109
...
--20020101
-news2
-news3

I set the file MODIFICATION dates manually with the command:

touch -m mmdd1111 *.html

After indexing I checked the MySQL database, and the dates there are correct.
Then I search and sort by date, and I get the documents sorted in this manner:
the HTML documents from news1 sorted by date, then the HTML documents from news2,
also sorted by date, then the HTML documents from news3 likewise. Why?

If I search for the last week I get, for example, 10 documents; if I then
search for the week before that, I get 10 documents too. The real number of
documents for this period is larger. Why? What can I do?
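For reference, the timestamp step above can be reproduced in a script. This is only a sketch under assumptions that are not stated in the message: that the second-level directory names are dates in YYYYMMDD form, and that each file should get that date at 11:11 local time (mirroring the `1111` in the touch command). The function name `set_mtimes_from_dirnames` is made up for illustration.

```python
import os
from datetime import datetime
from pathlib import Path


def set_mtimes_from_dirnames(root: Path, hour: int = 11, minute: int = 11) -> int:
    """Set each *.html file's mtime from its parent directory name (YYYYMMDD).

    Assumes the layout <root>/<section>/<YYYYMMDD>/<file>.html.
    Returns the number of files whose timestamp was set.
    """
    changed = 0
    for html in root.glob("*/*/*.html"):
        try:
            day = datetime.strptime(html.parent.name, "%Y%m%d")
        except ValueError:
            continue  # parent directory is not named as a date; skip it
        stamp = day.replace(hour=hour, minute=minute).timestamp()
        atime = html.stat().st_atime  # keep the access time unchanged
        os.utime(html, (atime, stamp))
        changed += 1
    return changed
```

Calling `set_mtimes_from_dirnames(Path("/path/to/docroot"))` would then stamp every file under a date-named directory; the docroot path is a placeholder.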
> With the current state of the Internet, practical determination of
> document dates is not as simple as it should be.
>
> The sources of timing information:
> 1. The HTTP server Last-Modified: field
> 2. Meta information inside the document
> 3. URL, for example, some sites can put documents dated 02.05.2002
>    in the /2002/05/02 folder.
> 4. Dates in the document text
> 5. Persistent search engine database, which, for every document,
>    keeps the date when the document first appeared in the database
>    and the date when its modification was detected during reindexing.
>
> The document date from (1) is useless, as nowadays most documents
> are generated dynamically and the (1) date is always identical to the
> request time. For static files, the HTTP server copies this date from
> the file timestamp, which also is often unrelated to the real
> date of the document's modification or creation. ASPSeek, as I
> understand it, uses dates from (1).
>
> The date (2) is the most reliable and can be easily extracted; however,
> most websites do not specify it, or specify it in an arbitrary format.
>
> The dates (3) and (4) can be reliably extracted only for some selected
> websites and types of documents.
>
> This leaves (5) as the most reliable and universal source of
> document date information. The drawback of this method is that
> document dates are restricted by the time of the database
> creation. However, usually, it is most important to discriminate
> between documents only inside a relatively small interval of time
> from the present. Therefore, a database that has existed for a month
> can already be an important tool for document date information.
>
> If I understand it correctly, ASPSeek does not allow keeping a
> persistent database for a long time, because obsolete
> documents are not automatically removed. Therefore, from time to
> time the database must be cleared and recreated.
>
> I plan to modify the ASPSeek code to introduce automatic removal
> of deleted documents.
> This removal should not be done as soon as
> the server returns a "Not found" code, because the reason may
> be that the server or the connection is down. Several requests
> during a predefined interval of time must be made to ensure that the
> document has indeed been removed from the server for good.
>
> After database persistence is achieved, the next step is
> to introduce gathering of date information from all available
> sources, its judicious use in time ranking, and a nuanced
> presentation of this information to search engine users.
>
> Gregory Kozlovsky
> Project Manager for Information Systems
> International Relations and Security Network (ISN)
> Center for Security Studies
> Swiss Federal Institute of Technology (ETH)
> Leonhardshalde 21, ETH-Zentrum / LEH
> CH-8092 Zürich, Switzerland
> Tel: +41 (0)1 632 63 70
> Fax: +41 (0)1 632 14 13
> Email: [EMAIL PROTECTED]
> http://www.isn.ch
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:yesin@;sipria.msk.ru]
> Sent: Monday, 11 November 2002 17:03
> To: [EMAIL PROTECTED]
> Subject: [aseek-users] Dose anybody in the world uses feature of search on a range of dates ?
>
> Does anybody in the world use the feature of searching on a range of
> dates? Can I get a link to an example? I can't use this feature in
> ASPSeek 1.2.8...12.10
>
> Thank you.
>
> mailto:yesin@;sipria.msk.ru
> http://www.sipria.ru

mailto:yesin@;sipria.msk.ru
http://www.sipria.ru
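The removal rule described in the quoted message (do not delete a document on the first "Not found"; require several failed requests spread over a predefined interval) could be sketched roughly as follows. This is not ASPSeek code: the class name `MissingTracker` and the thresholds `MIN_FAILURES` and `MIN_INTERVAL` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical policy parameters, not ASPSeek settings.
MIN_FAILURES = 3                   # failed requests needed before removal
MIN_INTERVAL = timedelta(days=7)   # minimum time the failures must span


@dataclass
class MissingTracker:
    """Tracks consecutive failed fetches of one document."""
    first_failure: Optional[datetime] = None
    failures: int = 0

    def record(self, found: bool, now: datetime) -> bool:
        """Record one fetch attempt; return True if the document may be removed."""
        if found:
            # Any successful fetch resets the history.
            self.first_failure = None
            self.failures = 0
            return False
        if self.first_failure is None:
            self.first_failure = now
        self.failures += 1
        # Remove only after enough failures spread over a long enough interval.
        return (self.failures >= MIN_FAILURES
                and now - self.first_failure >= MIN_INTERVAL)
```

With these example thresholds, three failures within three days would not trigger removal, while a further failure eight days after the first one would. A temporary outage followed by a successful fetch clears the counter.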
