On 4/26/23 15:48, Nicolas George wrote:
> David Christensen (12023-04-26):
>> I suggest hashing the document content rather than the URL. This would work
>> nicely for static documents.
> That will be very convenient to retrieve the document content from the
> URL.
My suggestion assumes that the URL => hash => content mapping is saved
somehow. For example, save the content in a file named after the hash
and save the URL in a file whose name is the hash plus a suffix.
Finding a document by URL then becomes a grep(1) invocation.
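A minimal sketch of that layout, in Python for concreteness; the cache path, URL, and sample content are illustrative, and the grep(1) step is shown as an in-process scan:

```python
# Sketch: store content under its hash, and the URL in a hash-plus-suffix file.
import hashlib
import pathlib

cache = pathlib.Path("cache")
cache.mkdir(exist_ok=True)

url = "https://example.com/doc.html"        # illustrative
content = b"hello world"                    # stands in for the fetched document

h = hashlib.sha256(content).hexdigest()
(cache / h).write_bytes(content)               # content file named after its hash
(cache / (h + ".url")).write_text(url + "\n")  # URL file: hash plus a suffix

# Finding a document by URL -- the grep(1) invocation, done here in-process:
hits = [p for p in cache.glob("*.url") if url in p.read_text()]
```

Stripping the ".url" suffix from a hit gives the path of the cached content
itself.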
Things get more interesting when you approach the problem as a database.
Save the content wherever and put the metadata into a table -- content
hash (primary key), URL, download timestamp, author, subject, title,
keywords, etc. Create fully inverted indexes. Create a search engine.
Create a spider. Implementation could range from a CSV/TSV flat-file
and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and
beyond (NoSQL, N-tier). There are distributed file sharing systems
based on such ideas.
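The metadata-table end of that range can be sketched with sqlite3 from the
Python standard library; the column names and sample row are assumptions, not
a fixed schema:

```python
# Sketch: metadata table keyed by content hash, as described above.
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE documents (
        hash     TEXT PRIMARY KEY,   -- content hash
        url      TEXT,
        fetched  TEXT,               -- download timestamp
        author   TEXT,
        subject  TEXT,
        title    TEXT,
        keywords TEXT
    )
""")

content = b"hello world"                    # stands in for stored content
h = hashlib.sha256(content).hexdigest()
db.execute(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?, ?)",
    (h, "https://example.com/doc.html", "2023-04-26T15:48:00",
     "unknown", "example", "Example document", "hash,cache,index"),
)

# A crude keyword lookup; a real search engine would use inverted indexes
# (e.g. SQLite's FTS5 extension) rather than LIKE scans.
rows = db.execute(
    "SELECT url, title FROM documents WHERE keywords LIKE ?", ("%cache%",)
).fetchall()
```

The same schema ports unchanged to a LAMP stack; only the driver and the
full-text-index mechanism change.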
David