Hi, all:
       I have the following problem: there are millions of URLs in a
database, and more URLs keep arriving. To update the database with new
URLs, I first have to check whether a given URL already exists and, if
not, insert it.
       Using [ if not exists...] statements directly puts heavy load
on the database, because it's impossible to create an index on the
'URL' column. The reason is that some of the URLs are longer than 1024
bytes, which exceeds the 900-byte index key limit of SQL Server 2005.
       My current solution is System.Collections.Hashtable. The logic
is simple to implement, but the memory consumption is rather large: I
have about 1.5 million URLs, and loading all of them into the
hashtable takes more than 1GB of memory. Of course, 1GB is acceptable
for now, but if the number of URLs doubles or even triples, memory
consumption could become a big problem.
       I've also 'devised' some other solutions:
       1. Compute the MD5 hash of each URL, then insert the hashes into
an RB-tree or B-tree.
       2. First extract the string between the leading 'http://' and
the first '/' after it (i.e. the host part) and save it in a
hashtable; then save the remaining string in a sorted list or
B-tree/RB-tree.
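
To make option 1 concrete, here is a minimal sketch in Python (not the .NET code the post is about). It keeps only the 16-byte MD5 digest of each URL. Note that it swaps in a plain hash set for the RB-tree/B-tree mentioned above, on the assumption that only membership tests are needed and ordered traversal is not. The class name `UrlDigestSet` and the method `add_if_new` are hypothetical names chosen for illustration.

```python
import hashlib


class UrlDigestSet:
    """Remembers URLs by their 16-byte MD5 digest instead of the full string."""

    def __init__(self):
        # Hypothetical in-memory store; a tree could replace it if ordering mattered.
        self._digests = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL and return True if it was not seen before, else False."""
        digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 raw bytes
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

    def __len__(self) -> int:
        return len(self._digests)


seen = UrlDigestSet()
seen.add_if_new("http://example.com/a")  # first time: recorded
seen.add_if_new("http://example.com/a")  # second time: reported as existing
```

The raw digests for 1.5 million URLs come to about 24 MB before container overhead, since each one is 16 bytes regardless of URL length. The trade-off is that two different URLs with the same digest would be treated as one; an accidental collision is extremely unlikely at this scale, but MD5 offers no protection against deliberately crafted collisions.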
       Considering the balance between time and memory consumption,
which do you think would be the best method? Or is there a better
way (there must be) to deal with it?
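
For comparison, option 2 from the list above might be sketched as follows, again in Python and only as an illustration. It assumes every URL starts with 'http://', stores each host once as a dictionary key, and keeps the remainders for that host in a sorted list searched with binary search. The names `split_url` and `HostPathIndex` are hypothetical.

```python
import bisect


def split_url(url: str):
    """Split 'http://host/rest' into (host, rest); rest is '' when no '/' follows."""
    prefix = "http://"
    body = url[len(prefix):] if url.startswith(prefix) else url
    host, _, rest = body.partition("/")
    return host, rest


class HostPathIndex:
    """Maps each host to a sorted list of the URL remainders seen for it."""

    def __init__(self):
        self._paths_by_host = {}

    def add_if_new(self, url: str) -> bool:
        """Record the URL and return True if it was not seen before, else False."""
        host, rest = split_url(url)
        paths = self._paths_by_host.setdefault(host, [])
        i = bisect.bisect_left(paths, rest)  # binary search in the sorted list
        if i < len(paths) and paths[i] == rest:
            return False
        paths.insert(i, rest)  # keeps the list sorted; O(n) shift per insert
        return True
```

The memory saving here comes only from not repeating the host string, so it depends on how many URLs share a host. Inserting into a sorted list shifts elements, which is why a balanced tree may be preferable when a single host has very many URLs. As written, 'http://example.com' and 'http://example.com/' both produce an empty remainder and are treated as the same URL.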