A few more suggestions.

Instead of

select count(*) from table where HASH_CODE=hc and URL='urlToFind'

use

select count(HASH_CODE) from table where HASH_CODE=hc

The latter is better, since HASH_CODE is unique and the query never has to
compare the long URL column.
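Here is a minimal JDBC sketch of that check; the table name URL_TABLE and the
class name are just placeholders, adjust them to your schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class UrlExistenceCheck {

    // Returns true if a row with the given hash code already exists.
    // Relies on the index on HASH_CODE, so the long URL column is never touched.
    public static boolean exists(Connection conn, int hashCode) throws SQLException {
        String sql = "SELECT COUNT(HASH_CODE) FROM URL_TABLE WHERE HASH_CODE = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, hashCode);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1) > 0;
            }
        }
    }
}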

You can also cache all the hash codes in memory, or put them into a
Lucene-based index and search from there. Whichever you choose, a cache or a
Lucene index, you will need to keep it in sync with the database.
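A rough sketch of the in-memory cache idea, assuming the same placeholder
URL_TABLE schema as above; note that callers must add() after every successful
insert so the cache stays in sync with the database:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

public class HashCodeCache {
    // Only the int hash codes are kept, which is far cheaper than
    // holding the full URL strings in memory.
    private final Set<Integer> hashCodes = new HashSet<>();

    // Load every existing hash code once at startup.
    public void load(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT HASH_CODE FROM URL_TABLE")) {
            while (rs.next()) {
                hashCodes.add(rs.getInt(1));
            }
        }
    }

    // True if the hash code is already known.
    public boolean contains(int hashCode) {
        return hashCodes.contains(hashCode);
    }

    // Call this after inserting a new row so the cache matches the database.
    public void add(int hashCode) {
        hashCodes.add(hashCode);
    }
}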

Regards,
/Ashish

On Thu, Aug 21, 2008 at 1:10 AM, Abdul Habra <[EMAIL PROTECTED]> wrote:

> I agree with Ashish. Use hashCode.
> Here is my suggestion:
> Add a new column to your db table, let's call it HASH_CODE.
> Whenever you add a URL row, populate HASH_CODE with the hash code of the
> URL.
> When you want to check for the existence of a URL:
>
> select count(*) from table where HASH_CODE=hc and URL='urlToFind'
>
> where hc is the hash code of urlToFind.
>
> Index your table on HASH_CODE.
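>
> A minimal JDBC sketch of this scheme (the table name URL_TABLE and the class
> name are just placeholders):
>
> import java.sql.Connection;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.sql.SQLException;
>
> public class UrlStore {
>
>     // Insert the URL only if it is not already present, using the indexed
>     // HASH_CODE column to narrow the search and the URL itself to rule out
>     // hash collisions.
>     public static void insertIfAbsent(Connection conn, String url) throws SQLException {
>         int hc = url.hashCode();
>         String check = "SELECT COUNT(*) FROM URL_TABLE WHERE HASH_CODE = ? AND URL = ?";
>         try (PreparedStatement ps = conn.prepareStatement(check)) {
>             ps.setInt(1, hc);
>             ps.setString(2, url);
>             try (ResultSet rs = ps.executeQuery()) {
>                 rs.next();
>                 if (rs.getInt(1) > 0) {
>                     return;  // already stored
>                 }
>             }
>         }
>         String insert = "INSERT INTO URL_TABLE (URL, HASH_CODE) VALUES (?, ?)";
>         try (PreparedStatement ps = conn.prepareStatement(insert)) {
>             ps.setString(1, url);
>             ps.setInt(2, hc);
>             ps.executeUpdate();
>         }
>     }
> }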
>
>
>
> On Wed, Aug 20, 2008 at 4:42 AM, Fred <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi, all:
>>       I've got such a problem: there are millions of URLs in the
>> database, and more URLs are being obtained. In order to update the
>> database with new URLs, firstly, I have to judge whether a certain URL
>> exists, if not then insert it into the database.
>>       Directly using [ if not exists...] sentences poses large
>> pressure on database, for it's impossible to create an index on 'URL'
>> column. The reason is some of the URLs have a length more than 1024
>> bytes, which exceeds the limit of SQL 2005.
>>       My current solution is a System.Collections.Hashtable. The logic
>> is simple to implement, but the memory consumption is rather large:
>> loading about 1.5 million URLs into the hashtable takes more than 1GB
>> of memory. 1GB is acceptable for now, but if the number of URLs doubles
>> or triples, memory consumption could become a big problem.
>>       I've also 'devised' some other solutions:
>>       1. Compute the MD5 hash of each URL, then insert the hashes into
>> an RB-tree or B-tree (a rough sketch follows below).
>>       2. First take the substring between the leading 'http://' and the
>> first '/' after it and store it in a hashtable; then store the
>> remaining string in a sorted list or a B-tree/RB-tree.
>>       Considering the balance between time and memory consumption,
>> which do you think would be the best method, or is there a better way
>> (there must be) to deal with this?
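>>
>> Roughly, option 1 would look something like this sketch (Java is shown
>> only for illustration, the class and method names are made up; the same
>> idea works in .NET):
>>
>> import java.math.BigInteger;
>> import java.nio.charset.StandardCharsets;
>> import java.security.MessageDigest;
>> import java.security.NoSuchAlgorithmException;
>> import java.util.TreeSet;
>>
>> public class Md5UrlSet {
>>     // TreeSet is backed by a red-black tree: O(log n) insert and lookup.
>>     // Only the 16-byte MD5 digests are stored, not the full URL strings.
>>     private final TreeSet<BigInteger> digests = new TreeSet<BigInteger>();
>>
>>     private static BigInteger md5(String url) {
>>         try {
>>             MessageDigest md = MessageDigest.getInstance("MD5");
>>             return new BigInteger(1, md.digest(url.getBytes(StandardCharsets.UTF_8)));
>>         } catch (NoSuchAlgorithmException e) {
>>             throw new IllegalStateException(e);
>>         }
>>     }
>>
>>     // Returns true if the URL was new and its digest has now been recorded.
>>     public boolean addIfAbsent(String url) {
>>         return digests.add(md5(url));
>>     }
>> }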
>>


