On Aug 21, 12:38 pm, "Ashish Chugh" <[EMAIL PROTECTED]> wrote:
> A few more suggestions. Instead of
> select count(*) from table where HASH_CODE=hc and URL='urlToFind'
> it is better to use
> select count(HASH_CODE) from table where HASH_CODE=hc
> since HASH_CODE is unique.
>
> You can also cache all the hash codes, or put them into a Lucene-based
> index and search there.
> Either way, whether you cache them or index them with Lucene, you need to
> keep that copy in sync with the database.
>
> Regards,
> /Ashish
>
>
>
> On Thu, Aug 21, 2008 at 1:10 AM, Abdul Habra <[EMAIL PROTECTED]> wrote:
> > I agree with Ashish. Use hashCode.
> > Here is my suggestion:
> > Add a new column to your db table, let's call it HASH_CODE.
> > Whenever you add a URL row, populate HASH_CODE with the hash code of the
> > URL.
> > When you want to check for the existence of a URL:
>
> > select count(*) from table where HASH_CODE=hc and URL='urlToFind'
>
> > where hc is the hash code of the urlToFind.
>
> > index your table on HASH_CODE
>
> > On Wed, Aug 20, 2008 at 4:42 AM, Fred <[EMAIL PROTECTED]> wrote:
>
> >> Hi, all:
> >>       I've got a problem: there are millions of URLs in the
> >> database, and more URLs keep arriving. To update the database with
> >> new URLs, I first have to check whether a given URL already exists,
> >> and insert it only if it does not.
> >>       Directly using [ if not exists...] statements puts heavy
> >> pressure on the database, because it's impossible to create an index
> >> on the 'URL' column: some of the URLs are longer than 1024 bytes,
> >> which exceeds the index key size limit of SQL 2005.
> >>       My current solution is a System.Collections.Hashtable. The
> >> logic is simple to implement, but the memory consumption is high:
> >> loading all of my roughly 1.5 million URLs into the hashtable costs
> >> more than 1 GB of memory. 1 GB is acceptable for now, but if the
> >> number of URLs doubles or even triples, the memory consumption could
> >> become a big problem.
> >>       I've also devised some other solutions:
> >>       1. Compute the MD5 code of each URL, then insert the codes
> >> into an RB-tree or B-tree.
> >>       2. First take the substring between the beginning 'http://'
> >> and the first '/' after it, and store it in a hashtable; then store
> >> the remaining string in a sorted list or a B-tree/RB-tree.
> >>       Considering the balance between time and memory consumption,
> >> which do you think would be the best method? Or is there a better
> >> way (there must be) to deal with this?
>

Thanks. Actually, I want a solution that does not add a column to the
database.
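For what it's worth, here is a minimal Python sketch of my solution 1
without any database change: keep only the 16-byte MD5 digest of each URL
in an in-memory set instead of the full string. The function names are
mine, just for illustration.

```python
import hashlib

# Keep only the 16-byte MD5 digest of each URL in memory.
# 1.5 million digests are about 1.5M * 16 bytes = ~24 MB of raw
# payload (plus set overhead), far below the ~1 GB that a hashtable
# of full URL strings costs.

seen = set()

def add_if_new(url):
    """Insert the URL's digest; return True if the URL was not seen before."""
    d = hashlib.md5(url.encode("utf-8")).digest()
    if d in seen:
        return False
    seen.add(d)
    return True
```

At this scale the chance of an MD5 collision (two different URLs mapping
to the same digest) is negligible, so the set alone can answer the
"does this URL exist?" question.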
An index on the hashcode/MD5 is a good idea, but I'm wondering whether,
with millions of URLs, building and searching the index might also be
time-consuming.
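On the index-speed worry: a B-tree index on a fixed-width hash column
stays fast even at millions of rows, since each lookup is O(log n). Here
is a rough sketch of the hash-column scheme suggested above, using
Python's sqlite3 purely to illustrate the idea (on SQL 2005 one would
compute the hash in the application or via a hash function on the server;
the names here are illustrative, not a definitive implementation):

```python
import hashlib
import sqlite3

def url_hash(url):
    # 64-bit slice of the MD5 digest, shifted into signed range for
    # the INTEGER column; any stable hash function works here.
    return int.from_bytes(hashlib.md5(url.encode("utf-8")).digest()[:8], "big") - 2**63

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (hash_code INTEGER, url TEXT)")
# The index is on the fixed-width hash, so the long (unindexable) URL
# column is only compared on the few rows whose hashes match.
conn.execute("CREATE INDEX idx_hash ON urls (hash_code)")

def insert_if_missing(url):
    """Insert the URL if absent; return True if it was new."""
    h = url_hash(url)
    count = conn.execute(
        "SELECT COUNT(*) FROM urls WHERE hash_code = ? AND url = ?", (h, url)
    ).fetchone()[0]
    if count:
        return False
    conn.execute("INSERT INTO urls VALUES (?, ?)", (h, url))
    return True
```

The URL comparison in the WHERE clause guards against hash collisions,
which is why the query checks both columns rather than the hash alone.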
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---
