On Nov 20, 2011, at 8:33, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:

> Hi.
> 
> Want to start another thread, loosely related to the performance
> problems thread I have going.
> 
> Need some DB design guidance from the gurus here.
> 
> My big table now has about 70 million rows, with the following columns:
> 
> 
> alias           | character varying(35)
> url             | text
> modify_date     | timestamp without time zone
> ip              | bigint
> 
> 

While the schema is useful, you need to describe HOW the data is being used if 
you want help finding ways to improve performance.

> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL.

Given a base URL, can you encode an algorithm that generates the user-specific 
URL on demand, then maybe cache that result in the application?
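
A minimal sketch of the idea (the function name and the "?u=" scheme are 
invented, since I don't know your URL structure):

    -- derive the per-user URL on the fly instead of storing one row per (ip, url)
    CREATE FUNCTION user_url(base_url text, ip bigint) RETURNS text AS $$
        SELECT $1 || '?u=' || $2::text;
    $$ LANGUAGE sql IMMUTABLE;

    SELECT user_url('http://example.com/page', 999);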

> But the index
> based on this became huge, as some URLs are gigantic. so I introduced

What does this mean?  Are there any patterns to the URLs that you can leverage 
(like, say, grouping them by domain name)?  Is there a lot of overlap between 
users, so that having a URL table with a bigint PK would make a difference?
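
Something along these lines, with invented names:

    -- each distinct URL stored once, keyed by an 8-byte integer
    CREATE TABLE urls (
        url_id  bigserial PRIMARY KEY,
        url     text NOT NULL
    );
    -- unique on the md5 expression so gigantic URLs don't bloat the index
    CREATE UNIQUE INDEX urls_md5_idx ON urls (md5(url));

    -- the 70-million-row table then carries the narrow key, not the URL
    CREATE TABLE user_urls (
        ip          bigint NOT NULL,
        url_id      bigint NOT NULL REFERENCES urls (url_id),
        modify_date timestamp,
        PRIMARY KEY (ip, url_id)
    );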


> an md5 of the URL:
> 
> 
> url_md5             | varchar(32)
> 
> 
> I now have two scenarios:
> 
> 1. To have an index (unique?) on "(ip, url_md5)"
> 
> 2. To have an index on just the "ip". This way a query that tries
> to match   "...WHERE ip = 999 AND url_md5 = '<md5 here>'..." will
> still look only at the ip part of the index, then refine the result
> with the url_md5.
> 
> The good thing about #2 is that the size of the index remains very
> small, with only a bigint field (ip) being indexed.
> 
> The bad thing about #2 is that each query of "...WHERE ip = 999 AND
> url_md5 = '<md5 here>'..." will have to filter all the rows matching
> the indexed IP. If one IP address has put in a lot of URLs, this
> becomes a bit slow. As is now happening: I have users who have over
> 1 million URLs each!

Create an additional partial index on the URL for any IP address with more than 
X records?  Your smallish users then only need the IP index, while the big ones 
use that PLUS their personal URL index.
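
For example (the table name "links" and the IP list are invented):

    -- everyone gets the cheap single-column index
    CREATE INDEX links_ip_idx ON links (ip);

    -- only the heavy hitters pay for the wide composite index
    CREATE INDEX links_heavy_idx ON links (ip, url_md5)
        WHERE ip IN (999, 12345);

The planner will pick the partial index whenever the query pins ip to one of 
those values; everyone else falls back to the small one.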

> 
> Questions:
> 
> 1. Instead of md5, is there an integer hashing algorithm that would
> allow me to have a bigint column and hopefully save a lot in both
> storage space and speed?  (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
> )
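
On 1: you can fold the md5 down to 8 bytes right in SQL.  An untested
sketch (you give up some collision resistance, so check it against your data):

    -- first 16 hex digits of the md5, reinterpreted as a bigint
    SELECT ('x' || substr(md5('http://example.com/x'), 1, 16))::bit(64)::bigint;

A bigint index entry is a fraction of the size of a varchar(32) one.
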
> 
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc are out
> of the question.
> 

If you are going to discount the feature whose implementation solves this 
specific problem, then you are basically asking the list to solve your specific 
problem while, per my comment above, not providing sufficient details about how 
your application works.

Also, WTF do you mean by "etc"?  If you are going to discount something from 
consideration, you should be able to specify exactly what it is.

Furthermore, if you ask the question and exclude possible solutions you should 
explain why you cannot use them so that people will not propose other solutions 
that would have the same faults.

> With a growing web database, I am sure many people face this
> situation. Are nosql type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

You would implement these before you would partition?

There are likely multiple solutions to your problem but, again, simply giving a 
table schema doesn't help in determining which ones are feasible.

> 
> Many thanks for any ideas or pointers!
> 

The only data-ignorant, and thus generally useful, PostgreSQL solution is table 
partitioning.

Use It.
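
A bare-bones sketch of the inheritance approach (names and split point 
invented; routing inserts to the right child takes a trigger, see the 
partitioning chapter in the docs):

    -- the parent stays empty; the children hold the rows
    CREATE TABLE links_a ( CHECK (ip <  2000000000) ) INHERITS (links);
    CREATE TABLE links_b ( CHECK (ip >= 2000000000) ) INHERITS (links);

    CREATE INDEX links_a_idx ON links_a (ip, url_md5);
    CREATE INDEX links_b_idx ON links_b (ip, url_md5);

    -- with constraint_exclusion = partition (the default), a query like
    --   SELECT * FROM links WHERE ip = 999 AND url_md5 = '...'
    -- only ever touches links_a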

My other questions, while interesting thought exercises, need intimate 
knowledge of the data to even evaluate whether they make sense.

So, in short, use partitions.  If you cannot, provide reasons why and then 
include more details about the application and data so that meaningful 
solutions have a chance to be suggested.

David J.




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
