Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-24 Thread Phoenix Kiula
On Mon, Nov 21, 2011 at 6:14 PM, Tomas Vondra t...@fuzzy.cz wrote:
.

 An index on (a, b) can be used for queries involving only a but not for
 those involving only b.

 That is not true since 8.2 - a multi-column index may be used even for
 queries without conditions on leading columns. It won't be as effective as
 with conditions on leading columns, because the whole index must be
 scanned, but it's usually much cheaper than keeping two indexes (memory
 requirements, overhead when inserting data etc.)

 Check this:
 http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html




Thanks Tomas. VERY useful information.

I've decided to go with a unique multicolumn index for now.

Will ask the experts here if I see some issues..

Thanks!

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-21 Thread Tomas Vondra
On 21 Listopad 2011, 4:17, David Johnston wrote:
 On Nov 20, 2011, at 20:50, Phoenix Kiula phoenix.ki...@gmail.com wrote:

 On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
 gavinflo...@archidevsys.co.nz wrote:

 How about having 2 indexes: one on each of ip & url_md5? Pg will
 combine the
 indexes as required, or will just use one if that is best.



 Thanks Gavin. Question: what if I have a joined index? If from a
 joined index I only use the first column (say, ip) will a joined
 index still be used?

 It is cleaner to create two indexes for the two columns. Which is
 recommended?


 An index on (a, b) can be used for queries involving only a but not for
 those involving only b.

That is not true since 8.2 - a multi-column index may be used even for
queries without conditions on leading columns. It won't be as effective as
with conditions on leading columns, because the whole index must be
scanned, but it's usually much cheaper than keeping two indexes (memory
requirements, overhead when inserting data etc.)

Check this:
http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html
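As an illustration of the behaviour that doc page describes (the table and index names below are made up for the example, not from the thread):

```sql
-- Hypothetical table with a two-column unique index
CREATE TABLE links (ip bigint, url_md5 varchar(32), url text);
CREATE UNIQUE INDEX links_ip_md5 ON links (ip, url_md5);

-- Condition on the leading column: efficient index range scan
EXPLAIN SELECT * FROM links WHERE ip = 999;

-- Condition only on the trailing column: since 8.2 the planner may still
-- use links_ip_md5, but it has to scan the whole index to do so
EXPLAIN SELECT * FROM links WHERE url_md5 = 'abcdef0123456789abcdef0123456789';
```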



Tomas


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-21 Thread Tomas Vondra
 On Nov 20, 2011, at 21:33, Phoenix Kiula phoenix.ki...@gmail.com wrote:

 My big table now has about 70 million rows, with the following columns:

  alias   | character varying(35)
  url | text
  modify_date | timestamp without time zone
  ip  | bigint


 For each IP address (user of my application) I want to have a unique
 URL. So I used to have a UNIQUE constraint on IP, URL. But the index
 based on this became huge, as some URLs are gigantic. so I introduced
 an md5 of the URL:

  url_md5 | varchar(32)

I don't understand this. So you want to allow exactly one URL for an IP
address? Or do you want to allow only one row with the same (IP, URL)
values? Because that's exactly what a UNIQUE index on (IP, URL) does.

If you want to allow just a single URL for an IP, you should create an
index on IP only.

 I now have two scenarios:

 1. To have an index (unique?) on (ip, url_md5)

 2. To have an index on just the ip. This way a query that tries
 to match   ...WHERE ip = 999 AND url_md5 = 'md5 here'... will
 still look only at the ip bit of the index, then refine it with the
 url_md5.

Have you actually performed any benchmarks with this? Just create the
indexes, run several (say 100) queries and that should give you an idea
which of the options is better.

We can speculate on the pros/cons of those options, but both are viable
under certain conditions. For example if there's just a very small number
of URLs for an IP, then #2 is probably going to be better. But if there's
enormous number of URLs per IP, then a multi-column index is probably
going to perform better.

But those are only guesses - try both options, run a few queries (not on 
the overloaded system you're struggling with - that'd skew the results)
and show us EXPLAIN ANALYZE of the queries.
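For completeness, a sketch of what such a benchmark run looks like (the table name is a placeholder):

```sql
-- Repeat for each candidate indexing scheme and compare the plans and timings
EXPLAIN ANALYZE
SELECT * FROM links
WHERE ip = 999 AND url_md5 = 'md5 here';
```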

 Questions:

 1. Instead of md5, is there any integer hashing algorithm that will
 allow me to have a bigint column and save a lot hopefully in both
 storage space and speed?  (Some very useful points mentioned here:
 http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
 )

You can obviously take an MD5 hash, which is just a 16-byte value, and use
the first 4 bytes as an integer. The problem is that this significantly
increases the chance of collisions, i.e. two different URLs may end up with
the same truncated hash and be treated as the same URL.
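If the goal is a bigint rather than a 4-byte integer, one common Postgres idiom is to keep the first 8 bytes (16 hex digits) of the md5, which makes collisions far less likely than a 4-byte truncation. A sketch:

```sql
-- Reinterpret the first 16 hex digits of an md5 as a signed bigint
SELECT ('x' || substring(md5('http://example.com/some/url') from 1 for 16))
       ::bit(64)::bigint AS url_hash;
```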

 2. If I do go with the above scenario #1 of a joint index, is there
 any way I can save space and maintain speed? Partitioning etc are out
 of the question.

What is the index size, anyway? You've mentioned it's huge, but what is
the actual size?

 With a growing web database, I am sure many people face this
 situation. Are nosql type databases the only sane solution to such
 massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Well, partitioning is usually a good approach to problems like these.

Have you thought about moving the URLs into a separate table? I.e. instead
of this table

 alias   | character varying(35)
 url | text
 url_md5 | character varying(32)
 modify_date | timestamp without time zone
 ip  | bigint

you'd have two tables - one for keeping the URLs:

 id  | integer
 url | text
 url_md5 | character varying(32)

and the current one, referencing the URLs

 alias   | character varying(35)
 url_id  | integer
 modify_date | timestamp without time zone
 ip  | bigint

That'd allow you to create a UNIQUE index on (ip, url_id), which should be
much smaller than the current one. But handling the inserts would be
significantly more complex (you'd have to check existence of the URL,
insert it etc.).

Tomas


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Phoenix Kiula
Hi.

Want to start another thread, loosely related to the performance
problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:


 alias   | character varying(35)
 url | text
 modify_date | timestamp without time zone
 ip  | bigint


For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on (IP, URL). But the index
based on this became huge, as some URLs are gigantic. So I introduced
an md5 of the URL:


 url_md5 | varchar(32)


I now have two scenarios:

1. To have an index (unique?) on (ip, url_md5)

2. To have an index on just the ip. This way a query that tries
to match   ...WHERE ip = 999 AND url_md5 = 'md5 here'... will
still look only at the ip bit of the index, then refine it with the
url_md5.

The good thing about #2 is the size of index remains very small with
only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of ...WHERE ip = 999 AND
url_md5 = 'md5 here'...  will have to refine the indexed IP. If one
IP address has put in a lot of URLs, then this becomes a bit slow. As
is now happening, where I have users who have over 1 million URLs
each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and save a lot hopefully in both
storage space and speed?  (Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)

2. If I do go with the above scenario #1 of a joint index, is there
any way I can save space and maintain speed? Partitioning etc are out
of the question.

With a growing web database, I am sure many people face this
situation. Are nosql type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Many thanks for any ideas or pointers!

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Phoenix Kiula
On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula phoenix.ki...@gmail.com wrote:
 Hi.

 Want to start another thread, loosely related to the performance
 problems thread I have going.

 Need some DB design guidance from the gurus here.

 My big table now has about 70 million rows, with the following columns:


  alias           | character varying(35)
  url             | text
  modify_date     | timestamp without time zone
  ip              | bigint


 For each IP address (user of my application) I want to have a unique
 URL. So I used to have a UNIQUE constraint on IP, URL. But the index
 based on this became huge, as some URLs are gigantic. so I introduced
 an md5 of the URL:


  url_md5             | varchar(32)


 I now have two scenarios:

 1. To have an index (unique?) on (ip, url_md5)

 2. To have an index on just the ip. This way a query that tries
 to match   ...WHERE ip = 999 AND url_md5 = 'md5 here'... will
 still look only at the ip bit of the index, then refine it with the
 url_md5.

 The good thing about #2 is the size of index remains very small with
 only a bigint field (ip) being indexed.

 The bad thing about #2 is that each query of ...WHERE ip = 999 AND
 url_md5 = 'md5 here'...  will have to refine the indexed IP. If one
 IP address has put in a lot of URLs, then this becomes a bit slow. As
 is now happening, where I have users who have over 1 million URLs
 each!

 Questions:

 1. Instead of md5, is there any integer hashing algorithm that will
 allow me to have a bigint column and save a lot hopefully in both
 storage space and speed?  (Some very useful points mentioned here:
 http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
 )

 2. If I do go with the above scenario #1 of a joint index, is there
 any way I can save space and maintain speed? Partitioning etc are out
 of the question.

 With a growing web database, I am sure many people face this
 situation. Are nosql type databases the only sane solution to such
 massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

 Many thanks for any ideas or pointers!




I thought of adding a bigserial (serial8) column instead of
varchar(32) for the md5. But postgresql tells me that:

--
ERROR:  type "bigserial" does not exist
--

Why is this? Why can't I create a column with this type? What's the
current syntax?

Thanks.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Gregg Jaskiewicz
partition your table if it is too big.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread David Johnston
On Nov 20, 2011, at 8:33, Phoenix Kiula phoenix.ki...@gmail.com wrote:

 Hi.
 
 Want to start another thread, loosely related to the performance
 problems thread I have going.
 
 Need some DB design guidance from the gurus here.
 
 My big table now has about 70 million rows, with the following columns:
 
 
 alias   | character varying(35)
 url | text
 modify_date | timestamp without time zone
 ip  | bigint
 
 

While the schema is useful, you also need to describe HOW the data is being
used if you want help finding ways to improve performance.

 For each IP address (user of my application) I want to have a unique
 URL. So I used to have a UNIQUE constraint on IP, URL.

Given a base URL, can you encode an algorithm to generate the user-specific
URL on-demand, then maybe cache that result in the application?

 But the index
 based on this became huge, as some URLs are gigantic. so I introduced

What does this mean?  Are there any patterns to the URLs that you can leverage
(like, say, grouping them by domain name)?  Is there a lot of overlap between
users, so that having a URL table with a bigint PK would make a difference?


 an md5 of the URL:
 
 
 url_md5 | varchar(32)
 
 
 I now have two scenarios:
 
 1. To have an index (unique?) on (ip, url_md5)
 
 2. To not have an index on just the ip. This way a query that tries
 to match   ...WHERE ip = 999 AND url_md5 = 'md5 here'... will
 still look only at the ip bit of the index, then refine it with the
 url_md5.
 
 The good thing about #2 is the size of index remains very small with
 only a bigint field (ip) being indexed.
 
 The bad thing about #2 is that each query of ...WHERE ip = 999 AND
 url_md5 = 'md5 here'...  will have to refine the indexed IP. If one
 IP address has put in a lot of URLs, then this becomes a bit slow. As
 is now happening, where I have users who have over 1 million URLs
 each!

Create an additional partial index on the URL for any IP address with more than
X number of records?  Your smallish users then only need the IP index, while
the big ones use that PLUS their personal URL index.

 
 Questions:
 
 1. Instead of md5, is there any integer hashing algorithm that will
 allow me to have a bigint column and save a lot hopefully in both
 storage space and speed?  (Some very useful points mentioned here:
 http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
 )
 
 2. If I do go with the above scenario #1 of a joint index, is there
 any way I can save space and maintain speed? Partitioning etc are out
 of the question.
 

If you are going to discount the feature whose implementation solves this 
specific problem then you are basically asking the list to solve your specific 
problem and, from my comment above, to do so without providing sufficient 
details as to how your application works.

Also, WTF do you mean by "etc."?  If you are going to discount something from
consideration you should be able to specify exactly what it is.

Furthermore, if you ask the question and exclude possible solutions you should 
explain why you cannot use them so that people will not propose other solutions 
that would have the same faults.

 With a growing web database, I am sure many people face this
 situation. Are nosql type databases the only sane solution to such
 massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

You would implement these before you would partition?

There are likely multiple solutions to your problem but, again, simply giving a
table schema doesn't help in determining which ones are feasible.

 
 Many thanks for any ideas or pointers!
 

The only data-ignorant, and thus generally useful, PostgreSQL solution is table
partitioning.

Use it.

My other questions, while an interesting thought exercise, need intimate 
knowledge of the data to even evaluate if they make sense.

So, in short, use partitions.  If you cannot, provide reasons why and then 
include more details about the application and data so that meaningful 
solutions have a chance to be suggested.

David J.




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Adrian Klaver
On Sunday, November 20, 2011 7:12:59 am Phoenix Kiula wrote:
 On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula phoenix.ki...@gmail.com 
wrote:

 
 I thought of adding a bigserial (serial8) column instead of
 varchar(32) for the md5. But postgresql tells me that:
 
 --
 ERROR:  type bigserial does not exist
 --
 
 Why is this? Why can't I create a column with this type? Whats the
 current syntax?

bigserial is not a type so much as a macro that creates a bigint column with
an attached sequence.

Example:

test(5432)aklaver=> \d pk_test
     Table "public.pk_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 fld_1  | text    |
Indexes:
    "pk" PRIMARY KEY, btree (id)

test(5432)aklaver=> ALTER TABLE pk_test ADD COLUMN bg bigserial;
NOTICE:  ALTER TABLE will create implicit sequence "pk_test_bg_seq" for serial
column "pk_test.bg"
ALTER TABLE
test(5432)aklaver=> \d pk_test
                       Table "public.pk_test"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null
 fld_1  | text    |
 bg     | bigint  | not null default nextval('pk_test_bg_seq'::regclass)
Indexes:
    "pk" PRIMARY KEY, btree (id)

 
 Thanks.

-- 
Adrian Klaver
adrian.kla...@gmail.com

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Gavin Flower

On 21/11/11 02:33, Phoenix Kiula wrote:

Hi.

Want to start another thread, loosely related to the performance
problems thread I have going.

Need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:


  alias   | character varying(35)
  url | text
  modify_date | timestamp without time zone
  ip  | bigint


For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on IP, URL. But the index
based on this became huge, as some URLs are gigantic. so I introduced
an md5 of the URL:


  url_md5 | varchar(32)


I now have two scenarios:

1. To have an index (unique?) on (ip, url_md5)

2. To have an index on just the ip. This way a query that tries
to match   ...WHERE ip = 999 AND url_md5 = 'md5 here'... will
still look only at the ip bit of the index, then refine it with the
url_md5.

The good thing about #2 is the size of index remains very small with
only a bigint field (ip) being indexed.

The bad thing about #2 is that each query of ...WHERE ip = 999 AND
url_md5 = 'md5 here'...  will have to refine the indexed IP. If one
IP address has put in a lot of URLs, then this becomes a bit slow. As
is now happening, where I have users who have over 1 million URLs
each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and save a lot hopefully in both
storage space and speed?  (Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)

2. If I do go with the above scenario #1 of a joint index, is there
any way I can save space and maintain speed? Partitioning etc are out
of the question.

With a growing web database, I am sure many people face this
situation. Are nosql type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDb's MemBase)?

Many thanks for any ideas or pointers!

How about having 2 indexes: one on each of ip & url_md5? Pg will combine
the indexes as required, or will just use one if that is best.
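In SQL terms this suggestion is simply (names illustrative):

```sql
CREATE INDEX links_ip  ON links (ip);
CREATE INDEX links_md5 ON links (url_md5);

-- For WHERE ip = ... AND url_md5 = ... the planner can combine both
-- indexes with a BitmapAnd, or use just one of them if that is cheaper.
```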


Why don't you have a time zone on your timestamp???


Regards,
Gavin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Phoenix Kiula
On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:

 How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
 indexes as required, or will just use one if that is best.



Thanks Gavin. Question: what if I have a joined index? If from a
joined index I only use the first column (say, ip) will a joined
index still be used?

It is cleaner to create two indexes for the two columns. Which is recommended?

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Ondrej Ivanič
Hi,

On 21 November 2011 00:33, Phoenix Kiula phoenix.ki...@gmail.com wrote:
 Hi.

 Want to start another thread, loosely related to the performance
 problems thread I have going.

 Need some DB design guidance from the gurus here.

 My big table now has about 70 million rows, with the following columns:

You can put different tables/indexes on different disks using
tablespaces. For example, one tablespace for all tables and another
one for all indexes (and change random_page_cost and seq_page_cost
appropriately, i.e. lower random_page_cost). It is a good idea to put
pg_xlog on a separate drive too.
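A minimal sketch of the tablespace part, assuming a second disk is mounted with a directory the postgres OS user owns (paths and names are made up):

```sql
-- Directory on the separate disk, owned by the postgres OS user
CREATE TABLESPACE fast_idx LOCATION '/mnt/disk2/pg_indexes';

-- Place an index there
CREATE INDEX links_ip ON links (ip) TABLESPACE fast_idx;

-- Cost settings can then be tuned, e.g. for a fast dedicated index disk:
SET random_page_cost = 2.0;
```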

Sometimes Postgres just can't utilise all available resources properly,
and you can get results faster by running a query over multiple
connections. It could be worth investigating pgpool-II's parallel
query mode. You don't need multiple servers - just set up multiple PG
instances on the same physical machine (up to one PG instance per
core, with lower shared_mem, ...). An alternative could be a parallel DB
like Postgres-XC (http://wiki.postgresql.org/wiki/Postgres-XC), ...

-- 
Ondrej Ivanic
(ondrej.iva...@gmail.com)

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread David Johnston
On Nov 20, 2011, at 20:50, Phoenix Kiula phoenix.ki...@gmail.com wrote:

 On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
 gavinflo...@archidevsys.co.nz wrote:
 
 How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
 indexes as required, or will just use one if that is best.
 
 
 
 Thanks Gavin. Question: what if I have a joined index? If from a
 joined index I only use the first column (say, ip) will a joined
 index still be used?
 
 It is cleaner to create two indexes for the two columns. Which is recommended?
 

An index on (a, b) can be used for queries involving only a but not for those 
involving only b.

Neither is recommended - both have their benefits and limitations.

David J.





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...

2011-11-20 Thread Gavin Flower

On 21/11/11 14:50, Phoenix Kiula wrote:

On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
gavinflo...@archidevsys.co.nz  wrote:


How about having 2 indexes: one on each of ip & url_md5? Pg will combine the
indexes as required, or will just use one if that is best.



Thanks Gavin. Question: what if I have a joined index? If from a
joined index I only use the first column (say, ip) will a joined
index still be used?
It will use the joined index if the planner thinks it is worth it.
However, it is harder for the planner to justify using the index for the
second field when the query does not restrict on the first field (I am
assuming it can, if required).




It is cleaner to create two indexes for the two columns. Which is recommended?
If you are frequently just using one or the other of the indexes, and/or
could effectively use a joined index in both directions, then separate
indexes will probably be better.


If your predominant query mode can just use the one joined index
effectively, then that would be better.


Consider the amount of RAM the indexes and table data will take up.

The advantages of indexing 2 fields separately, compared to one joined
index, are: if you only need either single-field index, it will take
up less RAM and also be quicker to read from disk.  Plus the 2 single-field
indexes can be used together for queries that use both fields.
The costs are that when both indexes need to be used, there is a little
bit more processing involved, and 2 single-field indexes take up more
RAM than a single joined index.


So the answer is 'it depends...'!




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general