[GENERAL] GiST indices and statistical approximations

2008-08-05 Thread Rajarshi Guha

Hi, I was reading the Wikipedia entry on GiST (http://en.wikipedia.org/wiki/GiST) and it mentions


"""
Although originally designed for answering Boolean selection queries,  
GiST can also support nearest-neighbor search, and various forms of  
statistical approximation over large data sets.

"""

Could anybody provide pointers to papers or online pages that discuss  
the use of GiST indexes for statistical approximations?


Thanks,

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
---
A memorandum is written not to inform the reader,
but to protect the writer.
-- Dean Acheson




Re: [GENERAL] is a 'pairwise' possible / feasible in SQL?

2008-08-04 Thread Rajarshi Guha


On Aug 4, 2008, at 4:55 PM, Francisco Reyes wrote:

On 2:08 pm 08/04/08 Rajarshi Guha <[EMAIL PROTECTED]> wrote:

pair        count
---------   -----
123 & 456   1
667 & 879   2





select a.cid as ac, b.cid as bc, count(*)
from aic_cid a
  left outer join aic_cid b on a.cid <> b.cid and a.id = b.id
where b.cid is not null
group by a.cid, b.cid
order by a.cid;
 ac  | bc  | count
-----+-----+-------
 123 | 456 | 1
 123 | 667 | 1
 123 | 878 | 1
 123 | 879 | 1
 456 | 123 | 1
 456 | 878 | 1
 667 | 123 | 1
 667 | 879 | 2
 667 | 999 | 1
 878 | 123 | 1
 878 | 456 | 1
 879 | 123 | 1
 879 | 667 | 2
 879 | 999 | 1
 999 | 667 | 1
 999 | 879 | 1

Is that what you are looking for?


Thanks a lot - this is very close. Ideally, I'd want unique pairs, so  
the row


879 | 999 | 1

is the same as

999 | 879 | 1

Can these duplicates be avoided?
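
A hedged sketch of one way to do that, building on the query above (untested):
using a.cid < b.cid instead of a.cid <> b.cid emits each unordered pair only
once, and the outer join plus NULL filter are then no longer needed.

select a.cid as ac, b.cid as bc, count(*)
from aic_cid a
  join aic_cid b on a.cid < b.cid and a.id = b.id
group by a.cid, b.cid
order by a.cid;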

-------
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
---
How I wish I were what I was when I wished I were what I am.




[GENERAL] is a 'pairwise' possible / feasible in SQL?

2008-08-04 Thread Rajarshi Guha

Hi, I have a table of the form

aid cid
-    -
1  123
2  456
3  667
3  879
3  123
4  878
4  456
4  123
5  999
5  667
5  879

My goal is to identify for each pair of cid values, the number of  
times they have the same aid


Thus for example I would have

pair        count
---------   -----
123 & 456   1
667 & 879   2
...

I currently do this by using a Python script to do a pairwise lookup, as

select count(aid) where cid = 123 and cid = 456;

but I was wondering whether I could construct a single SQL statement  
to do this.


Any pointers would be appreciated,

Thanks,
---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
---
All great discoveries are made by mistake.
-- Young




Re: [GENERAL] a SQL query question

2008-07-28 Thread Rajarshi Guha


On Jul 28, 2008, at 10:18 PM, Rajarshi Guha wrote:

aid  pid   nmol
---  ---   ----
3    23    100
2    45    3445
4    78    12

I can easily do the first step, but am struggling to make the SQL  
for the second step. Any pointers would be appreciated


Thanks to the posters for helpful solutions

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
---
Alcohol, an alternative to your self
- 'Alcohol' by the Bare Naked Ladies




[GENERAL] a SQL query question

2008-07-28 Thread Rajarshi Guha

Hi, I have a table of the form

aid  pid   nmol
---  ---   ----
1    23    34
2    45    3445
3    23    100
4    78    12
5    45    14
6    45    200
7    null  null

In general, aid is unique, pid and nmol are non-unique.

What I'm trying to do is to select those rows where pid is not null,  
grouped by pid. So I'd get the following


aid  pid   nmol
---  ---   ----
1    23    34
3    23    100
2    45    3445
5    45    14
6    45    200
4    78    12

From within each group I'd like to select the row that has the  
maximum value of nmol. So I'd end up with


aid  pid   nmol
---  ---   ----
3    23    100
2    45    3445
4    78    12

I can easily do the first step, but am struggling to make the SQL for  
the second step. Any pointers would be appreciated
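
For reference, a minimal sketch of the second step using PostgreSQL's DISTINCT ON
(untested; "mytable" is a placeholder, since the real table name isn't given above):

-- keep one row per pid, and make that row the one with the largest nmol
select distinct on (pid) aid, pid, nmol
from mytable
where pid is not null
order by pid, nmol desc;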




---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
---
"whois awk?", sed Grep.




Re: [GENERAL] PL/pgsql function handle CUBE values

2008-01-16 Thread Rajarshi Guha


On Jan 16, 2008, at 3:41 PM, Colin Wetherbee wrote:


Rajarshi Guha wrote:
Hi, I am trying to write a PL/pgsql function that will take a CUBE  
variable (which will be a 1D point) and a double precision variable.

If the input CUBE is defined as
'(x,y,z)'::cube
the function would then return a CUBE value of the form
'(x+R,y+R,z+R),(x-R,y-R,z-R)'::cube
where R is the second argument.
The problem I'm having is actually adding R to the individual
components of the CUBE variable. I can't cast CUBE to float[] and
I don't see any way to get at the individual components of the CUBE.


I haven't tested this, but it looks like you can use cube_subset()  
to do that.


From [0]:

  cube_subset(cube, int[]) returns cube

Further, it looks like you're actually trying to create a function  
that will do precisely what cube_enlarge() does.


Indeed! Sorry for not giving the docs a thorough reading.

Thanks for the pointer
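
For the archives, a minimal example of cube_enlarge() doing the radius-R
expansion described above (R = 0.5, 3 dimensions):

select cube_enlarge('(1, 2, 3)'::cube, 0.5, 3);
-- should yield the box running from (0.5, 1.5, 2.5) to (1.5, 2.5, 3.5)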

-------
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
...but there was no one in it...
- RG





[GENERAL] PL/pgsql function handle CUBE values

2008-01-16 Thread Rajarshi Guha
Hi, I am trying to write a PL/pgsql function that will take a CUBE  
variable (which will be a 1D point) and a double precision variable.


If the input CUBE is defined as

'(x,y,z)'::cube

the function would then return a CUBE value of the form

'(x+R,y+R,z+R),(x-R,y-R,z-R)'::cube

where R is the second argument.

The problem I'm having is actually adding R to the individual
components of the CUBE variable. I can't cast CUBE to float[] and I
don't see any way to get at the individual components of the CUBE.


Any pointers would be appreciated.

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
"355/113 -- Not the famous irrational number PI,
but an incredible simulation!"





[GENERAL] looping over the rows in a table

2007-11-09 Thread Rajarshi Guha

Hi, this is slightly offtopic, but is based on Postgres:

I have a table with 10M rows and I have a Python script using psycopg  
that needs to look at each row of the table. My current strategy is  
to do in the Python script


cursor.execute("select acol from atable")
while True:
   ret = cursor.fetchone()
   if not ret: break

However if I understand correctly Postgres will basically try and  
return *all* the rows of the table as the result set, thus taking a  
long time and probably running out of memory.


Is there a way I can modify the SQL or do something on the Postgres  
side, so that I can loop over all the rows in the table?
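
One SQL-level sketch is to declare a cursor inside a transaction and fetch it in
batches; psycopg can also wrap the same mechanism with a server-side ("named")
cursor if the version in use supports it. The cursor name and batch size below
are arbitrary:

begin;
declare acol_cur cursor for select acol from atable;
-- repeat the FETCH, processing each batch in the script, until it returns no rows
fetch forward 1000 from acol_cur;
close acol_cur;
commit;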


Thanks,

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
A bug in the hand is better than one as yet undetected.





Re: [GENERAL] keeping an index in memory

2007-10-22 Thread Rajarshi Guha


On Oct 21, 2007, at 12:56 PM, Gregory Stark wrote:


"Rajarshi Guha" <[EMAIL PROTECTED]> writes:


The table itself is about 10M rows corresponding to 14GB.


Each row is on average 1.4kB ?


Yes, though some rows may be tens of kB.


Perhaps you should send more details of the
table definition and the typical size of each column. It's possible  
you have
the columns you're selecting on being stored out of line  
("toasted") which
would hurt performance if you're often accessing many of those  
columns.


The SELECT is simply selecting the cube column and another text  
column which just contains ID's. One of the columns can be large and  
does involve toasting - but that column is never selected in the query


However, after running a few random queries, it appears that the  
index is loaded into memory after which all my queries (I tested  
10,000) run in under 10 sec. Even for relatively large bounding boxes  
I get under 60 sec query times.


Thanks to everybody for pointing me in the right direction!

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.





Re: [GENERAL] keeping an index in memory

2007-10-21 Thread Rajarshi Guha


On Oct 21, 2007, at 10:40 AM, Martijn van Oosterhout wrote:


On Sun, Oct 21, 2007 at 07:36:00AM -0400, Bill Moran wrote:
What version of PG are you using and what is your shared_buffers  
setting?


With 8G of RAM, you should start with shared_buffers around 2 -  
3G, if

you're using a modern version of PG.  With that much shared memory, a
large portion of that index should stay in RAM, as long as it's being
used often enough that PG doesn't swap it for other data.


With that much memory, the index is likely to remain in memory no
matter what size shared_memory he has. Anything in shared_memory is
going to be in the system cache anyway. I wonder if there's something
else we haven't been told, like how big the actual table is and whether
there are any other large tables/indexes.


The table itself is about 10M rows corresponding to 14GB. The only  
other index on this table is a btree index whose size is ~300MB. The  
machine is not running anything else.


Now, it might just be the case that, given the size of the index, I
cannot make bounding box queries (which will use the CUBE index) go
any faster. But I am surprised that the other type of query
(using cube_distance, which by definition must use a seq scan) takes only
slightly longer. If nothing else, scanning through 14GB of data
should be 3 times slower than scanning through 3GB of data.



-------
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
All laws are simulations of reality.
-- John C. Lilly





Re: [GENERAL] keeping an index in memory

2007-10-21 Thread Rajarshi Guha


On Oct 21, 2007, at 7:36 AM, Bill Moran wrote:


Rajarshi Guha <[EMAIL PROTECTED]> wrote:


Hi, relating to my previous queries on doing spatial searches on 10M
rows, it seems that most of my queries return within 2 minutes.
Generally this is not too bad, though faster is always better.

Interestingly, it appears that the CUBE index for the table in
question is about 3GB (the table itself is about 14GB). Not knowing
the details of the postgres internals, I assume that when a query
tries to use the index, it will need to read a 3GB file. Is this a
correct assumption?

In such a situation, is there a way to keep the index in memory? My
machine has 8GB installed and currently has about 7.4GB free RAM (64
bit linux 2.6.9)


Free or cached/buffered?  Your OS should be using most of that to
buffer disk blocks.


Aah, correct. Yes they are cached/buffered


A side effect of the size of the index is that if I do a query that
performs a seq scan (say using cube_distance) it takes longer than
when an index is used, but not significantly longer. And this is on a
10M row table.

What strategies do people follow when the index becomes very big?


What version of PG are you using and what is your shared_buffers  
setting?


8.2.5

My original shared_buffers setting was 128MB.


With 8G of RAM, you should start with shared_buffers around 2 - 3G, if
you're using a modern version of PG.


I can do that but I'm a little confused. Earlier postings on the list  
indicate that shared_buffers should be about 10% of the system RAM  
and that effective_cache_size can be a large fraction of RAM.


As a result I had effective_cache_size set to 2500MB
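
For concreteness, a postgresql.conf sketch of the values being discussed
(illustrative numbers only, for an 8 GB machine; the unit suffixes assume 8.2,
older versions take shared_buffers in 8 kB pages):

shared_buffers = 2GB            # the 2-3 GB range suggested above
effective_cache_size = 6GB      # rough guess at how much RAM the OS cache can offer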

Thanks for the pointers

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
How I wish I were what I was when I wished I were what I am.





[GENERAL] keeping an index in memory

2007-10-20 Thread Rajarshi Guha
Hi, relating to my previous queries on doing spatial searches on 10M  
rows, it seems that most of my queries return within 2 minutes.  
Generally this is not too bad, though faster is always better.


Interestingly, it appears that the CUBE index for the table in  
question is about 3GB (the table itself is about 14GB). Not knowing  
the details of the postgres internals, I assume that when a query  
tries to use the index, it will need to read a 3GB file. Is this a  
correct assumption?


In such a situation, is there a way to keep the index in memory? My  
machine has 8GB installed and currently has about 7.4GB free RAM (64  
bit linux 2.6.9)


A side effect of the size of the index is that if I do a query that  
performs a seq scan (say using cube_distance) it takes longer than  
when an index is used, but not significantly longer. And this is on a  
10M row table.


What strategies do people follow when the index becomes very big?

Thanks,

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
Does Ramanujan know Polish?
   --  E.B. Ross





Re: [GENERAL] speeding up CUBE queries

2007-10-08 Thread Rajarshi Guha


On Oct 8, 2007, at 9:15 PM, Oleg Bartunov wrote:


I'm not sure, but probably mtree will be better ?


Thanks - do you have any pointers to this? Google only turns up an  
mtree utility that seems to be related to file system operations


---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
Disembowelling takes guts.





[GENERAL] speeding up CUBE queries

2007-10-08 Thread Rajarshi Guha
Hi, I'm running Postgres 8.2.4 and have installed the CUBE extension.  
Using this I had a 10M row table populated with 12-dimensional zero- 
volume cubes (i.e., 12D points).


My queries are of the form

select * from ctab where '(x1,x2,x3,...,x12),  
(y1,y2,y3,...,y12)'::cube @> cubeField;


So essentially I am asking for all rows that lie within the specified  
bounding box. Now the cubeField column has a GiST index on it. As a  
result in a number of cases I can get results in less than a minute  
(and in a few cases under 15 sec).


Now 1 minute is relatively long, but it's acceptable (but faster is  
always better). However there are some instances when a query takes 4  
to 5 minutes. This is problematic, but I'm not sure I see a solution.


One thing that I did observe is that the very long (4-minute) queries  
occur when the bounding box is very densely filled with points (based  
on knowledge of the dataset). Very fast queries occur when the  
bounding box is quite sparsely filled. Now it is also true that the  
12D space is not uniformly populated, so that probably has an effect.


I have altered the statistics count on the cubeField column to 100 so  
that vacuum analyze performs larger sampling, but that doesn't seem  
to help the timings for the more extreme queries.
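
For reference, the statistics change described above was of the form below,
followed by a fresh ANALYZE so that the larger sample is actually collected:

alter table ctab alter column cubeField set statistics 100;
analyze ctab;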


Could anybody point me to a strategy to improve performance of the
index on this dataset? I have done some searching wrt GIS databases
and I found a post
(http://postgis.refractions.net/pipermail/postgis-users/2002-October/001526.html)
which talked about tuning a spatial index - but that's for 2D and it's not
clear whether those ideas were implemented (and if so, whether they could be
applied to my problem).


Are there any options to improving performance on the 12D case? Or am  
I stuck?


Thanks,

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
Accuracy, n.:
The vice of being right





[GENERAL] using Tsearch2 for chemical text

2007-07-25 Thread Rajarshi Guha
Hi, I have a table with about 9M entries. The table has 2 fields: id  
and name, which are of serial and text types respectively. I have an
ordinary index on the text field which allows me to do searches in
reasonable time. Most of my searches are of the form


select * from mytable where name ~ 'some text query'

I know that the Tsearch2 module will let me have very efficient text  
searches. But if I understand correctly, it's based on a language  
specific dictionary.


My problem is that the name column contains names of chemicals. Now  
for many cases this may simply be a number (1674-56-2) and in other  
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine  
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word  
(say viagra or calcium  chloride or pentathol).


My question is: will Tsearch2 be able to handle this type of text? Or
will it be hampered by the fact that the bulk of the rows do not
correspond to ordinary English?


---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
My Ethicator machine must have had a built-in moral
compromise spectral phantasmatron! I'm a genius."
-Calvin





Re: [GENERAL] count the number of bits set to 1 in a bit string field

2007-07-16 Thread Rajarshi Guha


On Jul 15, 2007, at 7:20 PM, Ragnar wrote:


On sun, 2007-07-15 at 15:35 -0400, Rajarshi Guha wrote:

Hi, is there a built in function that will give me the number of bits
that are set to 1 in a bit string field?


no, but it should be trivial to do with pl/pgsql


Thanks for the pointer
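
For the archives, an untested PL/pgSQL sketch along those lines (assumes a
server new enough for dollar quoting, i.e. 8.0 or later):

create or replace function nbits_set(b bit varying) returns integer as $$
declare
    n integer := 0;
begin
    -- walk the bit string and count the positions that are set
    for i in 1 .. length(b) loop
        if substring(b from i for 1) = B'1' then
            n := n + 1;
        end if;
    end loop;
    return n;
end;
$$ language plpgsql immutable;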

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
Gravity brings me down.





[GENERAL] count the number of bits set to 1 in a bit string field

2007-07-15 Thread Rajarshi Guha
Hi, is there a built in function that will give me the number of bits  
that are set to 1 in a bit string field?


Thanks,

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
Without love intelligence is dangerous;
without intelligence love is not enough.
-- Ashley Montagu





Re: [GENERAL] tables are not listable by \dt

2007-07-02 Thread Rajarshi Guha


On Jul 2, 2007, at 10:04 AM, Rajarshi Guha wrote:

Hi, we recently upgraded our Postgres installation from 7.4 to 8.2
by doing a dump and restore. However, after logging into the
database (as a user that is not the superuser) and doing \dt I get
the error:


No relations found

But when I do

SELECT relname, relpages FROM pg_class ORDER BY relpages DESC;

I get a list of the tables and their sizes.


Thanks to posters - it was indeed an error with the search path.

---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
If you don't get a good night kiss, you get Kafka dreams."
-Hobbes





[GENERAL] tables are not listable by \dt

2007-07-02 Thread Rajarshi Guha
Hi, we recently upgraded our Postgres installation from 7.4 to 8.2 by
doing a dump and restore. However, after logging into the database (as
a user that is not the superuser) and doing \dt I get the error:


No relations found

But when I do

SELECT relname, relpages FROM pg_class ORDER BY relpages DESC;

I get a list of the tables and their sizes.

I'm not even sure where to begin looking for the solution and any  
pointers would be much appreciated.


---
Rajarshi Guha  <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
---
A mathematician is a device for turning coffee into theorems.
-- P. Erdos





[GENERAL] indexing array columns

2007-04-13 Thread Rajarshi Guha
Hi, I have a table of about 10M rows. It has two columns A and B, where
A is a text field and B is a real[12] field.

Now when this table is queried it is usually of the form:

select A from theTable where sim(B, C) > 0.8

Now C will be a 12-element array specified by the user and the value 0.8
can be arbitrary. The function, sim(), is essentially a similarity
function which simply computes the inverse of the Manhattan distance
between the query array and the rows of the array column. Right now the
above query uses a seq scan.

Furthermore, the values of the individual array elements for any given
row can vary from 0 to infinity (but for most cases will be numbers less
than 1000)

My question is: how can I index the column B so that such queries are
fast.

I realize that my table is essentially a collection of 12-dimensional
points and that I could replace my similarity function with a distance
function. 

Thus my query boils down to asking 'find me rows of the table that are
within X distance of my query'

I know that the GIS community deals with 2D points, but I'm not familiar
with this area; if I understand correctly, they use Euclidean
distances, whereas I need Manhattan distances.

What type of indexing, available in Postgres could be used for my
problem? Would it require me to implement my own indexing scheme?

Any pointers would be appreciated
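
In case it helps someone searching the archives, a hedged sketch using the
contrib cube module (it assumes a cube-typed copy of column B, here called
bcube, which is a made-up name): index the 12-D points with GiST and turn
"within X of the query" into a bounding-box test, then apply the exact
Manhattan-distance check to the few rows that survive.

create index thetable_bcube_idx on theTable using gist (bcube);

-- every point within Manhattan distance 0.8 of the query point also lies inside
-- the box of half-width 0.8 around it, so the GiST index can prefilter the rows
select A
from theTable
where cube_enlarge('(1,2,3,4,5,6,7,8,9,10,11,12)'::cube, 0.8, 12) @> bcube;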

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
Every little picofarad has a nanohenry all its own.
-- Don Vonada





[GENERAL] passing an array type to a plpython procedure

2007-03-11 Thread Rajarshi Guha
Hi, I have a table in which a column is of type real[12]

I also have a function written in pl/Python that will accept two
arguments of type real[12].

However when I send in the values by doing:

select sim(ARRAY[1,1,1,1,1,1,1,1,1,1,1,1], ARRAY[1,1,1,1,1,1,1,1,1,1,1,1])

I get an error indicating that the input args are of type 'str' rather
than a list type as I would have expected. I found this posting
(http://www.thescripts.com/forum/thread399934.html) from 2005, which
indicates that a pl/Python procedure will not see an array type as array
but rather as a string.

Has this aspect been updated? My current code parses the string, to get
at the array elements (which is obviously prone to failure etc etc) and
I'm sure I could get a bit of a speed up if I could avoid this.  My code
looks like:

create or replace function momsim3d(real[12], real[12]) returns real as
'
query = args[0]
target = args[1]

query = query[1:len(query)-1]
target = target[1:len(target)-1]


query = [float(x) for x in query.split(",")]
target = [float(x) for x in target.split(",")]

sum = 0.0
for i in range(0,12): sum += abs(query[i] - target[i])
return 1.0/(1.0+sum/12.0)
' language plpythonu;

Ubuntu 6.10
Postgresql 7.4

Thanks,

-------
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
All the evidence concerning the universe has not yet been collected,
so there's still hope.





Re: [GENERAL] suggestions on improving a query

2007-02-14 Thread Rajarshi Guha
On Wed, 2007-02-14 at 10:55 -0500, Tom Lane wrote:
> Rajarshi Guha <[EMAIL PROTECTED]> writes:
> > Clearly a big improvement in performance.
> 
> Huh?  It looks like exactly the same plan as before.  Any improvement
> you're seeing must be coming from cache effects.

Well the new run was done nearly 8 hours after the initial one - I
would've thought that the cache had been purged (?)

> > It looks like theres a big mismatch on the expected and observed costs and 
> > times.
> 
>  In fact I'd say the whole problem here
> is that the planner is being too optimistic about the benefits of a
> fast-start plan.  For whatever reason (most likely, an unfavorable
> correlation between dock.target and dockscore_plp.total), the desired
> rows aren't uniformly scattered in the output of the join, and so it's
> taking longer than expected to find 10 of them.

Is there any way to solve this? I've increased the statistics target on
dockscore_plp.total to 100 - does going higher help?

From what you've said, it appears that the problem is arising due to
lack of correlation between two columns in two tables.

This is strange since, out of 4.6M rows in dock, ~ 960K will be selected
and the corresponding 960K rows from dockscore_plp will be ordered and
then the top 10 will be taken.

So does the lack of correlation occur due to 'ordering' in the DB
itself? And if this is the case, how does one fix the lack of
correlation (if at all possible)?

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
Regular naps prevent old age
especially if you take them while driving





Re: [GENERAL] suggestions on improving a query

2007-02-14 Thread Rajarshi Guha
On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <[EMAIL PROTECTED]> writes:
> > This line:
> > Index Scan using plp_total_idx on dockscore_plp  
> > (cost=0.00..16733229.92 rows=4669988 width=80) 
> > (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows.  However, there were only 25,000.

Sorry for not doing the obvious beforehand! I increased the statistics
target for some of the columns in some of the tables and then did a
vacuum analyze. Rerunning the query gives:

  
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)

Clearly a big improvement in performance.

(One question not directly related to the problem: when looking at the
output of explain analyze, I know that one is supposed to start at the
bottom and move up. Does that mean that the index scan on pubchem_compound is
being performed first? Or should I start from the innermost line?)

However it seems that it could still be improved:

   ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's a big mismatch between the expected and observed costs
and times.

> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order.  So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended.  The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect. 

I think that increasing the statistics has improved that.

>  For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.

This is not the case. The value '1YC1' will be associated with both high
and low values of dockscore_plp.total

What I would like my query to do is this:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Using dock.cid from the rows in 2., get the corresponding rows in
pubchem_compound

4. Sort and take the top 10 from step 2 (and associated rows in step 3)

However now that I have written this it seems that what I really want to
do is:

1. From dock.target find all rows = '1YC1'

2. Using dock.dockid of these rows, get the corresponding rows in
dockscore_plp

3. Sort and take the top 10

4. Get the corresponding rows from pubchem_compound.cid

The problem with this is that step 4 is represented by the

dock.cid = pubchem_compound.cid

clause. It seems that if I had the cid column in dockscore_plp, then I
could do a sort+limit in dockscore_plp and then simply lookup the
corresponding (10) rows in pubchem_compound (rather than looking up 960K
rows). The downside to this is that there are 4 more tables like
dockscore_plp, and I would have to add a cid column to each of them -
which seems redundant.

Is it useful to increase redundancy to improve performance?
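
For completeness, an untested sketch of the second four-step plan as a single
statement; it keeps dock.cid through the inner join, so no cid column has to be
duplicated into dockscore_plp (this assumes dockscore_plp itself has no column
named cid):

select pubchem_compound.openeye_can_smiles, pubchem_compound.nist_inchi, top10.*
from (select dock.cid, dockscore_plp.*
      from dock join dockscore_plp on dock.dockid = dockscore_plp.id
      where dock.target = '1YC1'
      order by dockscore_plp.total
      limit 10) as top10
join pubchem_compound on pubchem_compound.cid = top10.cid
order by top10.total;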

Thanks for the pointers,

---

Re: [GENERAL] suggestions on improving a query

2007-02-14 Thread Rajarshi Guha
On Tue, 2007-02-13 at 21:44 -0500, Tom Lane wrote:
> Rajarshi Guha <[EMAIL PROTECTED]> writes:
> > However the clause: 
> > dock.target = '1YC1' and
> > dock.dockid = dockscore_plp.id  
> > reduces the number of rows from 4.6M to 96K.
> 
> The planner seems to be estimating about ten times that many.  Perhaps
> increasing the statistics target for dock.target would help?

My original message had a typo: I expected that it should be ~960K, so
postgres is working as expected.

However increasing the statistics target for dock.target did lead to an
improvement in performance. Could this be because dock.target has only 5
unique values? So though the table has ~4.6M rows, each set of  ~960K
rows for dock.dockid is associated with a single value of dock.target.

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
All great ideas are controversial, or have been at one time.





[GENERAL] suggestions on improving a query

2007-02-13 Thread Rajarshi Guha
Hi, I have a query that involves 3 tables:

select pubchem_compound.openeye_can_smiles, pubchem_compound.nist_inchi, 
dock.cid, dockscore_plp.*
from dock, dockscore_plp, pubchem_compound 
where
dock.target = '1YC1' and
dock.dockid = dockscore_plp.id  and
dock.cid = pubchem_compound.cid
order by dockscore_plp.total 
limit 10;

The output of explain analyze is

 Limit  (cost=0.00..387.36 rows=10 width=297) (actual time=242977.644..462748.215 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37325186.12 rows=963575 width=297) (actual time=242977.638..462748.175 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31523550.51 rows=963575 width=90) (actual time=242900.629..461810.902 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16733229.92 rows=4669988 width=80) (actual time=98.323..322537.605 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.15 rows=1 width=18) (actual time=5.521..5.521 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..6.01 rows=1 width=216) (actual time=93.699..93.704 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 462748.439 ms
(10 rows)

Now, the tables 'dock' and 'dockscore_plp' have 4.6M rows and
'pubchem_compound' has 10M rows.

However the clause: 

dock.target = '1YC1' and
dock.dockid = dockscore_plp.id  

reduces the number of rows from 4.6M to 96K. I had figured that after
this the query would be very fast. But the explain analyze seems to
indicate that the dockscore_plp table is being sorted (using the index
plp_total_idx) in its entirety. This would then be the bottleneck

Is this a correct interpretation?

What I expected was that the sort would be applied to the 96K subset of
dockscore_plp, rather than the whole table. 

Is it possible to restructure the query such that the sort is done on
96K rows rather than 4.6M rows?

Thanks,


---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
"I'd love to go out with you, but my favorite commercial is on TV."





[GENERAL] indexing for query speed - index row size exceeding btree maximum

2006-10-05 Thread Rajarshi Guha
Hi, I have a table with 8M rows. One of the fields is of type text and I
wanted to create an index on it to improve query times. Now this field
is a single string (ie not a piece of normal text) and is really an
identifier (< 100 chars).  I envisage queries like

select cid from tableName where fieldName = 'XYZ ... ';

So I had done something like

create index someName on tableName (fieldName);

However this returned with an error:

ERROR: index row size 2848 exceeds btree maximum, 2713

I noted that some other posters have faced this problem, but most of the
replies requested info regarding what the index was to be used for. I
also tried following one example where the index was created on the MD5
hash of the field being indexed - but this did not make my query times
significantly faster.

Is there a way for me to generate an index on this field so that my
query times can be reduced?
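
For the record, the md5-hash variant mentioned above only pays off if the query
compares hashes as well, so that the planner can use the expression index (the
index name below is arbitrary):

create index tablename_fieldname_md5_idx on tableName (md5(fieldName));
select cid from tableName where md5(fieldName) = md5('XYZ ... ');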

Thanks,

-------
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
A method of solution is perfect if we can forsee from the start,
and even prove, that following that method we shall attain our aim.
-- Leibnitz





[GENERAL] dumping 8M bit fields

2006-09-21 Thread Rajarshi Guha
Hi, I have a table with 8M rows and one of the fields is a bit (1024
bits) field. I am trying to dump the bit field for 8M rows to a file:

psql -U cicc3 -A -F " " -t -o pubchem.fp -c "select cid, gfp from pubchem_compound;"

However I get 

out of memory for query result

I am surprised since I did an equivalent query but selecting varchar
field for 6.9M rows which worked fine.

The machine is a 4-cpu Opteron machine, 16GB RAM.

Does anybody know why this might be the case?
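
One thing that may be worth trying (a sketch, not tested against this data):
COPY streams the rows out instead of materializing the whole result set in the
client, either server-side to a file the server can write, or client-side via
psql's \copy. The output path here is just an example:

copy pubchem_compound (cid, gfp) to '/tmp/pubchem.fp' with delimiter ' ';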

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
A mouse is an elephant built by the Japanese.





Re: [GENERAL] getting around 'out of memory' errors

2006-08-24 Thread Rajarshi Guha
On Thu, 2006-08-24 at 14:12 -0400, Tom Lane wrote:
> Rajarshi Guha <[EMAIL PROTECTED]> writes:
> > On Thu, 2006-08-24 at 10:50 -0400, Tom Lane wrote:
> >> I think you are getting burnt by the list of pending trigger actions
> >> to check the foreign-key references in B.
> 
> > Thanks for the pointer. I've dropped the constraint and am now running
> > the INSERT.
> > However when I look at the output of top, I'm seeing that the %MEM value
> > is continuously increasing and I'm worried that I'm going to hit the
> > same problem in a few hours.
> 
> Sure you got all the FK constraints involving table B?  Do you have any
> AFTER triggers other than the FK constraints?

Yes, \d B shows no constraints

> > One thing I did not mention previously is that table A has some
> > constraints on some fields (notably field b is specified to be NOT
> > NULL). 
> 
> Plain old CHECK constraints shouldn't matter.  Tell us more about those
> functions in the SELECT though --- what are they?

That would be a problem as I have not written them. I know the INSERT
(and PL/pgSQL function) and functions f1() and f2() work on some trivial
test tables - I should probably contact the developer.

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
A bug in the hand is better than one as yet undetected.





Re: [GENERAL] getting around 'out of memory' errors

2006-08-24 Thread Rajarshi Guha
On Thu, 2006-08-24 at 10:50 -0400, Tom Lane wrote:
> Rajarshi Guha <[EMAIL PROTECTED]> writes:
> > I have a table, A with 15 fields, out of which I am interested in 2
> > fields, a and b. The table has 8,000,000 rows
> 
> > I have another table, B, which has 3 fields a, c, and d. The field a
> > references field a in table A. Table B is empty at this point.
> 
> > I tried to do an insert using the following command:
> > insert into B (a,c,d) select a, f1(b), f2(b) from A;
> 
> I think you are getting burnt by the list of pending trigger actions
> to check the foreign-key references in B.  Might be easiest to drop the
> foreign key constraint, fill table B, re-add the constraint.  I forget
> how smart 7.4 is about adding FK constraints exactly, but it shouldn't
> run out of memory anyway.

Thanks for the pointer. I've dropped the constraint and am now running
the INSERT.

However when I look at the output of top, I'm seeing that the %MEM value
is continuously increasing and I'm worried that I'm going to hit the
same problem in a few hours.

One thing I did not mention previously is that table A has some
constraints on some fields (notably field b is specified to be NOT
NULL). 

My understanding is that these constraints would not matter since I am
simply performing a SELECT - is this a correct assumption? Or should I
temporarily drop those constraints as well?
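
For anyone following the thread, the drop / bulk-load / re-add sequence Tom
suggested looks roughly like this; the constraint name is a placeholder for
whatever \d shows:

alter table B drop constraint b_a_fkey;
insert into B (a, c, d) select a, f1(b), f2(b) from A;
alter table B add constraint b_a_fkey foreign key (a) references A (a);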

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
Q: What's yellow, linear, normed and complete?
A: A Bananach space.





[GENERAL] getting around 'out of memory' errors

2006-08-24 Thread Rajarshi Guha
Hi, I've recently been using PostgreSQL and am having some trouble
performing an insert. 

My situation is this:

I have a table, A with 15 fields, out of which I am interested in 2
fields, a and b. The table has 8,000,000 rows

I have another table, B, which has 3 fields a, c, and d. The field a
references field a in table A. Table B is empty at this point.

I tried to do an insert using the following command:

insert into B (a,c,d) select a, f1(b), f2(b) from A;

where f1() and f2() are some functions. This ran for more than 5 hours
and then exited with an error:

ERROR: out of memory
DETAIL: Failed on request of size 44

I wondered whether this was because temporary columns were being
generated. So I then wrote a small PL/pgSQL function:

create or replace function fpinsert() returns void as '
declare
  arow record;
  counter integer := 0;
begin
  for arow in select * from A loop
  execute ''insert into B (a, c, d) values (''
   || quote_literal(arow.a) || '',''
   || ''f1('' || quote_literal(arow.b) || ''),''
   || ''f2('' || quote_literal(arow.b) || ''));'';
  counter := counter + 1;
  end loop;
  return;
end;
' language 'plpgsql';

I assumed that this would run (possibly slower) since it would loop over
the rows one by one. However this also failed with an out of memory
error (same as above, though the request size was 29).

Am I doing anything obviously wrong? I don't really care that the insert
takes a long time as it's a one time thing. But I can't see why my
PL/pgSQL function will also run out of memory, since it's operating row
by row.

Any pointers would be greatly appreciated.

The machine I'm running on has 16GB of RAM and I'm running PostgreSQL
7.4.13 on RedHat Enterprise Linux

Thanks,

---
Rajarshi Guha <[EMAIL PROTECTED]>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
All great ideas are controversial, or have been at one time.


