[GENERAL] histogram

2011-04-30 Thread Joel Reymont
I have a column of 2 million float values from 0 to 1.

I would like to figure out how many values fit into buckets spaced by 0.10, 
e.g. from 0 to 0.10, from 0.10 to 0.20, etc.

What is the best way to do this?

Thanks, Joel

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] histogram

2011-04-30 Thread Joel Reymont
Thank you Thomas!

Is there a way for the code below to determine the number of rows in the table 
and use it?

Thanks, Joel

On Apr 30, 2011, at 5:48 PM, Thomas Markus wrote:

 Hi,
 
 try something like this:
 
 select
trunc(random() * 10.)/10.
, count(*)
 from
generate_series(1,200)
 group by 1 order by 2

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] histogram

2011-04-30 Thread Joel Reymont
What is the meaning of 

group by 1 order by 2

e.g. what to the numbers 1 and 2 stand for?

What would change if I do the following?

group by 1 order by 1

On Apr 30, 2011, at 5:48 PM, Thomas Markus wrote:

 Hi,
 
 try something like this:
 
 select
trunc(random() * 10.)/10.
, count(*)
 from
generate_series(1,200)
 group by 1 order by 2

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] histogram

2011-04-30 Thread Joel Reymont
I think this should do what I want

select trunc(distance * 10.)/10., count(*)
from doc_ads
group by 1 order by 1

  Thanks, Joel


--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] optimizing a cpu-heavy query

2011-04-27 Thread Joel Reymont
Tom,

On Apr 26, 2011, at 5:00 PM, Tom Lane wrote:

 For another couple orders of magnitude, convert the sub-function to C code.  
 (I don't think you need
 a whole data type, just a function that does the scalar product.)

That's a 30x speedup, from 12 minutes down to 38s. Thanks Tom!

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] optimizing a cpu-heavy query

2011-04-26 Thread Joel Reymont
Folks,

I'm trying to optimize the following query that performs KL Divergence [1]. As 
you can see the distance function operates on vectors of 150 floats. 

The query takes 12 minutes to run on an idle (apart from pgsql) EC2 m1 large 
instance with 2 million documents in the docs table. The CPU is pegged at 100% 
during this time. I need to be able to both process concurrent distance queries 
and otherwise use the database.

I have the option of moving this distance calculation off of PG but are there 
other options?

Is there anything clearly wrong that I'm doing here?

Would it speed things up to make the float array a custom data type backed by C 
code?

Thanks in advance, Joel

[1] http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

---

CREATE DOMAIN topics AS float[150];
CREATE DOMAIN doc_id AS varchar(64);

CREATE TABLE docs
(
  id  serial,
  doc_id  doc_id NOT NULL PRIMARY KEY,
  topics  topics NOT NULL
);

CREATE OR REPLACE FUNCTION docs_within_distance(vec topics, threshold float) 
RETURNS TABLE(id doc_id, distance float) AS $$
BEGIN
RETURN QUERY
SELECT * 
FROM (SELECT doc_id, (SELECT sum(vec[i] * ln(vec[i] / topics[i])) 
  FROM generate_subscripts(topics, 1) AS i
  WHERE topics[i]  0) AS distance
  FROM docs) AS tab
WHERE tab.distance = threshold;
END;
$$ LANGUAGE plpgsql;


--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] tuning on ec2

2011-04-26 Thread Joel Reymont
I'm running pgsql on an m1.large EC2 instance with 7.5gb available memory. 

The free command shows 7gb of free+cached. My understand from the docs is that 
I should dedicate 1.75gb to shared_buffers (25%) and set effective_cache_size 
to 7gb. 

Is this correct? I'm running 64-bit Ubuntu 10.10, e.g. 

Linux ... 2.6.35-28-virtual #50-Ubuntu SMP Fri Mar 18 19:16:26 UTC 2011 x86_64 
GNU/Linux

Thanks, Joel

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tuning on ec2

2011-04-26 Thread Joel Reymont

On Apr 26, 2011, at 4:31 PM, Scott Marlowe wrote:

 It's a reasonable start.  However, if you consistently using less than
 that in aggregate then lowering it is fine.

Is there a way to tell if I consistently use less than that in aggregate?

 What's your work_mem and max_connections set to?

I have the default settings, e.g. work_mem = 1MB and max_connections = 100.

I'm looking to process 400 requests per second, though. What should I use for 
the above?

Thanks, Joel

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general