[SQL] A DISTINCT problem removing duplicates

2008-12-09 Thread Richard Huxton
The scenario: a list of documents, some of which may be (near) duplicates of
others; one document can belong to many duplicate-sets and a duplicate-set can
contain many documents.

We want to see a list with only one document (any one) from each
duplicate set. There's an example script attached.

So:

documents  (docid SERIAL, title text, PRIMARY KEY (docid));
duplicates (docid int REFERENCES documents, dup_set SERIAL,
PRIMARY KEY (docid, dup_set));

This allows one document to be part of multiple duplicate sets, but
that's fine - this is a "fuzzy" match. If two documents match and one of
them is already in the duplicates table then I add the second with the
same dup_set value. If neither is present, generate a new set number.
Match documents in a well-defined order and it's all nice and simple.
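
A minimal sketch of that insert logic (7 and 42 are placeholder docids; say
document 42 has just been found to match document 7):

-- case 1: document 7 is already in the duplicates table - put 42 in the same set(s)
INSERT INTO duplicates (docid, dup_set)
SELECT 42, dup_set FROM duplicates WHERE docid = 7;

-- case 2: neither document is in a set yet - open a new set holding both
SELECT nextval('duplicates_dup_set_seq');  -- suppose this returns 9
INSERT INTO duplicates (docid, dup_set) VALUES (7, 9), (42, 9);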

A self-join on the duplicates table gives me a count of how many
duplicates each document has. A left-join from the documents table can
list documents and if/how many duplicates they have. The problem comes
when I don't want to see duplicates:

SELECT DISTINCT ON (dup_set)
ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
documents d
LEFT JOIN
(
SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
FROM duplicates dup1
JOIN duplicates dup2 USING (dup_set)
GROUP BY dup1.docid, dup1.dup_set
) ds
USING (docid)
ORDER BY dup_set, docid
;

Documents without duplicates have a NULL dup_set. The DISTINCT ON
considers two nulls to be equal, which means we only ever see one
unduplicated document.
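
A tiny standalone illustration of that NULL behaviour:

SELECT DISTINCT ON (x) x, y
FROM (VALUES (3, 4), (NULL::int, 1), (NULL, 2)) AS t(x, y)
ORDER BY x, y;
-- returns (3,4) and only one of the NULL rows, because the NULLs are "not distinct"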

I've got two work-arounds. The first is to create a separate sequence
that doesn't overlap with dup_set's values and use that:

CREATE SEQUENCE not_duplicate_seq MINVALUE -99 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set)
COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
...

That works, but is arguably a bit too "clever" if you know what I mean.

The other alternative is to separate duplicated and non-duplicated
documents and UNION them. That's simple enough to see what's happening
but does seem ugly.

Anyone got anything more elegant? I'm happy to alter the duplicates
table so long as it doesn't make it complicated to update.

-- 
  Richard Huxton
  Archonet Ltd
BEGIN;

CREATE SCHEMA duptest;

SET search_path = duptest;

CREATE TABLE documents (
	docid SERIAL NOT NULL,
	title text,
	PRIMARY KEY (docid)
);

CREATE TABLE duplicates (
	docid   int NOT NULL REFERENCES documents ON DELETE CASCADE,
	dup_set SERIAL NOT NULL,
	PRIMARY KEY (docid, dup_set)
);

-- Six documents
INSERT INTO documents (docid, title)
SELECT i, 'document number ' || i FROM generate_series(1, 6) i;

-- duplicates are (1,3) and (2,4) - 5,6 are not
INSERT INTO duplicates (docid, dup_set)
SELECT i, (i % 2)+1 FROM generate_series(1, 4) i;

SELECT setval('documents_docid_seq', (SELECT max(docid)+1 FROM documents));
SELECT setval('duplicates_dup_set_seq', (SELECT max(dup_set)+1 FROM duplicates));

-- This is a list of all documents with how many duplicates they have
SELECT dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
	documents d
LEFT JOIN
	(
		SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
		FROM duplicates dup1
		JOIN duplicates dup2 USING (dup_set)
		GROUP BY dup1.docid, dup1.dup_set
	) ds
USING (docid)
ORDER BY dup_set, docid
;

-- This DOESN'T work because nulls are considered equal by DISTINCT ON
SELECT DISTINCT ON (dup_set) 
	dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
	documents d
LEFT JOIN
	(
		SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
		FROM duplicates dup1
		JOIN duplicates dup2 USING (dup_set)
		GROUP BY dup1.docid, dup1.dup_set
	) ds
USING (docid)
ORDER BY dup_set, docid
;

-- Workaround: fake a unique id for non-duplicates
CREATE SEQUENCE not_duplicate_seq MINVALUE -99 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set) 
	COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
	d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
	documents d
LEFT JOIN
	(
		SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
		FROM duplicates dup1
		JOIN duplicates dup2 USING (dup_set)
		GROUP BY dup1.docid, dup1.dup_set
	) ds
USING (docid)
ORDER BY dup_set, docid
;

-- Second alternative - UNION duplicated and non-duplicated documents
SELECT DISTINCT ON (dup_set) 
	ds.dup_set, d.docid, d.title, num_dups
FROM
	documents d
LEFT JOIN
	(
		SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
		FROM duplicates dup1
		JOIN duplicates dup2 USING (dup_set)
		GROUP BY dup1.docid, dup1.dup_set
	) ds
USING (docid)
WHERE num_dups > 0
UNION ALL
SELECT
	ds.dup_set, d.docid, d.title, COALESCE(num_dups,0) AS num_dups
FROM
	documents d
LEFT JOIN
	(
		SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
		FROM duplicates dup1
		JOIN duplicates dup2 USING (dup_set)
		GROUP BY dup1.docid, dup1.dup_set
	) ds
USING (docid)
WHERE ds.dup_set IS NULL
;

ROLLBACK;

Re: [SQL] DIVISION with Group By

2008-12-09 Thread Richard Huxton
hdhgdh mjhff wrote:
> Hi all,
> 
> I am new to SQL coding.
> I need some help in writing SQL.
> In my task I have to calculate the division of two sub-queries.
> Each sub-query has its own GROUP BY clause.

Can you provide a simpler example? It's not obvious to me what you're
trying to do.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] A DISTINCT problem removing duplicates

2008-12-09 Thread Tom Lane
Richard Huxton <[EMAIL PROTECTED]> writes:
> Anyone got anything more elegant?

Seems to me that no document should have an empty dup_set.  If it's not
a match to any existing document, then immediately assign a new dup_set
number to it.
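
A sketch of what that might look like at insert time (assuming documents are
added through the serial defaults):

INSERT INTO documents (title) VALUES ('some new document');
INSERT INTO duplicates (docid, dup_set)
VALUES (currval('documents_docid_seq'), nextval('duplicates_dup_set_seq'));
-- a later match then only has to merge or move rows between existing sets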

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] A DISTINCT problem removing duplicates

2008-12-09 Thread Richard Huxton
Tom Lane wrote:
> Richard Huxton <[EMAIL PROTECTED]> writes:
>> Anyone got anything more elegant?
> 
> Seems to me that no document should have an empty dup_set.  If it's not
> a match to any existing document, then immediately assign a new dup_set
> number to it.

That was my initial thought too, but it means when I actually find a
duplicate I have to decide which "direction" to renumber them in. It
also means probably keeping a summary table with counts to show which
are duplicates, since the duplicates table is now the same size as the
documents table.
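
Such a summary could be a plain aggregate over the duplicates table, something
like:

SELECT dup_set, min(docid) AS representative, count(*) - 1 AS num_dups
FROM duplicates
GROUP BY dup_set
HAVING count(*) > 1;  -- under that scheme, single-member sets aren't duplicates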


-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] A DISTINCT problem removing duplicates

2008-12-09 Thread Tom Lane
Richard Huxton <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Richard Huxton <[EMAIL PROTECTED]> writes:
>>> Anyone got anything more elegant?
>> 
>> Seems to me that no document should have an empty dup_set.  If it's not
>> a match to any existing document, then immediately assign a new dup_set
>> number to it.

> That was my initial thought too, but it means when I actually find a
> duplicate I have to decide which "direction" to renumber them in.

Hmm, so you mean you might decide that two docs are duplicates sometime
after initially putting them both in the database?  Seems like you have
issues with that anyway.  If you already know A,B are dups and
separately that C,D are dups, and you later decide B and C are dups,
what do you do?

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] A DISTINCT problem removing duplicates

2008-12-09 Thread Richard Huxton
Tom Lane wrote:
> Richard Huxton <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Richard Huxton <[EMAIL PROTECTED]> writes:
>>>> Anyone got anything more elegant?
>>> Seems to me that no document should have an empty dup_set.  If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
> 
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
> 
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database? 

Yep - checking for duplicates can be a slow process - it's O(n^2) over
the number of documents and document-comparisons are probably O(n^2)
over length (or number of similarly-sized word-runs anyway). I'm failing
comparisons as early as I can, but there's a trade-off between speed
and false negatives.

> Seems like you have
> issues with that anyway.  If you already know A,B are dups and
> separately that C,D are dups, and you later decide B and C are dups,
> what do you do?

Not necessarily a problem. I'm using "duplicate" very loosely here -
it's more like "very similar to" so it's entirely possible to have sets
(a,b) (b,c) (c,d) and everything be valid just by adding sentences to
the end of each document. Similarity scoring should allow for
insertion/deletion of single words or whole (quite extensive) blocks of
text.

Of course at the moment, as I tweak what I mean by "duplicate" I have to
re-run the check over at least a sizeable chunk of the documents to see
if I prefer it.

Oh - the comparison is outside the DB at the moment, but it's based on
the stemmed tsvector of each document anyway, so it's crying out to be
pushed into the DB once I'm happy it works.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] store pdf files

2008-12-09 Thread ivan marchesini
Hi to all...
I need to create a DB that contains links to some PDF files.
At the moment these are simple links (to files stored in the file system),
with the paths stored in a column of a dbf table...

I need to manage this data considering that the DB I'm going to create
will be moved in some months to another server, so I think that linking
to the files' positions in the file system isn't a valid solution...

Can you suggest a better way to manage this PDF data?
Each PDF is quite small (<100k).

Should I use a BLOB?

many thanks

Ivan






-- 
Please try to avoid to send me .dwg, .doc, .xls, .ppt files.
I prefer free formats.
http://it.wikipedia.org/wiki/Formato_aperto
http://en.wikipedia.org/wiki/Open_format

Ivan Marchesini
Department of Civil and Environmental Engineering
University of Perugia
Via G. Duranti 93/a 
06125
Perugia (Italy)
Founding member of GFOSS "Geospatial Free and Open Source Software"
http://www.gfoss.it
e-mail: [EMAIL PROTECTED]
[EMAIL PROTECTED]
tel: +39(0)755853760
fax (university): +39(0)755853756
fax (home): +39(0)5782830887
jabber: [EMAIL PROTECTED]


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] inconsistent automatic casting between psql and function

2008-12-09 Thread Stefano Buliani
Hello all,

I'm experiencing a strange problem with postgresql 8.3.4.

I have the following table

tx_queue
txid serial
securityid integer
portfolioid integer
datequeued timestamp default now()
tradedate date
numshares numeric(25,7)
transactiontype char(1)
tradeprice numeric(25,7)

every time a new tx is created in the table I need to check whether it's the
result of 2 previous transactions being aggregated:

I receive 2 tx with the following values:
securityid, portfolioid, tradedate, numshares, transactiontype, tradeprice
2, 1, '2008-12-08', 2, 'B', 15.23
2, 1, '2008-12-08', 6, 'B', 15.23

Later I'll receive another tx:
2, 1, '2008-12-08', 8, 'B', 15.23

This isn't a new trade but just the sum of the previous 2, so it should be
ignored.

To create the tx in the table I use a function which receives all the values 
and runs the following query to check whether it's a sum of previous txs. (The 
tradedate is passed as a timestamp)

SELECT INTO vpsum sub1.possible_sum
FROM (
    SELECT tq.securityid, date_trunc('hour', tq.datequeued) AS split_tq_time,
           count(*) AS cnt, sum(tq.numshares) AS possible_sum,
           tq.transactiontype, tq.tradeprice, tq.portfolioid
    FROM tx_queue AS tq
    WHERE tq.securityid = 2
      AND tq.tradeprice = 15.23
      AND tq.portfolioid = 1
      AND tq.tradedate = '2008-12-08 02:00:00'
      AND tq.datequeued + interval '1 hour' <= now()  -- tx received more than 1 hour ago
    GROUP BY date_trunc('hour', tq.datequeued), tq.securityid,
             tq.portfolioid, tq.tradeprice, tq.transactiontype
    HAVING count(*) > 1
) AS sub1
WHERE sub1.possible_sum = 8.000
ORDER BY sub1.split_tq_time DESC
LIMIT 1;

If I run this query from the psql client it works just fine. From the function 
it doesn't return anything.
What I discovered is that for it to work from the function I need to explicitly 
cast the tradedate variable to DATE (ie '2008-12-08 02:00:00'::DATE - Note the 
field is of type date).

It would seem that the psql client casts the value automatically. 
Any reason why this should be? 
This inconsistent behaviour makes code much harder to debug.

Regards,
  Stefano


-
Stefano Buliani
Covestor


Re: [SQL] store pdf files

2008-12-09 Thread Frank Bax

ivan marchesini wrote:

> Hi to all...
> I need to create a DB that contains links to some PDF files.
> At the moment these are simple links (to files stored in the file system),
> with the paths stored in a column of a dbf table...
> 
> I need to manage this data considering that the DB I'm going to create
> will be moved in some months to another server, so I think that linking
> to the files' positions in the file system isn't a valid solution...
> 
> Can you suggest a better way to manage this PDF data?
> Each PDF is quite small (<100k).
> 
> Should I use a BLOB?



Leave the files on the filesystem and code the highest-level directory into
a function.
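
One way to read that suggestion (table, column and function names are made up
for the sketch): keep only relative paths in the table and resolve them against
a base directory held in a single function, so moving the files to a new server
means changing one definition.

CREATE TABLE pdf_docs (
    id       serial PRIMARY KEY,
    rel_path text NOT NULL          -- e.g. '2008/report-42.pdf'
);

CREATE FUNCTION pdf_base_dir() RETURNS text AS
$$ SELECT '/srv/pdf'::text $$ LANGUAGE sql STABLE;

SELECT id, pdf_base_dir() || '/' || rel_path AS full_path
FROM pdf_docs;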


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] inconsistent automatic casting between psql and function

2008-12-09 Thread Richard Huxton
Stefano Buliani wrote:
> If I run this query from the psql client it works just fine. From the 
> function it doesn't return anything.
> What I discovered is that for it to work from the function I need to 
> explicitly cast the tradedate variable to DATE (ie '2008-12-08 
> 02:00:00'::DATE - Note the field is of type date).
> 
> It would seem that the psql client casts the value automatically. 
> Any reason why this should be? 
> This inconsistent behaviour makes code much harder to debug.

Seems unlikely. Can't reproduce the problem assuming you're using a
quoted literal as your query shows.

=> CREATE FUNCTION date_test() RETURNS boolean AS $$BEGIN RETURN
current_date = '2008-12-09 02:00:00'; END;$$ LANGUAGE plpgsql;
CREATE FUNCTION
=> SELECT date_test();
 date_test
-----------
 t

=> SELECT current_date = '2008-12-09 02:00:00';
 ?column?
----------
 t

On the other hand, if you are using variable interpolation:

CREATE OR REPLACE FUNCTION date_test2(timestamp) RETURNS boolean AS
$$BEGIN RETURN current_date = $1; END;$$ LANGUAGE plpgsql;
CREATE FUNCTION
=> SELECT date_test2('2008-12-09 02:00:00');
 date_test2
------------
 f

=> SELECT current_date = '2008-12-09 02:00:00'::timestamp;
 ?column?
----------
 f

That's because a quoted literal isn't necessarily a timestamp. Without
context it could be anything, and in the context of comparing to a date
the planner probably tries to make it a date.

Your variable is definitely a timestamp though (you've said so
explicitly) so PG has to decide what it means to compare a date to a
timestamp. It decides the reasonable approach is to turn the date into a
timestamp (by adding '00:00:00' to it) and then the comparison fails.
That seems reasonable to me - you're unlikely to want to discard
information from an equality test.
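
If the function really has to take a timestamp, a sketch of the workaround is
simply to truncate it yourself before comparing (date_test3 is a made-up name,
following the examples above):

CREATE OR REPLACE FUNCTION date_test3(timestamp) RETURNS boolean AS
$$BEGIN RETURN current_date = $1::date; END;$$ LANGUAGE plpgsql;

SELECT date_test3('2008-12-09 02:00:00');  -- true (when run on 2008-12-09)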


The obvious question is - why are you comparing a date to a timestamp in
the first place?

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] store pdf files

2008-12-09 Thread Thomas Kellerer

ivan marchesini wrote on 09.12.2008 11:11:

> Hi to all...
> I need to create a DB that contains links to some PDF files.
> At the moment these are simple links (to files stored in the file system),
> with the paths stored in a column of a dbf table...
> 
> I need to manage this data considering that the DB I'm going to create
> will be moved in some months to another server, so I think that linking
> to the files' positions in the file system isn't a valid solution...
> 
> Can you suggest a better way to manage this PDF data?
> Each PDF is quite small (<100k).
> 
> Should I use a BLOB?


I have had quite good experience storing BLOBs (bytea in Postgres) inside the
database. I do like the transactional safety I get from the database and the
fact that the database can handle several tens of thousands of documents in a
single table without a problem. With a file-based solution you'll need to find
a way to distribute the PDFs over different directories to ensure that the file
count in each directory doesn't get too high (at least we had some problems
storing about 50,000 documents (Word, PDF, ...) on an HP/UX machine in a single
directory, especially when trying to access that directory via FTP).


Thomas


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] store pdf files

2008-12-09 Thread Andreas Joseph Krogh
On Tuesday 09 December 2008 18:57:15 Frank Bax wrote:
> ivan marchesini wrote:
> > Hi to all...
> > I need to create a DB that contains links to some PDF files.
> > At the moment these are simple links (to files stored in the file system),
> > with the paths stored in a column of a dbf table...
> > 
> > I need to manage this data considering that the DB I'm going to create
> > will be moved in some months to another server, so I think that linking
> > to the files' positions in the file system isn't a valid solution...
> > 
> > Can you suggest a better way to manage this PDF data?
> > Each PDF is quite small (<100k).
> > 
> > Should I use a BLOB?
> 
> 
> Leave the files on the filesystem and code the highest-level directory into
> a function.

Absolutely not. You'll lose the transactional safety the DB provides. That reason
alone is good enough to justify storing the files in a BLOB (BYTEA in PG).

The length(BYTEA) function in PG can be used to retrieve a file's size. It is
IMMUTABLE and "constant time", which means you can apply a functional index on
it. That way you don't need to store the size of the file in a separate column.
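
A hypothetical sketch of that layout (table and index names invented here):

CREATE TABLE pdf_store (
    id       serial PRIMARY KEY,
    filename text  NOT NULL,
    content  bytea NOT NULL
);

CREATE INDEX pdf_store_size_idx ON pdf_store (length(content));

-- e.g. find unusually large documents without storing a size column
SELECT id, filename, length(content) AS bytes
FROM pdf_store
WHERE length(content) > 100 * 1024;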

-- 
Andreas Joseph Krogh <[EMAIL PROTECTED]>
Senior Software Developer / CEO
OfficeNet AS
Karenslyst Allé 11
PO. Box 529 Skøyen
0214 Oslo
NORWAY
Tlf:    +47 24 15 38 90
Fax:    +47 24 15 38 91
Mobile: +47 909 56 963
"The most difficult thing in the world is to know how to do a thing
and to watch somebody else doing it wrong, without comment."

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] inconsistent automatic casting between psql and function

2008-12-09 Thread Richard Huxton
Stefano Buliani wrote:
> Richard,
> 
> understand I shouldn't be comparing a date to a timestamp. Fact is I
> need the full timestamp to process other info in the rest of the function.
> 
> My question is: why is the planner casting the timestamp to date when I
> run the query from psql and the other way around from the function?

It's not. As I said, a quoted literal isn't necessarily a timestamp.

This: '2008-12-09 18:23:00' is not a timestamp.

It is an untyped quoted literal that contains something I'll grant
*looks* like a timestamp, but we can't tell what it is really supposed
to be until it's used.

SELECT length('2008-12-09 18:00:00');

Here it must be text (because we don't have a length() defined for
timestamps - see \df length).

=> SELECT date_trunc('month', '2008-12-09 18:00:00');
ERROR:  function date_trunc(unknown, unknown) is not unique
LINE 1: SELECT date_trunc('month', '2008-12-09 18:00:00');
               ^
HINT:  Could not choose a best candidate function. You might need to add
explicit type casts.

Here it couldn't decide (\df date_trunc to see what it was choosing between)

And in the next one it guesses it has an interval (because that's what
the other thing is, I'm guessing).

=> SELECT '2008-12-09 18:00:00' + '2 hours'::interval;
ERROR:  invalid input syntax for type interval: "2008-12-09 18:00:00"


So - it's not casting from timestamp to date, it's casting from
"unknown" to date in your interactive sql.

-- 
  Richard Huxton
  Archonet Ltd

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] inconsistent automatic casting between psql and function

2008-12-09 Thread Tom Lane
Richard Huxton <[EMAIL PROTECTED]> writes:
> That's because a quoted literal isn't necessarily a timestamp. Without
> context it could be anything, and in the context of comparing to a date
> the planner probably tries to make it a date.

I think the real point here is this:

regression=# select '2008-12-09 02:00:00'::date;
    date
------------
 2008-12-09
(1 row)

ie, when it does decide that a literal should be a date, it will happily
throw away any additional time-of-day fields that might be in there.
Had it raised an error, Stefano might have figured out his mistake
sooner.

ISTM we deliberately chose this behavior awhile back, but I wonder
whether it does more harm than good.

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql