[SQL] A DISTINCT problem removing duplicates
The scenario: a list of documents, some of which may be (near) duplicates of others. One document can be in many duplicate sets, and a duplicate set can contain many documents. I want to see a list with only one document (any one) from each duplicate set. There's an example script attached.

So:

  documents  (docid SERIAL, title text, PRIMARY KEY (docid));
  duplicates (docid int REFERENCES documents, dup_set SERIAL,
              PRIMARY KEY (docid, dup_set));

This allows one document to be part of multiple duplicate sets, but that's fine - this is a "fuzzy" match. If two documents match and one of them is already in the duplicates table, I add the second with the same dup_set value. If neither is present, I generate a new set number. Match documents in a well-defined order and it's all nice and simple.

A self-join on the duplicates table gives me a count of how many duplicates each document has. A left join from the documents table can list documents and if/how many duplicates they have. The problem comes when I don't want to see duplicates:

  SELECT DISTINCT ON (dup_set)
         ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
  FROM documents d
  LEFT JOIN (
      SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
      FROM duplicates dup1
      JOIN duplicates dup2 USING (dup_set)
      GROUP BY dup1.docid, dup1.dup_set
  ) ds USING (docid)
  ORDER BY dup_set, docid;

Documents without duplicates have a NULL dup_set, and DISTINCT ON considers two NULLs to be equal, which means we only ever see one unduplicated document.

I've got two work-arounds. The first is to create a separate sequence whose values can't overlap with dup_set's and use that:

  CREATE SEQUENCE not_duplicate_seq MINVALUE -99 MAXVALUE -1 CYCLE;

  SELECT DISTINCT ON (dup_set)
         COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
         d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
  ...

That works, but is arguably a bit too "clever", if you know what I mean. The other alternative is to separate duplicated and non-duplicated documents and UNION them. That's simple enough to see what's happening, but it does seem ugly.

Anyone got anything more elegant? I'm happy to alter the duplicates table so long as it doesn't make it complicated to update.
--
  Richard Huxton
  Archonet Ltd


BEGIN;

CREATE SCHEMA duptest;
SET search_path = duptest;

CREATE TABLE documents (
    docid SERIAL NOT NULL,
    title text,
    PRIMARY KEY (docid)
);

CREATE TABLE duplicates (
    docid   int NOT NULL REFERENCES documents ON DELETE CASCADE,
    dup_set SERIAL NOT NULL,
    PRIMARY KEY (docid, dup_set)
);

-- Six documents
INSERT INTO documents (docid, title)
SELECT i, 'document number ' || i FROM generate_series(1, 6) i;

-- duplicates are (1,3) and (2,4) - 5,6 are not
INSERT INTO duplicates (docid, dup_set)
SELECT i, (i % 2) + 1 FROM generate_series(1, 4) i;

SELECT setval('documents_docid_seq',    (SELECT max(docid)   + 1 FROM documents));
SELECT setval('duplicates_dup_set_seq', (SELECT max(dup_set) + 1 FROM duplicates));

-- This is a list of all documents with how many duplicates they have
SELECT dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM documents d
LEFT JOIN (
    SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
    FROM duplicates dup1
    JOIN duplicates dup2 USING (dup_set)
    GROUP BY dup1.docid, dup1.dup_set
) ds USING (docid)
ORDER BY dup_set, docid;

-- This DOESN'T work because nulls are considered equal by DISTINCT ON
SELECT DISTINCT ON (dup_set)
       dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM documents d
LEFT JOIN (
    SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
    FROM duplicates dup1
    JOIN duplicates dup2 USING (dup_set)
    GROUP BY dup1.docid, dup1.dup_set
) ds USING (docid)
ORDER BY dup_set, docid;

-- Work-around is to fake a unique id for non-duplicates
CREATE SEQUENCE not_duplicate_seq MINVALUE -99 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set)
       COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
       d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM documents d
LEFT JOIN (
    SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
    FROM duplicates dup1
    JOIN duplicates dup2 USING (dup_set)
    GROUP BY dup1.docid, dup1.dup_set
) ds USING (docid)
ORDER BY dup_set, docid;

-- Second alternative - UNION duplicated and non-duplicated documents
SELECT DISTINCT ON (dup_set)
       ds.dup_set, d.docid, d.title, num_dups
FROM documents d
LEFT JOIN (
    SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
    FROM duplicates dup1
    JOIN duplicates dup2 USING (dup_set)
    GROUP BY dup1.docid, dup1.dup_set
) ds USING (docid)
WHERE num_dups > 0
UNION ALL
SELECT ds.dup_set, d.docid, d.title, COALESCE(num_dups, 0) AS num_dups
FROM documents d
LEFT JOIN (
    SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
    FROM duplicates dup1
    JOIN duplicates dup2 USING (dup_set)
    GROUP BY dup1.docid, dup1.dup_set
) ds USING (docid)
WHERE ds.dup_set IS NULL
ORDER BY dup_set, docid;

ROLLBACK;
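One further variant, sketched here for comparison and not taken from the thread: both docid and dup_set come from SERIAL sequences that only hand out positive numbers, so -docid can never collide with a real dup_set value and can stand in as a per-document key for the unduplicated rows, without a separate sequence.

  -- Sketch only (not part of the original script): use -docid as the
  -- DISTINCT ON key for documents that have no dup_set. Assumes docid and
  -- dup_set are both positive, which is the SERIAL default above.
  SELECT DISTINCT ON (COALESCE(ds.dup_set, -d.docid))
         ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
  FROM documents d
  LEFT JOIN (
      SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
      FROM duplicates dup1
      JOIN duplicates dup2 USING (dup_set)
      GROUP BY dup1.docid, dup1.dup_set
  ) ds USING (docid)
  ORDER BY COALESCE(ds.dup_set, -d.docid), d.docid;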
Re: [SQL] DIVISION with Group By
hdhgdh mjhff wrote:
> Hi all,
>
> I am new to SQL coding and I need some help writing SQL.
> In my task I have to calculate the division of two sub-queries.
> Each sub-query has its own GROUP BY clause.

Can you provide a simpler example? It's not obvious to me what you're
trying to do.

--
  Richard Huxton
  Archonet Ltd
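For reference while waiting for that simpler example, the usual shape of this kind of query is two grouped sub-queries joined on their shared key, with the division done in the outer SELECT. A minimal sketch, assuming a hypothetical sales(sale_date date, amount numeric) table - none of these names come from the original question:

  -- Hypothetical example: each month's total as a share of its year's total.
  SELECT m.yr, m.mon,
         m.month_total / NULLIF(y.year_total, 0) AS share_of_year
  FROM (
      SELECT date_trunc('year', sale_date)  AS yr,
             date_trunc('month', sale_date) AS mon,
             sum(amount) AS month_total
      FROM sales
      GROUP BY 1, 2
  ) m
  JOIN (
      SELECT date_trunc('year', sale_date) AS yr,
             sum(amount) AS year_total
      FROM sales
      GROUP BY 1
  ) y USING (yr)
  ORDER BY m.yr, m.mon;

The NULLIF guards against division by zero when a year's total is zero.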
Re: [SQL] A DISTINCT problem removing duplicates
Richard Huxton <[EMAIL PROTECTED]> writes:
> Anyone got anything more elegant?

Seems to me that no document should have an empty dup_set. If it's not
a match to any existing document, then immediately assign a new dup_set
number to it.

			regards, tom lane
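Against the example schema from the first message, that scheme might look like the following sketch (docids 5 and 6 are the two unduplicated documents there; the statements are illustrative, not from the thread):

  -- A document that matches nothing gets its own set number straight away;
  -- the SERIAL default on dup_set supplies it.
  INSERT INTO duplicates (docid) VALUES (5);

  -- A document found to match an existing one joins that document's set.
  INSERT INTO duplicates (docid, dup_set)
  SELECT 6, dup_set FROM duplicates WHERE docid = 3;

  -- Every document now has at least one dup_set, so the report needs no
  -- special case for NULLs:
  SELECT DISTINCT ON (dup_set) dup_set, docid
  FROM duplicates
  ORDER BY dup_set, docid;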
Re: [SQL] A DISTINCT problem removing duplicates
Tom Lane wrote:
> Richard Huxton <[EMAIL PROTECTED]> writes:
>> Anyone got anything more elegant?
>
> Seems to me that no document should have an empty dup_set. If it's not
> a match to any existing document, then immediately assign a new dup_set
> number to it.

That was my initial thought too, but it means when I actually find a
duplicate I have to decide which "direction" to renumber them in. It also
means probably keeping a summary table with counts to show which are
duplicates, since the duplicates table is now the same size as the
documents table.

--
  Richard Huxton
  Archonet Ltd
Re: [SQL] A DISTINCT problem removing duplicates
Richard Huxton <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Richard Huxton <[EMAIL PROTECTED]> writes:
>>> Anyone got anything more elegant?
>>
>> Seems to me that no document should have an empty dup_set. If it's not
>> a match to any existing document, then immediately assign a new dup_set
>> number to it.

> That was my initial thought too, but it means when I actually find a
> duplicate I have to decide which "direction" to renumber them in.

Hmm, so you mean you might decide that two docs are duplicates sometime
after initially putting them both in the database? Seems like you have
issues with that anyway. If you already know A,B are dups and separately
that C,D are dups, and you later decide B and C are dups, what do you do?

			regards, tom lane
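One mechanical answer to that last question, sketched against the example schema and not something proposed in the thread, is to fold one set into the other when two sets turn out to overlap:

  -- Merge dup_set 2 into dup_set 1. Drop rows that would collide with an
  -- existing (docid, 1) entry first, so the primary key isn't violated,
  -- then renumber the remainder.
  BEGIN;
  DELETE FROM duplicates d
  WHERE d.dup_set = 2
    AND EXISTS (SELECT 1 FROM duplicates x
                WHERE x.docid = d.docid AND x.dup_set = 1);
  UPDATE duplicates SET dup_set = 1 WHERE dup_set = 2;
  COMMIT;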
Re: [SQL] A DISTINCT problem removing duplicates
Tom Lane wrote:
> Richard Huxton <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Richard Huxton <[EMAIL PROTECTED]> writes:
>>>> Anyone got anything more elegant?
>>>
>>> Seems to me that no document should have an empty dup_set. If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
>
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
>
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database?

Yep - checking for duplicates can be a slow process. It's O(n^2) over the
number of documents, and document comparisons are probably O(n^2) over
length (or over the number of similarly-sized word-runs, anyway). I'm
failing comparisons as early as I can, but there's a trade-off between
speed and false negatives.

> Seems like you have issues with that anyway. If you already know A,B are
> dups and separately that C,D are dups, and you later decide B and C are
> dups, what do you do?

Not necessarily a problem. I'm using "duplicate" very loosely here - it's
more like "very similar to" - so it's entirely possible to have sets
(a,b) (b,c) (c,d) and for everything to be valid, just by adding sentences
to the end of each document. Similarity scoring should allow for
insertion/deletion of single words or of whole (quite extensive) blocks of
text.

Of course, at the moment, as I tweak what I mean by "duplicate" I have to
re-run the check over at least a sizeable chunk of the documents to see
whether I prefer the result.

Oh - the comparison is outside the DB at the moment, but it's based on the
stemmed tsvector of each document anyway, so it's crying out to be pushed
into the DB once I'm happy it works.

--
  Richard Huxton
  Archonet Ltd
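As an aside, here is a sketch of what an in-database pass could look like using the pg_trgm contrib module's trigram similarity() - a different measure from the stemmed-tsvector score described above. The 0.8 threshold is an arbitrary placeholder, and the comparison runs over title only because that's the one text column in the example schema; the real check would run over the document body.

  -- Sketch only: score each pair of documents once and keep the close ones.
  -- Requires the pg_trgm contrib module (CREATE EXTENSION pg_trgm on 9.1+).
  SELECT a.docid AS doc_a, b.docid AS doc_b,
         similarity(a.title, b.title) AS score
  FROM documents a
  JOIN documents b ON b.docid > a.docid      -- each pair considered once
  WHERE similarity(a.title, b.title) > 0.8
  ORDER BY score DESC;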
[SQL] store pdf files
Hi to all,

I need to create a db that contains links to some pdf files. At the moment
these are simple links to the files (which are stored in the file system),
with the paths stored in a column of a dbf table.

I need to manage this data considering that the db I'm going to create will
be moved in a few months to another server, so I think that linking to the
files' positions in the file system isn't a valid solution.

Can you suggest a better way to manage this pdf data? Each pdf is quite
small (<100k). Should I use a BLOB?

Many thanks,

Ivan

--
  Ivan Marchesini
  Department of Civil and Environmental Engineering
  University of Perugia
[SQL] inconsistent automatic casting between psql and function
Hello all,

I'm experiencing a strange problem with PostgreSQL 8.3.4. I have the
following table:

  tx_queue (
      txid            serial,
      securityid      integer,
      portfolioid     integer,
      datequeued      timestamp default now(),
      tradedate       date,
      numshares       numeric(25,7),
      transactiontype char(1),
      tradeprice      numeric(25,7)
  )

Every time a new tx is created in the table I need to check whether it's
the result of two previous transactions being aggregated. I receive 2 txs
with the following values:

  securityid, portfolioid, tradedate, numshares, transactiontype, tradeprice
  2, 1, '2008-12-08', 2, 'B', 15.23
  2, 1, '2008-12-08', 6, 'B', 15.23

Later I'll receive another tx:

  2, 1, '2008-12-08', 8, 'B', 15.23

This isn't a new trade but just the sum of the previous two, so it should
be ignored. To create the tx in the table I use a function which receives
all the values and runs the following query to check whether it's a sum of
previous txs (the tradedate is passed as a timestamp):

  SELECT INTO vpsum sub1.possible_sum
  FROM (
      SELECT tq.securityid, date_trunc('hour', tq.datequeued) AS split_tq_time,
             count(*) AS cnt, sum(tq.numshares) AS possible_sum,
             tq.transactiontype, tq.tradeprice, tq.portfolioid
      FROM tx_queue AS tq
      WHERE tq.securityid = 2
        AND tq.tradeprice = 15.23
        AND tq.portfolioid = 1
        AND tq.tradedate = '2008-12-08 02:00:00'
        AND tq.datequeued + interval '1 hour' <= now()  -- tx received more than 1 hour ago
      GROUP BY date_trunc('hour', tq.datequeued), tq.securityid,
               tq.portfolioid, tq.tradeprice, tq.transactiontype
      HAVING count(*) > 1
  ) AS sub1
  WHERE sub1.possible_sum = 8.000
  ORDER BY sub1.split_tq_time DESC
  LIMIT 1;

If I run this query from the psql client it works just fine. From the
function it doesn't return anything. What I discovered is that for it to
work from the function I need to explicitly cast the tradedate variable to
DATE (i.e. '2008-12-08 02:00:00'::DATE - note the field is of type date).

It would seem that the psql client casts the value automatically. Is there
a reason why this should be? This inconsistent behaviour makes code much
harder to debug.

Regards,

Stefano Buliani
Covestor
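A minimal sketch of the fix described above, with invented function and parameter names and the aggregation logic stripped down to the one comparison that matters - the point is only the ::date cast on the timestamp parameter (the replies below explain why it's needed):

  -- Names are made up for illustration; only the cast matters.
  CREATE OR REPLACE FUNCTION tx_is_aggregate(in_securityid int,
                                             in_tradedate  timestamp)
  RETURNS boolean AS $$
  BEGIN
      RETURN EXISTS (
          SELECT 1
          FROM tx_queue tq
          WHERE tq.securityid = in_securityid
            AND tq.tradedate  = in_tradedate::date  -- the explicit cast the
                                                    -- message above describes
      );
  END;
  $$ LANGUAGE plpgsql;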
Re: [SQL] store pdf files
ivan marchesini wrote:
> I need to create a db that contain link to some pdf files..
> [...]
> can you suggest me a better way to manage this pdf data??
> Each pdf is quite small (<100k)
> should I use BLOB?

Leave the files on the filesystem and code the highest-level directory into
a function.
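A sketch of that suggestion - the function name, path, and table are invented: keep only the relative path in the database and resolve it against one base directory, so moving servers means changing a single function.

  -- Sketch only; '/srv/pdf/' and the pdf_links table are placeholders.
  CREATE OR REPLACE FUNCTION pdf_base_dir() RETURNS text AS $$
      SELECT '/srv/pdf/'::text;
  $$ LANGUAGE sql IMMUTABLE;

  -- e.g.
  --   SELECT pdf_base_dir() || relative_path AS full_path FROM pdf_links;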
Re: [SQL] inconsistent automatic casting between psql and function
Stefano Buliani wrote:
> If I run this query from the psql client it works just fine. From the
> function it doesn't return anything.
> What I discovered is that for it to work from the function I need to
> explicitly cast the tradedate variable to DATE (ie '2008-12-08
> 02:00:00'::DATE - Note the field is of type date).
>
> It would seem that the psql client casts the value automatically.
> Any reason why this should be?
> This inconsistent behaviour makes code much harder to debug.

Seems unlikely. Can't reproduce the problem, assuming you're using a quoted
literal as your query shows.

  => CREATE FUNCTION date_test() RETURNS boolean AS $$
     BEGIN RETURN current_date = '2008-12-09 02:00:00'; END;
     $$ LANGUAGE plpgsql;
  CREATE FUNCTION

  => SELECT date_test();
   date_test
  -----------
   t

  => SELECT current_date = '2008-12-09 02:00:00';
   ?column?
  ----------
   t

On the other hand, if you are using variable interpolation:

  => CREATE OR REPLACE FUNCTION date_test2(timestamp) RETURNS boolean AS $$
     BEGIN RETURN current_date = $1; END;
     $$ LANGUAGE plpgsql;
  CREATE FUNCTION

  => SELECT date_test2('2008-12-09 02:00:00');
   date_test2
  ------------
   f

  => SELECT current_date = '2008-12-09 02:00:00'::timestamp;
   ?column?
  ----------
   f

That's because a quoted literal isn't necessarily a timestamp. Without
context it could be anything, and in the context of comparing to a date the
planner probably tries to make it a date. Your variable is definitely a
timestamp though (you've said so explicitly), so PG has to decide what it
means to compare a date to a timestamp. It decides the reasonable approach
is to turn the date into a timestamp (by adding '00:00:00' to it), and then
the comparison fails.

That seems reasonable to me - you're unlikely to want to discard
information in an equality test. The obvious question is: why are you
comparing a date to a timestamp in the first place?

--
  Richard Huxton
  Archonet Ltd
Re: [SQL] store pdf files
ivan marchesini wrote on 09.12.2008 11:11:
> I need to create a db that contain link to some pdf files..
> [...]
> can you suggest me a better way to manage this pdf data??
> Each pdf is quite small (<100k)
> should I use BLOB?

I have had quite good experience storing BLOBs (bytea in Postgres) inside
the database. I like the transactional safety I get from the database, and
the fact that the database can handle several tens of thousands of
documents in a single table without a problem.

With a file-based solution you'll need to find a way to distribute the PDFs
over different directories to ensure that the file count in each directory
doesn't get too high. (At least we had some problems storing about 50,000
documents (Word, pdf, ...) in a single directory on an HP/UX machine,
especially when trying to access that directory via ftp...)

Thomas
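A sketch of that bytea approach (table and column names invented): the files become ordinary row data, so they move with every dump/restore and are covered by the same transactions as the rest of the schema.

  -- Sketch only.
  CREATE TABLE pdf_files (
      id       serial PRIMARY KEY,
      filename text NOT NULL,
      added    timestamp NOT NULL DEFAULT now(),
      content  bytea NOT NULL
  );

  -- The client inserts the file's bytes as an ordinary bound parameter, e.g.
  --   INSERT INTO pdf_files (filename, content) VALUES ($1, $2);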
Re: [SQL] store pdf files
On Tuesday 09 December 2008 18:57:15 Frank Bax wrote:
> ivan marchesini wrote:
>> can you suggest me a better way to manage this pdf data??
>> Each pdf is quite small (<100k)
>> should I use BLOB?
>
> Leave the files on filesystem and code the highest level directory into
> a function.

Absolutely not. You'll lose the transactional safety the DB provides. That
reason alone is good enough to justify storing the files in a BLOB (BYTEA
in PG).

The length(BYTEA) function in PG can be used to retrieve a value's length.
It is IMMUTABLE and "constant time", which means you can put a functional
index on it. That way you don't need to store the size of the file in a
separate column.

--
  Andreas Joseph Krogh <[EMAIL PROTECTED]>
  Senior Software Developer / CEO, OfficeNet AS
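A sketch of that functional index, against the hypothetical pdf_files table from the earlier sketch; length(bytea) returns the size in bytes and is immutable, so the expression index is accepted:

  -- Index the stored size so it never needs a separate column.
  CREATE INDEX pdf_files_size_idx ON pdf_files (length(content));

  -- e.g. find anything over 100 kB:
  SELECT id, filename, length(content) AS size_bytes
  FROM pdf_files
  WHERE length(content) > 100 * 1024;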
Re: [SQL] inconsistent automatic casting between psql and function
Stefano Buliani wrote:
> Richard,
>
> I understand I shouldn't be comparing a date to a timestamp. Fact is I
> need the full timestamp to process other info in the rest of the
> function.
>
> My question is: why is the planner casting the timestamp to date when I
> run the query from psql, and the other way around from the function?

It's not. As I said, a quoted literal isn't necessarily a timestamp. This:

  '2008-12-09 18:23:00'

is not a timestamp. It is an untyped quoted literal that contains something
I'll grant *looks* like a timestamp, but we can't tell what it is really
supposed to be until it's used.

  SELECT length('2008-12-09 18:00:00');

Here it must be text (because we don't have a length() defined for
timestamps - see \df length).

  => SELECT date_trunc('month', '2008-12-09 18:00:00');
  ERROR:  function date_trunc(unknown, unknown) is not unique
  LINE 1: SELECT date_trunc('month', '2008-12-09 18:00:00');
                 ^
  HINT:  Could not choose a best candidate function. You might need to add
  explicit type casts.

Here it couldn't decide (\df date_trunc to see what it was choosing
between).

And in the next one it guesses it has an interval (because that's what the
other thing is, I'm guessing):

  => SELECT '2008-12-09 18:00:00' + '2 hours'::interval;
  ERROR:  invalid input syntax for type interval: "2008-12-09 18:00:00"

So - it's not casting from timestamp to date, it's casting from "unknown"
to date in your interactive SQL.

--
  Richard Huxton
  Archonet Ltd
Re: [SQL] inconsistent automatic casting between psql and function
Richard Huxton <[EMAIL PROTECTED]> writes:
> That's because a quoted literal isn't necessarily a timestamp. Without
> context it could be anything, and in the context of comparing to a date
> the planner probably tries to make it a date.

I think the real point here is this:

  regression=# select '2008-12-09 02:00:00'::date;
      date
  ------------
   2008-12-09
  (1 row)

ie, when it does decide that a literal should be a date, it will happily
throw away any additional time-of-day fields that might be in there. Had it
raised an error, Stefano might have figured out his mistake sooner.

ISTM we deliberately chose this behavior awhile back, but I wonder whether
it does more harm than good.

			regards, tom lane