[SQL] how to transform list to table and evaluate an expression on it?
Hello, in several apps running over postgres 8.0/8.1 I've used following "full-text" search engine - imagine there are two tables: 1) Documents - with "docId", "title", "abstract" and "body" columns 2) Index - with "docId", "word" and "score" It's very easy to build the Index from the Documents table - for each document (row in the Documents table) a list of words is built, and the number of occurences is used as the score (there are different possibilities, of course). When searching for a string, at first I search for documents with at least one of the words in the string. So for example when searching for "apples oranges", at first I do something like SELECT docId, word, score FROM Index where word IN ('apples','oranges'); Then I fetch all these results to PHP, and in PHP, I transform this "list" of words into two-dimensional associative array with "document ID" as a key at first level and words at second level. $results = array( $docId1 => array( word1 => score, word2 => score, ... ) $docId2 => array( word1 => score, word2 => score, ... ) ... ); And this array is processed (in PHP) to get the final score - each OR connective is translated as MAX() and each AND connective is translated as MIN() function (on each element of the array). For example the "apples AND oranges" would be translated as MIN(score for apples, score for oranges) Then I have a list of (docId,score) pairs, so I can fetch details for the documents I want (the ones with the highest score, for example). QUESTIONS & PROBLEMS What I really don't like on the "algorithm" is the need to process the data outside of the database - fetch them into the PHP, process them and query the database again to get the final result. It's not fast, it's not shared for different parts of the projects written in different languages, etc. I'd like to transform all this processing into PL/pgSQL (or maybe PL/Perl or something else) function, but I don't know 1) How to create a "table" in the form documtent_id | word_1 | word_2 | ... | word_n - 1345 | 11 | 12 | ... | 2 1202 |1 |0 | ... | 12 . .. . . . .. . . 1129 | 20 |1 | ... | 0 from the query SELECT docId, word, score FROM Index WHERE word IN (word_1,..,word_n) 2) How to evaluate the function derived from the 'search string' on this table, but this probably will work as an EXECUTE statement or something like that. Maybe there's a completely different solution to all of this, which I can't see. Thanks for your advices Tomas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[SQL] problem comparing strings when different cluster / database encoding
Greetings, I've encountered a strange problem. We have a PG 8.0.x database cluster (in the sense used in initdb, i.e. bunch of databases) created with UNICODE encoding, namely cs_CZ.UTF-8 locale. When a database is created with a different encoding (in our case it's LATIN2) the string comparison doesn't work correctly. For example the query SELECT 'ě' = 'é'; returns 'true' which is obviously incorrect, as those two letters have different accents (I hope you can see that). And of course, it's not possible to create an unique index (or primary key) over a column of words (for example in a dictionnary), because false collisions are found, and the sorting works in a really strange way too. If the both cluster and database are in the same encoding (UNICODE or LATIN2), everything works fine. Below is a short description how the database cluster and the databases have been created. -- $ export LANG="cs_CZ.UTF-8" $ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale) $ ... (postgres started, users created, etc.) $ createdb -E LATIN2 my_database; $ psql my_database; > SELECT 'ě' = 'é'; (returns 'true', which is incorrect) -- If we create the cluster with LATIN2 encoding (or on the contrary the database is created with UNICODE encoding), everything works fine. For example the following works as expected. -- $ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used) $ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale) $ ... (postgres started, users created, etc.) $ createdb -E LATIN2 my_database; $ psql my_database; > SELECT 'ě' = 'é'; (returns 'false', which is correct) -- I'm trying to solve this for several days, but unsuccesfully. Is there something I've missed? Some obvious solution I don't see? The queston is why we need different encodings for cluster / databases. (a) Until recently we've used LATIN2 cluster and LATIN2 databases (and applications expecting LATIN2 encoding) - that's the reason why we need LATIN2 databases. (c) On the other way some of the new clients want to 'internationalize' their applications, so we need UNICODE infrastructure too - that's the reason why we use UNICODE cluster and databases. I've came accross the nls_string function - with it it works fine, but that's not an option for us, as it would require rewriting all the SQL queries in the applications (and that's something we don't want). Thanks for your suggestions Tomas ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] help with pagila
> But, when I add another column on select, like, film_description, I get > the following error: > > "ERROR: column "film.description" must appear in the GROUP BY clause or > be used in an aggregate function" > > If I put that column on GROUP BY everything works ok. But I want > understant why do I need to do that. Can someone teach me, please? The reason is pretty simple - GROUPing actually means "sorting into boxes by values in the columns after the GROUP BY keyword" (and then applying some aggregate functions to these boxes, as for example COUNT, AVG etc.) Besides these aggregates, you can select a column that 'constant' for each of the boxes, that is all the rows in that box have the same value in this column. That's the case of the first SQL query you've posted - you're grouping by 'film_id', thus all the rows in a box have the same value in this column. And thanks to this you can select the value in the SELECT. But in the second query, you'd like to select another column (directly, not through an aggregate function) - title. But there could be different values for each row in the box (PostgreSQL doesn't know that the ID uniquely identifies the title). For example imagine you would group by 'published_year' instead of the 'film_id' - in this case there would be many different movies in the same box, in which case it's impossible to select 'title' for all of them. There are two ways to solve this: 1) add the 'title' to the GROUP BY clause, thus all the rows in a box have the same value of 'title' (and postgresql knows about that) 2) use a subselect film_id, (SELECT title FROM film WHERE film_id = film_outer.film_id) AS title FROM film AS film_outer ... GROUP BY film_id; Tomas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [SQL] help with pagila
> So I can assume that the MySQL implementation is strange? (It accepts > that kind of query) Yes, MySQL behaves strangely in this case (as well as in several other cases). I wouldn't rely on this as it probably can choose different values each time (although as far as I remember I haven't seen this). t.v. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[SQL] implementing (something like) UNIQUE constraint using PL/pgSQL
Hello, in our application we need to implement a constraint that enforces 'at most N rows with this value', that is we have a table with 'flag' column and for each value there should be at most 10 rows (for example, the exact number does not matter). I'm trying to implement a PL/pgSQL trigger to enforce this constraint, but I'm not sure my solution is 'the best one possible'. The first solution I came up with is this: = CREATE OR REPLACE FUNCTION at_most() RETURNS trigger AS $$ DECLARE v_cnt INTEGER; p_cnt INTEGER; BEGIN IF TG_NARGS >= 1 THEN p_cnt := TG_ARGV[0]::integer; ELSE p_cnt := 1; END IF; SELECT COUNT(*) INTO v_cnt FROM my_table WHERE flag = NEW.flag; IF v_cnt > p_cnt THEN RAISE EXCEPTION 'Too many rows with this flag!' END IF; RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER at_most AFTER INSERT OR UPDATE ON my_table FOR EACH ROW EXECUTE PROCEDURE at_most(10); = But that obviously does not work as two sessions can reach the SELECT INTO statement at the same time (or until one of them commits). Thus there could be more than 'cnt' rows with the same value. Then I used a 'FOR UPDATE' lock on a separate 'lock table' that already contains all possible values of 'flag' (in reality the trigger tries to insert that value and catches the 'duplicate key' exception but that's not important here). The trigger is then = CREATE OR REPLACE FUNCTION at_most() RETURNS trigger AS $$ DECLARE v_cnt INTEGER; p_cnt INTEGER; BEGIN IF TG_NARGS >= 1 THEN p_cnt := TG_ARGV[0]::integer; ELSE p_cnt := 1; END IF; PERFORM flag FROM lock_table WHERE flag = NEW.flag FOR UPDATE; SELECT COUNT(*) INTO v_cnt FROM my_table WHERE flag = NEW.flag; IF v_cnt > p_cnt THEN RAISE EXCEPTION 'Too many rows with this flag!'; END IF; RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER at_most AFTER INSERT OR UPDATE ON my_table FOR EACH ROW EXECUTE PROCEDURE at_most(10); = This works (af far as I know), but I'm not sure it's the best possible solution - for example I'd like to remove the lock table. Is there some 'more elegant' way to do this? Tomas PS: Is there some up to date 'best practices' book related to PL/pgSQL? All books I've found on Amazon are pretty old (about 5 years) or are related to 'beginners' or different areas of development (PHP, etc.) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [SQL] Differentiate Between Zero-Length String and NULLColumn Values
> On Tue, Jan 30, 2007 at 02:38:07PM +0100, Bart Degryse wrote: > >> Andrew, I think you're wrong stating that Oracle would interpret >> NULL and empty string as equal. The Oracle databases I use (8, 9 >> and 10) certainly make a distiction between both values. Maybe >> earlier versions did so, that I don't know. > > Hmm. Well, I'm not an Oracle guy, so I don't really know. All I > know is that we occasionally get people coming from Oracle who are > surprised by this difference. What I've been _told_ is that '' and > NULL are under some circumstances (maybe integers?) the same thing, > whereas of course ' ' and NULL are not. But since I'm not an Oracle > user, people should feel free to ignore me :) I've recently read some books on Oracle, so probably the best thing I can do is to quote a paragraph on this from "Oracle PL/SQL programming" from O'Reilly: In Oracle SQL and PL/SQL, a null string is 'usually' indistiguishable from a literal of zero characters, represented literally as ''. For example the following expression will evaluate to TRUE both in SQL and PL/SQL: '' IS NULL Assigning a zero-length string to a VARCHAR2(n) variable in PL/SQL also yields a NULL result: DECLARE str VARCHAR2(1) := ''; BEGIN IF str IS NULL -- will be TRUE This behavior is consistent with Oracle's treatment of VARCHAR2 table columns. ... These examples illustrate Oracle's partial adherence to the 92 and 99 versions of the ANSI SQL standard, which mandates a difference between a zero-length string and a NULL string. Oracle admits the difference, and says they may fully adopt the standard in the future. They've been issuing that warning for about 10 years, though, and it hasn't happened yet. ... Note: This does not apply to the CHAR(n) columns - these are blank-padded. Tomas ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster