[SQL] how to transform list to table and evaluate an expression on it?

2006-01-07 Thread Tomas Vondra
Hello,

in several apps running over postgres 8.0/8.1 I've used the following
"full-text" search engine - imagine there are two tables:

1) Documents - with "docId", "title", "abstract" and "body" columns
2) Index - with "docId", "word" and "score"

It's very easy to build the Index from the Documents table - for each
document (row in the Documents table) a list of words is built, and the
number of occurrences is used as the score (there are different
possibilities, of course).

When searching for a string, at first I search for documents with at
least one of the words in the string. So for example when searching for
"apples oranges", at first I do something like

SELECT docId, word, score FROM Index where word IN ('apples','oranges');

Then I fetch all these results into PHP, and in PHP I transform this
"list" of words into a two-dimensional associative array, with the
document ID as the key at the first level and the words at the second level.

$results = array(
    $docId1 => array(
        word1 => score,
        word2 => score,
        ...
    ),
    $docId2 => array(
        word1 => score,
        word2 => score,
        ...
    ),
    ...
);

And this array is processed (in PHP) to get the final score - each OR
connective is translated into MAX() and each AND connective into MIN()
(applied to the elements of the array).

For example the "apples AND oranges" would be translated as

MIN(score for apples, score for oranges)

Then I have a list of (docId,score) pairs, so I can fetch details for
the documents I want (the ones with the highest score, for example).

QUESTIONS & PROBLEMS


What I really don't like about this "algorithm" is the need to process
the data outside the database - fetch the rows into PHP, process them,
and query the database again to get the final result. It's not fast,
it's not shared between parts of the project written in different
languages, etc.

I'd like to transform all this processing into PL/pgSQL (or maybe
PL/Perl or something else) function, but I don't know

1) How to create a "table" in the form

       document_id | word_1 | word_2 | ... | word_n
      -------------+--------+--------+-----+--------
              1345 |   11   |   12   | ... |      2
              1202 |    1   |    0   | ... |     12
               ... |   ...  |   ...  | ... |    ...
              1129 |   20   |    1   | ... |      0

   from the query

   SELECT docId, word, score FROM Index WHERE word IN (word_1,..,word_n)

2) How to evaluate the function derived from the 'search string' on this
   table, but this probably will work as an EXECUTE statement or
   something like that.
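The best I've come up with so far for (1) is a CASE-based pivot - the
column list would have to be generated from the search string and run via
EXECUTE, so this is only a sketch for the two-word case:

```sql
-- one row per document, one column per search word;
-- a document missing a word gets score 0 for that column
SELECT docId,
       MAX(CASE WHEN word = 'apples'  THEN score ELSE 0 END) AS apples,
       MAX(CASE WHEN word = 'oranges' THEN score ELSE 0 END) AS oranges
  FROM Index
 WHERE word IN ('apples', 'oranges')
 GROUP BY docId;
```

but maybe there's a better way to do the pivoting itself.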

Maybe there's a completely different solution to all of this, which I
can't see.
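For example, when the expression is a plain AND (or OR) of words, with no
nesting, it seems the whole thing collapses into a single grouped query
(MIN plus a HAVING clause requiring all the words for AND, MAX for OR):

```sql
-- "apples AND oranges"
SELECT docId, MIN(score) AS score
  FROM Index
 WHERE word IN ('apples', 'oranges')
 GROUP BY docId
HAVING COUNT(DISTINCT word) = 2
 ORDER BY score DESC;

-- "apples OR oranges"
SELECT docId, MAX(score) AS score
  FROM Index
 WHERE word IN ('apples', 'oranges')
 GROUP BY docId
 ORDER BY score DESC;
```

but I don't see how to generalize that to arbitrarily nested expressions.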

Thanks for your advice
Tomas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[SQL] problem comparing strings when different cluster / database encoding

2006-04-05 Thread Tomas Vondra
Greetings,

I've encountered a strange problem. We have a PG 8.0.x database cluster
(in the sense used in initdb, i.e. a bunch of databases) created with
UNICODE encoding, namely the cs_CZ.UTF-8 locale.

When a database is created with a different encoding (in our case
LATIN2), string comparison doesn't work correctly. For example the query

   SELECT 'ě' = 'é';

returns 'true', which is obviously incorrect, as those two letters have
different accents (I hope you can see that). And of course, it's not
possible to create a unique index (or primary key) over a column of
words (for example in a dictionary), because false collisions are
found, and sorting works in a really strange way too.

If both the cluster and the database use the same encoding (UNICODE or
LATIN2), everything works fine.

Below is a short description of how the database cluster and the
databases have been created.

--
$ export LANG="cs_CZ.UTF-8"
$ initdb ... (cluster created with UNICODE encoding, cs_CZ.UTF-8 locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'true', which is incorrect)
--

If we create the cluster with LATIN2 encoding (or, conversely, the
database with UNICODE encoding), everything works fine.
For example the following works as expected.

--
$ export LANG="cs_CZ" (thus the ISO-8859-2 encoding is used)
$ initdb ... (cluster created with LATIN2 encoding, cs_CZ locale)
$ ... (postgres started, users created, etc.)
$ createdb -E LATIN2 my_database;
$ psql my_database;
> SELECT 'ě' = 'é'; (returns 'false', which is correct)
--

I've been trying to solve this for several days, but unsuccessfully. Is
there something I've missed? Some obvious solution I don't see?
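For reference, this is how I check the per-database encodings and the
cluster-wide collation locale (pg_encoding_to_char() should be available
in 8.0):

```sql
-- encoding of each database in the cluster
SELECT datname, pg_encoding_to_char(encoding) AS encoding
  FROM pg_database;

-- collation locale of the whole cluster (fixed at initdb time)
SHOW lc_collate;
```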

The question is why we need different encodings for the cluster and the databases.

 (a) Until recently we've used LATIN2 cluster and LATIN2 databases (and
 applications expecting LATIN2 encoding) - that's the reason why we
 need LATIN2 databases.

 (b) On the other hand, some of the new clients want to 'internationalize'
 their applications, so we need a UNICODE infrastructure too - that's
 the reason why we use a UNICODE cluster and databases.

I've come across the nls_string function - with it, everything works
fine, but that's not an option for us, as it would require rewriting all
the SQL queries in the applications (and that's something we don't want).

Thanks for your suggestions
Tomas



Re: [SQL] help with pagila

2006-09-01 Thread Tomas Vondra
> But, when I add another column on select, like, film_description, I get
> the following error:
> 
> "ERROR:  column "film.description" must appear in the GROUP BY clause or
> be used in an aggregate function"
> 
> If I put that column on GROUP BY everything works ok. But I want
> understant why do I need to do that. Can someone teach me, please?

The reason is pretty simple - GROUPing actually means "sorting the rows
into boxes by the values in the columns after the GROUP BY keyword" (and
then applying some aggregate function to each box - COUNT, AVG, etc.).
Besides these aggregates, you can select a column that is 'constant'
within each of the boxes, that is, all the rows in a box have the same
value in this column.

That's the case in the first SQL query you've posted - you're grouping
by 'film_id', so all the rows in a box have the same value in this
column, and thanks to this you can reference the column in the SELECT.

But in the second query you'd like to select another column (directly,
not through an aggregate function) - 'title'. There could be a different
value in each row of a box (PostgreSQL doesn't know that the ID uniquely
identifies the title). For example, imagine grouping by 'published_year'
instead of 'film_id' - in that case there would be many different movies
in the same box, so it's impossible to select a single 'title' for all
of them.
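To illustrate (still using the hypothetical 'published_year' column from
the example above), every output column is either grouped on or wrapped
in an aggregate:

```sql
-- one row per box; 'title' varies inside a box, so it must be aggregated
SELECT published_year,
       COUNT(*)   AS number_of_films,
       MIN(title) AS alphabetically_first_title
  FROM film
 GROUP BY published_year;
```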

There are two ways to solve this:

1) add the 'title' to the GROUP BY clause, thus all the rows in a box
   have the same value of 'title' (and postgresql knows about that)

2) use a subselect

   SELECT
      film_id,
      (SELECT title FROM film WHERE film_id = film_outer.film_id) AS title
   FROM film AS film_outer
   ...
   GROUP BY film_id;

Tomas




Re: [SQL] help with pagila

2006-09-01 Thread Tomas Vondra
> So I can assume that the MySQL implementation is strange? (It accepts
> that kind of query)

Yes, MySQL behaves strangely in this case (as in several other cases).
I wouldn't rely on it, as it can probably choose a different value each
time (although as far as I remember I haven't actually seen that happen).

t.v.




[SQL] implementing (something like) UNIQUE constraint using PL/pgSQL

2007-01-26 Thread Tomas Vondra

Hello,

in our application we need to implement a constraint that enforces 'at 
most N rows with this value' - that is, we have a table with a 'flag' 
column, and for each value there should be at most 10 rows (for example; 
the exact number does not matter).


I'm trying to implement a PL/pgSQL trigger to enforce this constraint, 
but I'm not sure my solution is 'the best one possible'. The first 
solution I came up with is this:


=

CREATE OR REPLACE FUNCTION at_most() RETURNS trigger AS $$
DECLARE
v_cnt INTEGER;
p_cnt INTEGER;
BEGIN

   IF TG_NARGS >= 1 THEN
  p_cnt := TG_ARGV[0]::integer;
   ELSE
  p_cnt := 1;
   END IF;

   SELECT COUNT(*) INTO v_cnt FROM my_table WHERE flag = NEW.flag;
   IF v_cnt > p_cnt THEN
   RAISE EXCEPTION 'Too many rows with this flag!';
   END IF;

   RETURN NEW;

END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER at_most AFTER INSERT OR UPDATE ON my_table FOR EACH ROW 
EXECUTE PROCEDURE at_most(10);


=

But that obviously does not work, as two sessions can reach the SELECT 
INTO statement at the same time (neither seeing the other's uncommitted 
row), so there can end up being more than 'p_cnt' rows with the same value.


Then I used a 'FOR UPDATE' lock on a separate 'lock table' that already 
contains all possible values of 'flag' (in reality the trigger tries to 
insert the value and catches the 'duplicate key' exception, but that's 
not important here). The trigger is then


=

CREATE OR REPLACE FUNCTION at_most() RETURNS trigger AS $$
DECLARE
v_cnt INTEGER;
p_cnt INTEGER;
BEGIN

   IF TG_NARGS >= 1 THEN
  p_cnt := TG_ARGV[0]::integer;
   ELSE
  p_cnt := 1;
   END IF;

   PERFORM flag FROM lock_table WHERE flag = NEW.flag FOR UPDATE;

   SELECT COUNT(*) INTO v_cnt FROM my_table WHERE flag = NEW.flag;
   IF v_cnt > p_cnt THEN
   RAISE EXCEPTION 'Too many rows with this flag!';
   END IF;

   RETURN NEW;

END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER at_most AFTER INSERT OR UPDATE ON my_table FOR EACH ROW 
EXECUTE PROCEDURE at_most(10);


=

This works (as far as I know), but I'm not sure it's the best possible 
solution - for example, I'd like to get rid of the lock table. Is there 
some 'more elegant' way to do this?
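One idea I've been wondering about (just a sketch - it assumes a server
with advisory locks; pg_advisory_lock appeared in 8.2, and the
transaction-scoped pg_advisory_xact_lock used below in 9.1) is to
serialize on a hash of the flag value instead of on a lock-table row:

```sql
CREATE OR REPLACE FUNCTION at_most() RETURNS trigger AS $$
DECLARE
   v_cnt INTEGER;
   p_cnt INTEGER;
BEGIN

   IF TG_NARGS >= 1 THEN
      p_cnt := TG_ARGV[0]::integer;
   ELSE
      p_cnt := 1;
   END IF;

   -- serialize concurrent modifications of rows with this flag value;
   -- the lock is released automatically at transaction end
   PERFORM pg_advisory_xact_lock(hashtext(NEW.flag::text));

   SELECT COUNT(*) INTO v_cnt FROM my_table WHERE flag = NEW.flag;
   IF v_cnt > p_cnt THEN
      RAISE EXCEPTION 'Too many rows with this flag!';
   END IF;

   RETURN NEW;

END;
$$ LANGUAGE plpgsql;
```

That would remove the lock table entirely, at the cost of (very unlikely)
hash collisions causing two unrelated flag values to block each other.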


Tomas

PS: Is there some up to date 'best practices' book related to PL/pgSQL?
All the books I've found on Amazon are pretty old (about 5 years) or are
aimed at 'beginners' or at different areas of development (PHP, etc.)



Re: [SQL] Differentiate Between Zero-Length String and NULLColumn Values

2007-01-30 Thread Tomas Vondra
> On Tue, Jan 30, 2007 at 02:38:07PM +0100, Bart Degryse wrote:
> 
>> Andrew, I think you're wrong stating that Oracle would interpret
>> NULL and empty string as equal. The Oracle databases I use (8, 9
>> and 10) certainly make a distinction between both values. Maybe
>> earlier versions did so, that I don't know.
> 
> Hmm.  Well, I'm not an Oracle guy, so I don't really know.  All I
> know is that we occasionally get people coming from Oracle who are
> surprised by this difference.  What I've been _told_ is that '' and
> NULL are under some circumstances (maybe integers?) the same thing,
> whereas of course ' ' and NULL are not.  But since I'm not an Oracle
> user, people should feel free to ignore me :)

I've recently read some books on Oracle, so probably the best thing I
can do is quote a paragraph on this from O'Reilly's "Oracle PL/SQL
Programming":

In Oracle SQL and PL/SQL, a null string is 'usually' indistinguishable
from a literal of zero characters, represented literally as ''. For
example, the following expression will evaluate to TRUE both in SQL and
PL/SQL:

   '' IS NULL

Assigning a zero-length string to a VARCHAR2(n) variable in PL/SQL also
yields a NULL result:

   DECLARE
  str VARCHAR2(1) := '';
   BEGIN
  IF str IS NULL   -- will be TRUE

This behavior is consistent with Oracle's treatment of VARCHAR2 table
columns.

...

These examples illustrate Oracle's partial adherence to the 92 and 99
versions of the ANSI SQL standard, which mandates a difference between a
zero-length string and a NULL string. Oracle admits the difference, and
says they may fully adopt the standard in the future. They've been
issuing that warning for about 10 years, though, and it hasn't happened yet.

...

Note: This does not apply to the CHAR(n) columns - these are
  blank-padded.
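For contrast, a quick check in PostgreSQL (which follows the standard
here - '' and NULL are distinct values):

```sql
SELECT '' IS NULL;             -- false: the empty string is a value
SELECT NULL::varchar IS NULL;  -- true
SELECT length('');             -- 0, not NULL
```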

Tomas
