Re: [HACKERS] Strange errors from 9.2.1 and 9.2.2 (I hope I'm missing something obvious)

2012-12-16 Thread Dan Scott
On Dec 11, 2012 9:28 PM, David Gould da...@sonic.net wrote:

 Thank you. I got the example via cut and paste from email and pasted it
 into psql on different hosts. od tells me it ends each line with:

   \n followed by 0xC2 0xA0 and then normal spaces. The C2A0 thing is
   apparently NO-BREAK SPACE. Invisible, silent, odorless but still deadly.

 Which will teach me not to accept text files from the sort of people who
 write code in Word I guess.

It's not just Word... I was bitten by this last week by a WYSIWYG HTML
widget I was using to write some documentation. When I copied the examples
I had created out of said environment during a final technical accuracy
pass and they failed to run in psql, I panicked for a few minutes.

I eventually determined that, rather than just wrapping my code in <pre>
tags, the widget had created &nbsp; entities that were faithfully converted
into Unicode non-breaking spaces in the psql input.
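
A quick way to catch this before pasting into psql is to scan the text for U+00A0. The following is only a minimal sketch in Python; the function names and the sample string are made up for illustration and are not part of any PostgreSQL tooling.

```python
# NO-BREAK SPACE is U+00A0, which UTF-8 encodes as the bytes 0xC2 0xA0.
NBSP = "\u00a0"


def find_nbsp(text):
    """Return 1-based (line, column) pairs for every NO-BREAK SPACE in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((lineno, col))
    return hits


def clean_nbsp(text):
    """Replace every NO-BREAK SPACE with an ordinary space."""
    return text.replace(NBSP, " ")


# Hypothetical pasted snippet: the second line starts with an NBSP.
sample = "SELECT 1\n\u00a0  FROM t;"
print(find_nbsp(sample))
print(repr(clean_nbsp(sample)))
```

Note that clean_nbsp() also rewrites non-breaking spaces inside string literals, so it is only safe when the SQL is not supposed to contain any.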


Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 john.knight...@gmail.com wrote:
 When using to_tsvector, a number of newer Unicode characters and PUA
 characters are not included. How do I add the characters that I want to
 be found?

I've just started digging into this code a bit, but from what I've
found src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around USE_WIDE_UPPER_LOWER (in effect if
HAVE_WCSTOMBS and HAVE_TOWLOWER are defined) to complicate testing
scenarios :)

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?

-- 
Dan Scott
Laurentian University


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch, normalize search_path in index

2012-09-30 Thread Dan Scott
On Fri, Sep 28, 2012 at 1:40 PM, Karl O. Pinc k...@meme.com wrote:
 Hi,

 The attached patch (against git head)
 normalizes search_path as the thing indexed
 and uses a secondary index term to distinguish
 the configuration parameter from the run-time
 setting.

Makes sense to me, although I suspect the conceptual material is
better served by the "search path"-the-concept index entry and the
reference material by the "search_path configuration parameter" entry
(so, from that perspective, perhaps the patch should just remove
the search_path index entry from the DDL schemas conceptual
section).

 search path the concept remains distinguished
 in the index from search_path the setting/config param.
 It's hard to say whether it's useful to make this
 distinction.

I think that indexing "search path"-the-concept is useful for
translations, and the Japanese translation includes an index (I
couldn't find the index for the French translation).

-- 
Dan Scott
Laurentian University




Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
john.knight...@gmail.com wrote:
 Dear Dan,

 thank you for your reply.

 The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
 a UTF-8 locale.

 A short five-line dictionary file is sufficient to test:

 raeuz
 我们
 昭厵
 꽖떂
 撘䮬

 line 1 raeuz: Zhuang word written using English letters; shows up
 under to_tsvector OK
 line 2 我们: everyday Chinese word; shows up under to_tsvector OK
 line 3 昭厵: Zhuang word written using rather old Chinese characters
 found in Unicode 3.1, which came in about the year 2000; shows up
 under to_tsvector OK
 line 4 꽖떂: Zhuang word written using rather old Chinese characters
 found in Unicode 5.2, which came in about the year 2009; does not show
 up under to_tsvector
 line 5 撘䮬: Zhuang word written using rather old Chinese characters
 found in the PUA area of the font Sawndip.ttf; does not show up under
 to_tsvector (the font can be downloaded from
 http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

 The last two words, even though included in a dictionary, are not
 accepted by to_tsvector.

Hmm. On Fedora 17 x86-64 with PostgreSQL 9.1.5 here, the last word seems
to work using the default text search configuration (with one crucial
caveat: I created the database with the --lc-ctype=C --lc-collate=C
options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('撘䮬');
                              ts_debug
--------------------------------------------------------------------
 (word,"Word, all letters",撘䮬,{english_stem},english_stem,{撘䮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('撘䮬');
             ts_debug
-----------------------------------
 (blank,"Space symbols",撘䮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
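
To narrow down which characters are likely to be affected, it can help to look at each character's Unicode general category. What follows is a rough sketch in Python, not a reproduction of what the parser does: it assumes that a UTF-8 locale's character classification roughly tracks the general category, and Python's bundled Unicode tables may well be newer than the ones your libc uses. The function names are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (char, 'U+XXXX', general category) for each character in text."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


def non_letters(text):
    """Return the rows from describe() whose category is not a letter (L*)."""
    return [row for row in describe(text) if not row[2].startswith("L")]


# U+E000 is the first code point of the Private Use Area (category Co).
for row in describe("a\ue000"):
    print(row)
```

Anything that non_letters() reports as Co (private use) or Cn (unassigned) is a candidate for being treated as a non-letter under a UTF-8 locale, while under LC_CTYPE=C the parser comments quoted earlier in this thread say any non-ASCII symbol is taken as alphabetic.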




Re: [HACKERS] plpgsql gram.y make rule

2012-09-25 Thread Dan Scott
On Mon, Sep 24, 2012 at 10:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 I wanted to refactor the highly redundant flex and bison rules
 throughout the source into common pattern rules.  (Besides saving some
 redundant code, this could also help some occasionally flaky code in
 pgxs modules.)  The only outlier that breaks this is in plpgsql

 pl_gram.c: gram.y

 I would like to either rename the intermediate file(s) to gram.{c,h}, or
 possibly rename the source file to pl_gram.y.  Any preferences or other
 comments?

 Hmmm ... it's annoyed me for a long time that that file is named the
 same as the core backend's gram.y.  So renaming to pl_gram.y might be
 better.  On the other hand I have very little confidence in git's
 ability to preserve change history if we do that.  Has anyone actually
 done a file rename in a project with lots of history, and how well did
 it turn out?  (For instance, does git blame still provide any useful
 tracking of pre-rename changes?  If you try to cherry-pick a patch
 against the new file into a pre-rename branch, does it work?)

git handles renaming just fine with cherry-picks, no special options
necessary. (Well, there are probably corner cases, but it's code,
there are always corner cases!)

For git log, you'll want to add the --follow parameter if you're
asking for the history of a specific file beyond a renaming event
(--follow works only for a single file, not for a directory).

git blame will show you the commit that renamed the file, by default,
but then you can request the revision prior to that using the commit
hash || '^', for example:

git blame 2fb6cc90^ -- src/backend/parser/gram.y

to work your way back through history.
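
If you end up doing this repeatedly, the two invocations are easy to wrap. This is just a sketch in Python that builds the argument lists for the git commands mentioned above; the helper names are invented here, and run() assumes git is on the PATH and that repo points at a checkout.

```python
import subprocess


def log_follow_cmd(path):
    """History of a single file, continuing past renames."""
    return ["git", "log", "--follow", "--oneline", "--", path]


def blame_before_cmd(commit, path):
    """Blame the file as of the parent of the given commit (commit^)."""
    return ["git", "blame", f"{commit}^", "--", path]


def run(cmd, repo="."):
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(
        cmd, cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


# Same commit hash and path as in the example above.
print(" ".join(blame_before_cmd("2fb6cc90", "src/backend/parser/gram.y")))
```

Passing the path after "--" keeps git from mistaking it for a revision name.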

-- 
Dan Scott
Laurentian University




[HACKERS] Doc typo: lexems -> lexemes

2012-09-11 Thread Dan Scott
I ran across a minor typo while reviewing the full-text search
documentation. Attached is a patch to address the one usage of "lexems"
in a sea of "lexemes".

diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
new file mode 100644
index 978aa54..5305198
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** ts_rank(optional replaceable class=P
*** 867,873 ****
  
        <listitem>
         <para>
!         Ranks vectors based on the frequency of their matching lexems.
         </para>
        </listitem>
       </varlistentry>
--- 867,873 ----
  
        <listitem>
         <para>
!         Ranks vectors based on the frequency of their matching lexemes.
         </para>
        </listitem>
       </varlistentry>
