Re: [BUGS] Bug with Tsearch and tsvector

2010-05-01 Thread Jasen Betts
On 2010-04-29, Tom Lane t...@sss.pgh.pa.us wrote: Jasen Betts ja...@xnet.co.nz writes: \ is popular in URIs on some platforms, or is URI a different beast I hope not, because \ is explicitly disallowed by both the older and newer versions of that RFC. I should have known better than to

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-29 Thread Jasen Betts
On 2010-04-26, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Tom Lane t...@sss.pgh.pa.us wrote: From the RFC: | control = <US-ASCII coded characters 00-1F and 7F hexadecimal> | space = <US-ASCII coded character 20 hexadecimal> | delims = "<" | ">" | "#" | "%" | <"> | unwise = "{" | "}"

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-29 Thread Tom Lane
Jasen Betts ja...@xnet.co.nz writes: \ is popular in URIs on some platforms, or is URI a different beast I hope not, because \ is explicitly disallowed by both the older and newer versions of that RFC. I did think of proposing that we allow \ and : in FilePath, which is currently pretty

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: We'd probably not want to apply this as-is, but should first tighten up what characters URLPath allows, per Kevin's spec research. If we're headed that way, I figured I should double-check. The RFC I referenced was later obsoleted by:

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane t...@sss.pgh.pa.us wrote: We'd probably not want to apply this as-is, but should first tighten up what characters URLPath allows, per Kevin's spec research. If we're headed that way, I figured I should double-check. The RFC I

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane t...@sss.pgh.pa.us wrote: We'd probably not want to apply this as-is, but should first tighten up what characters URLPath allows, per Kevin's spec research. If we're headed that way, I figured I

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote: I'll read this RFC closely and follow up later today. For anyone not clear on what a URI is compared to a URL, every URL is also a URI (but not the other way around): A URI can be further classified as a locator, a name, or both. The term

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: I think that we should accept all the above characters (reserved and unreserved) and the percent character (since it is the escape character) as part of a URL. Check. I don't know whether we should try to extract components of the URL, but

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-27 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: reserved = gen-delims / sub-delims gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" I think that we should
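The character classes discussed in this message can be sketched in a few lines of Python. This is only an illustration, not the tsearch parser's actual code: the sets are transcribed from the RFC 3986 grammar, `URL_CHARS` follows the proposal to accept reserved and unreserved characters plus the percent escape, and the helper name `url_char_prefix` is hypothetical.

```python
import string

# Character sets transcribed from the RFC 3986 grammar.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Proposal from the thread: reserved + unreserved + the escape character '%'.
URL_CHARS = GEN_DELIMS | SUB_DELIMS | UNRESERVED | {"%"}


def url_char_prefix(text: str) -> str:
    """Return the longest leading run of `text` made only of URL_CHARS.

    Hypothetical helper for illustration; it does not validate URL structure.
    """
    end = 0
    while end < len(text) and text[end] in URL_CHARS:
        end += 1
    return text[:end]


# '<' is outside the set, so a scan stops before a following HTML tag.
print(url_char_prefix("/docs/page.html</span>"))
```

Under this set, characters such as `<`, `>`, and `\` end the run, while `%` and `#` are kept.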

[BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Donald Fraser
PostgreSQL 8.3.10 (on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)) OS: Linux Redhat EL 5.4 Database encoding: LATIN9 Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type. The fail condition is shown

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
Donald Fraser postg...@kiwi-fraser.net writes: Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type. ts_debug shows that it's being parsed like this: alias | description | token

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: ie the critical point seems to be that url_path is willing to soak up a string containing < and >, so the span tags don't get recognized as separate lexemes. While that's obviously the wrong thing in this particular example, I'm not sure if it's the wrong

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane t...@sss.pgh.pa.us wrote: ie the critical point seems to be that url_path is willing to soak up a string containing < and >, so the span tags don't get recognized as separate lexemes. While that's obviously the wrong thing in this

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
I wrote: Donald Fraser postg...@kiwi-fraser.net writes: Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type. ts_debug shows that it's being parsed like this: alias | description | token

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: Hmm, thanks for the reference, but I'm not sure this is specifying quite what we want to get at. In particular I note that it excludes '%' on the grounds that that ought to be escaped, so I guess this is specifying the characters allowed in an underlying

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote: there's a potential compatibility issue here, so my thought is to apply this only in HEAD. Agreed. -Kevin

Re: [BUGS] Bug with Tsearch and tsvector

2010-04-26 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes: Hmm. Having typed that, I'm staring at the # character, which is used to mark off an anchor within an HTML page identified by the URL. Should we consider the # and anchor part of a URL? Yeah, I would think so. This discussion is making me
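As a point of comparison for the question about `#`, Python's standard `urllib.parse.urlsplit` treats the anchor as a distinct fragment component of the URL rather than as trailing text. The example URL below is made up for illustration, and this says nothing about how the tsearch parser itself tokenizes.

```python
from urllib.parse import urlsplit

# Made-up example URL with an anchor after '#'.
parts = urlsplit("http://example.com/docs/page.html#section-2")

print(parts.path)      # /docs/page.html
print(parts.fragment)  # section-2
```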