Re: [HACKERS] old bug in full text parser

2016-02-10 Thread Mike Rylander
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov  wrote:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its parts,
> and to ignore the hyphen. That is how tsearch2 works.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
> from ones made of plain text words like 'four-dot': the hyphenated word
> itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>

The Evergreen project has long depended on tsearch2 (both as an
extension and as in-core FTS), and one thing we've struggled with is date
range parsing, such as birth and death years for authors in the form of
1979-2014.  Strings like that end up being parsed as two lexemes,
"1979" and "-2014".  We work around this by pre-normalizing strings
matching /(\d+)-(\d+)/ into two numbers separated by a space instead of
a hyphen, but if fixing this bug would remove the need for such a
preprocessing step, it would be a great help to us.  Would such strings
be parsed "properly" into lexemes of the form "1979" and "2014" with
your proposed change?
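
For illustration only, the pre-normalization step described above might
look roughly like the following Python sketch. The function name is
hypothetical and this is not Evergreen's actual code; it simply mirrors
the /(\d+)-(\d+)/ substitution mentioned above.

```python
import re

# Same pattern as the workaround described above: digits, hyphen, digits.
_NUMERIC_RANGE = re.compile(r"(\d+)-(\d+)")


def split_numeric_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a space.

    Hypothetical helper: the text would be passed through this before
    being handed to the full text parser, so that the second number is
    not read as a signed integer.
    """
    return _NUMERIC_RANGE.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_numeric_ranges("Smith, John, 1979-2014"))
```

Hyphenated words without digits on both sides of the hyphen (for example
'dot-four' or 'dot-4') are left untouched by this substitution.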

Thanks!

--
Mike Rylander

> After  investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane 
> Date:   Sat Oct 27 19:03:45 2007 +
>
> Change text search parsing rules for hyphenated words so that digit strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent part-by-part
> reparsing states so that we don't get different answers about how much text
> is part of the hyphenated word.  Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
> Regards,
> Oleg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] old bug in full text parser

2016-02-10 Thread Tom Lane
Oleg Bartunov  writes:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its parts,
> and to ignore the hyphen. That is how tsearch2 works.

> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
> from ones made of plain text words like 'four-dot': the hyphenated word
> itself is not reported.

> I think we should consider this a bug and produce a fix for all supported
> versions.

I don't see anything here that looks like a bug, more like a definition
disagreement.  As such, I'd be pretty dubious about back-patching a
change.  But it's hard to debate the merits when you haven't said exactly
what you'd do instead.

I believe the commit you mention was intended to fix this inconsistency:

http://www.postgresql.org/message-id/6269.1193184...@sss.pgh.pa.us

so I would be against simply reverting it.  In any case, the examples
given there make it look like there was already inconsistency about mixed
words and numbers.  Do we really think that "4-dot" should be considered
a hyphenated word?  I'm not sure.

regards, tom lane




Re: [HACKERS] old bug in full text parser

2016-02-10 Thread Oleg Bartunov
On Wed, Feb 10, 2016 at 7:21 PM, Tom Lane  wrote:

> Oleg Bartunov  writes:
> > It looks like there is a very old bug in the full text parser (somebody
> > pointed it out to me), which appeared after tsearch2 was moved into core.
> > The problem is in how the full text parser processes hyphenated words. Our
> > original idea was to report the hyphenated word itself as well as its
> > parts, and to ignore the hyphen. That is how tsearch2 works.
>
> > This behaviour changed after tsearch2 was moved into core:
> > 1. The hyphen is now reported by the parser, which is useless.
> > 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> > differently from ones made of plain text words like 'four-dot': the
> > hyphenated word itself is not reported.
>
> > I think we should consider this a bug and produce a fix for all
> > supported versions.
>
> I don't see anything here that looks like a bug, more like a definition
> disagreement.  As such, I'd be pretty dubious about back-patching a
> change.  But it's hard to debate the merits when you haven't said exactly
> what you'd do instead.
>

Yeah, it is better to call it an inconsistency rather than a bug. We
definitely should work on a better, "consistent" parser with predictable
behaviour.


>
> I believe the commit you mention was intended to fix this inconsistency:
>
> http://www.postgresql.org/message-id/6269.1193184...@sss.pgh.pa.us
>
> so I would be against simply reverting it.  In any case, the examples
> given there make it look like there was already inconsistency about mixed
> words and numbers.  Do we really think that "4-dot" should be considered
> a hyphenated word?  I'm not sure.
>

I agree that we shouldn't just revert it.  My idea is to work on a new
parser and leave the old one as is for compatibility reasons. Fortunately,
FTS is flexible enough that we could add a new parser at any time as an
extension.



>
> regards, tom lane
>


Re: [HACKERS] old bug in full text parser

2016-02-10 Thread Oleg Bartunov
On Wed, Feb 10, 2016 at 7:45 PM, Mike Rylander  wrote:

> On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov wrote:
> > It looks like there is a very old bug in the full text parser (somebody
> > pointed it out to me), which appeared after tsearch2 was moved into core.
> > The problem is in how the full text parser processes hyphenated words. Our
> > original idea was to report the hyphenated word itself as well as its
> > parts, and to ignore the hyphen. That is how tsearch2 works.
> >
> > This behaviour changed after tsearch2 was moved into core:
> > 1. The hyphen is now reported by the parser, which is useless.
> > 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed
> > differently from ones made of plain text words like 'four-dot': the
> > hyphenated word itself is not reported.
> >
> > I think we should consider this a bug and produce a fix for all
> > supported versions.
> >
> >
>
> The Evergreen project has long depended on tsearch2 (both as an
> extension and as in-core FTS), and one thing we've struggled with is date
> range parsing, such as birth and death years for authors in the form of
> 1979-2014.  Strings like that end up being parsed as two lexemes,
> "1979" and "-2014".  We work around this by pre-normalizing strings
> matching /(\d+)-(\d+)/ into two numbers separated by a space instead of
> a hyphen, but if fixing this bug would remove the need for such a
> preprocessing step, it would be a great help to us.  Would such strings
> be parsed "properly" into lexemes of the form "1979" and "2014" with
> your proposed change?
>
>
I'd love to treat all hyphenated "words" in one way, regardless of what
"a word" is (a number or plain text); namely, 'w1-w2' should be reported
as {'w1-w2', 'w1', 'w2'}. The problem is in the definition of "word".
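
To make the intended rule concrete, here is a rough Python sketch. It is
not the PostgreSQL parser and not a proposed patch; the function name is
hypothetical, and it assumes, purely for illustration, that a "word" is
any run of letters or digits.

```python
import re

# Assumption for this sketch only: a "word" is a run of letters and/or
# digits. A hyphenated word is two or more such runs joined by hyphens.
_TOKEN = re.compile(r"[^\W_]+(?:-[^\W_]+)+|[^\W_]+")


def uniform_hyphen_tokens(text: str) -> list[str]:
    """Report 'w1-w2' as 'w1-w2', 'w1', 'w2', whatever w1 and w2 are.

    The hyphen itself is never reported as a token.
    """
    tokens: list[str] = []
    for match in _TOKEN.finditer(text):
        token = match.group(0)
        tokens.append(token)
        if "-" in token:
            # Report each part of the hyphenated word after the whole.
            tokens.extend(token.split("-"))
    return tokens


if __name__ == "__main__":
    for sample in ("dot-four", "dot-4", "4-dot", "1979-2014"):
        print(sample, "->", uniform_hyphen_tokens(sample))
```

Under that assumed definition, 'dot-four', 'dot-4', '4-dot' and
'1979-2014' are all handled the same way, which is the consistency being
asked for; a real parser would still have to decide what counts as a
"word".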


We'll definitely look at the parser again. Fortunately, we could just fork
the default parser and develop a new one so as not to break compatibility.
You have a chance to help us produce a "consistent" view of which tokens the
new parser should recognize and how it should process them.





> Thanks!
>
> --
> Mike Rylander
>


Re: [HACKERS] old bug in full text parser

2016-02-10 Thread Oleg Bartunov
On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov  wrote:

> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its parts,
> and to ignore the hyphen. That is how tsearch2 works.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
> from ones made of plain text words like 'four-dot': the hyphenated word
> itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
> After  investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane 
> Date:   Sat Oct 27 19:03:45 2007 +
>
> Change text search parsing rules for hyphenated words so that digit strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent part-by-part
> reparsing states so that we don't get different answers about how much text
> is part of the hyphenated word.  Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>

Oh, one more bug, which existed even in tsearch2.

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)




>
> Regards,
> Oleg
>


[HACKERS] old bug in full text parser

2016-02-10 Thread Oleg Bartunov
It looks like there is a very old bug in the full text parser (somebody
pointed it out to me), which appeared after tsearch2 was moved into core.
The problem is in how the full text parser processes hyphenated words. Our
original idea was to report the hyphenated word itself as well as its parts,
and to ignore the hyphen. That is how tsearch2 works.

This behaviour changed after tsearch2 was moved into core:
1. The hyphen is now reported by the parser, which is useless.
2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
from ones made of plain text words like 'four-dot': the hyphenated word
itself is not reported.

I think we should consider this a bug and produce a fix for all supported
versions.

After investigation we found this commit:

commit 73e6f9d3b61995525785b2f4490b465fe860196b
Author: Tom Lane 
Date:   Sat Oct 27 19:03:45 2007 +

Change text search parsing rules for hyphenated words so that digit strings
containing decimal points aren't considered part of a hyphenated word.
Sync the hyphenated-word lookahead states with the subsequent part-by-part
reparsing states so that we don't get different answers about how much text
is part of the hyphenated word.  Per my gripe of a few days ago.


8.2.23

select tok_type, description, token from ts_debug('dot-four');
  tok_type   |          description          |  token
-------------+-------------------------------+----------
 lhword      | Latin hyphenated word         | dot-four
 lpart_hword | Latin part of hyphenated word | dot
 lpart_hword | Latin part of hyphenated word | four
(3 rows)

select tok_type, description, token from ts_debug('dot-4');
  tok_type   |          description          | token
-------------+-------------------------------+-------
 hword       | Hyphenated word               | dot-4
 lpart_hword | Latin part of hyphenated word | dot
 uint        | Unsigned integer              | 4
(3 rows)

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)

8.3.23

select alias, description, token from ts_debug('dot-four');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 asciihword      | Hyphenated word, all ASCII      | dot-four
 hword_asciipart | Hyphenated word part, all ASCII | dot
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | four
(4 rows)

select alias, description, token from ts_debug('dot-4');
   alias   |   description   | token
-----------+-----------------+-------
 asciiword | Word, all ASCII | dot
 int       | Signed integer  | -4
(2 rows)

select alias, description, token from ts_debug('4-dot');
   alias   |   description    | token
-----------+------------------+-------
 uint      | Unsigned integer | 4
 blank     | Space symbols    | -
 asciiword | Word, all ASCII  | dot
(3 rows)


Regards,
Oleg