Re: [patch] improved equivalent classes in regular expressions

2013-01-24 Christian Brabandt
Hi Tony!

On Thu, 24 Jan 2013, Tony Mechelynck wrote:

> What do you mean, "the Swiss may think otherwise"? IIUC, in the
> de_CH standard the eszett is not used, it is always replaced by ss,
> because the Swiss have no room for it on their trilingual (well,
> quadrilingual, even) typewriter keyboards. Hence the well-known slur
> against them:
> 
> — Wie trinken die Schweizer Bier?
>   ("How do the Swiss drink beer?")
> — In Masse.
>   ("massively", where for any other German-speaking country except
> maybe Liechtenstein it would of course be "in Maße", "in
> moderation").

I thought the Swiss used to replace ß by sz but that is apparently 
wrong, as you pointed out correctly.

Kind regards
Christian
-- 

-- 
-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php





Re: [patch] improved equivalent classes in regular expressions

2013-01-24 Tony Mechelynck

On 23/01/13 22:08, Christian Brabandt wrote:
> Hi Dominique!
>
> On Mon, 21 Jan 2013, Dominique Pellé wrote:
>
>> You obviously speak better German than me, but isn't the German
>> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
>
> You got me ;)
> Of course eszett is, despite its name, equivalent to ss and that is
> what the standard actually demands (although the Swiss might think
> otherwise). Sorry for the confusion.
>
> regards,
> Christian



What do you mean, "the Swiss may think otherwise"? IIUC, in the de_CH 
standard the eszett is not used, it is always replaced by ss, because 
the Swiss have no room for it on their trilingual (well, quadrilingual, 
even) typewriter keyboards. Hence the well-known slur against them:


— Wie trinken die Schweizer Bier?
("How do the Swiss drink beer?")
— In Masse.
("massively", where for any other German-speaking country except maybe
Liechtenstein it would of course be "in Maße", "in moderation").



Best regards,
Tony.
--
Speak roughly to your little boy,
And beat him when he sneezes:
He only does it to annoy
Because he knows it teases.

Wow!  wow!  wow!

I speak severely to my boy,
And beat him when he sneezes:
For he can thoroughly enjoy
The pepper when he pleases!

Wow!  wow!  wow!
-- Lewis Carroll, "Alice in Wonderland"






Re: [patch] improved equivalent classes in regular expressions

2013-01-24 Christian Brabandt
Hi Joachim!

On Thu, 24 Jan 2013, Joachim Schmitz wrote:

> But still, while ß is equivalent to ss, the opposite is not true:
> only a few ss are equivalent to ß.
> Same for ä, ö, ü and ae, oe, ue: equivalent in one direction but not
> the other.

Indeed, but when we are talking about equivalence classes regarding 
regular expressions, then ss and ß are equal.
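The one-directional mapping discussed above can be seen in a small Python sketch. This is only an illustration using Unicode case folding via `str.casefold()`, not Vim's equivalence-class code; the helper name `fold_equal` is made up for this example.

```python
# Illustration only: Python's str.casefold() folds "ß" to "ss" but never
# turns "ss" back into "ß", and it does not rewrite "ü" as "ue".

def fold_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()

print("ß".casefold())                   # ss
print(fold_equal("Maße", "Masse"))      # True: the two words collapse
print(fold_equal("Grüße", "Gruesse"))   # False: ü is not folded to ue
```

Once both sides are folded, "Maße" and "Masse" can no longer be told apart, which is the sense in which ss and ß count as equal for matching purposes.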

regards,
Christian
-- 
The best part of beauty is that which no picture can express.
-- Francis Bacon






Re: [patch] improved equivalent classes in regular expressions

2013-01-24 Joachim Schmitz

Christian Brabandt wrote:
> Hi Dominique!
>
> On Mon, 21 Jan 2013, Dominique Pellé wrote:
>
>> You obviously speak better German than me, but isn't the German
>> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
>
> You got me ;)
> Of course eszett is, despite its name, equivalent to ss and that is
> what the standard actually demands (although the Swiss might think
> otherwise). Sorry for the confusion.



But still, while ß is equivalent to ss, the opposite is not true: only a
few ss are equivalent to ß.
Same for ä, ö, ü and ae, oe, ue: equivalent in one direction but not the
other.


Bye, Jojo 








Re: [patch] improved equivalent classes in regular expressions

2013-01-23 Christian Brabandt
Hi Dominique!

On Mon, 21 Jan 2013, Dominique Pellé wrote:

> You obviously speak better German than me, but isn't the German
> ess-zett equivalent to ss rather than sz? I'm curious why /sz.

You got me ;)
Of course eszett is, despite its name, equivalent to ss and that is
what the standard actually demands (although the Swiss might think
otherwise). Sorry for the confusion.

regards,
Christian
-- 
Time is what one reads off the clock.
-- Albert Einstein






Re: [patch] improved equivalent classes in regular expressions

2013-01-21 Dominique Pellé
Christian Brabandt wrote:

> Hi Dominique!
>
> On Wed, 16 Jan 2013, Dominique Pellé wrote:
>
>> When using an equivalence class [[=x=]], I realized that what I
>> generally want is to use it on full strings rather than on
>> single characters. Searching for "foobar" with...
>>
>> /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
>>
>> ... works but is rather unpleasant.  I wish there was a flag
>> such as \q to switch on equivalence classes, which would
>> work like \c for case insensitivity. So instead of the above
>> regexp, I could search for:
>>
>> /\qfoobar
>>
>> As far as I know \q is unused in Vim regexp, so
>> that should not break compatibility.
>>
>> Maybe there could also be a function normalize({expr})
>> (any better name?) that given a string with diacritics
>> "fòóbâr" returns "foobar" in a similar way to tolower({expr})
>> which returns a lowercase version of the string.
>>
>> Before I spend time trying to do that, would it be useful
>> and accepted?
>
> Indeed, that looks like a useful addition.

I have no time now for that unfortunately, but maybe in a few weeks.

> I have another idea with regards to equivalence classes:
> When searching for /[[=ß=]] this should translate into /sz. But that is
> more complicated, since a search for /[s][z] wouldn't match ß (eszett)
> anymore.

You obviously speak better German than me, but isn't the German
ess-zett equivalent to ss rather than sz? I'm curious why /sz.

>> Regarding the few characters that are no longer equivalent,
>> I find it odd from a user point of view. For example U+01e4
>> (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
>> to uppercase G but it is no longer equivalent to G.
>> Yet some other letters with stroke are still equivalent.
>> For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
>> is still equivalent to L. It seems inconsistent, even if that's
>> what the ISO standard says. Previous behavior made more
> sense to me for U+01e4 at least.
>
> Fixed with the latest patch.

Yes, I saw that. Thanks!



Re: [patch] improved equivalent classes in regular expressions

2013-01-21 Christian Brabandt
Hi Dominique!

On Wed, 16 Jan 2013, Dominique Pellé wrote:

> When using an equivalence class [[=x=]], I realized that what I
> generally want is to use it on full strings rather than on
> single characters. Searching for "foobar" with...
>
> /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
>
> ... works but is rather unpleasant.  I wish there was a flag
> such as \q to switch on equivalence classes, which would
> work like \c for case insensitivity. So instead of the above
> regexp, I could search for:
>
> /\qfoobar
>
> As far as I know \q is unused in Vim regexp, so
> that should not break compatibility.
>
> Maybe there could also be a function normalize({expr})
> (any better name?) that given a string with diacritics
> "fòóbâr" returns "foobar" in a similar way to tolower({expr})
> which returns a lowercase version of the string.
> 
> Before I spend time trying to do that, would it be useful
> and accepted?

Indeed, that looks like a useful addition.

I have another idea with regards to equivalence classes:
When searching for /[[=ß=]] this should translate into /sz. But that is
more complicated, since a search for /[s][z] wouldn't match ß (eszett)
anymore.

> Regarding the few characters that are no longer equivalent,
> I find it odd from a user point of view. For example U+01e4
> (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
> to uppercase G but it is no longer equivalent to G.
> Yet some other letters with stroke are still equivalent.
> For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
> is still equivalent to L. It seems inconsistent, even if that's
> what the ISO standard says. Previous behavior made more
> sense to me for U+01e4 at least.

Fixed with the latest patch.

Kind regards
Christian
-- 
Alcoholism: poison and antidote are identical.
-- Gerhard Uhlenbruck



Re: [patch] improved equivalent classes in regular expressions

2013-01-17 Bram Moolenaar

Christian Brabandt wrote:

> Bram,
> I recently discovered that using equivalence classes in regular
> expressions did not match all expected characters. Also, I think the
> current implementation does not work as expected, since searching for
> [[=Ä=]] matches only Ä and neither A nor any other A-like character.
> 
> So I looked into the standard¹ and found that apparently not all 
> characters are matched according to it.

I don't think the documentation says that it works according to any
standard.  If we go this way, we need to make sure we are actually using
the right standard for this functionality.

> I wrote a testfile² that contains all character codes that need to match
> for /[[=A=]]. If you search for /[[=A=]]$ you'll see that some
> characters are skipped.
>
> So I threw together a small Vim script³ that parses the given standard
> file and generates a huge switch statement to be used in the function
> reg_equi_class() in regexp.c in the Vim source.
>
> Using this generated code in regexp.c, I created this patch⁴, which
> successfully matches all expected characters from that testfile. It also
> adds equivalence classes for the 10 digits 0-9 (and adds some missing
> equivalence classes, e.g. for 'Q').
>
> However, some characters are now missing from the equivalence classes,
> most notably U+01E4 U+01E5 U+0149 U+0166 U+0167 U+01B5 U+01B6, since they
> are defined to have a different primary weight than their ASCII
> counterparts (G g n T t Z z), so I removed those chars from test44.

Hmm, doesn't this indicate the standard is not right for this purpose?


-- 
   Bravely bold Sir Robin, rode forth from Camelot,
   He was not afraid to die, Oh Brave Sir Robin,
   He was not at all afraid to be killed in nasty ways
   Brave, brave, brave, brave Sir Robin.
 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- b...@moolenaar.net -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\  an exciting new programming language -- http://www.Zimbu.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///



Re: [patch] improved equivalent classes in regular expressions

2013-01-15 Dominique Pellé
Christian Brabandt  wrote:

> Bram,
> I recently discovered that using equivalence classes in regular
> expressions did not match all expected characters. Also, I think the
> current implementation does not work as expected, since searching for
> [[=Ä=]] matches only Ä and neither A nor any other A-like character.
>
> So I looked into the standard¹ and found that apparently not all
> characters are matched according to it.
>
> I wrote a testfile² that contains all character codes that need to match
> for /[[=A=]]. If you search for /[[=A=]]$ you'll see that some
> characters are skipped.
>
> So I threw together a small Vim script³ that parses the given standard
> file and generates a huge switch statement to be used in the function
> reg_equi_class() in regexp.c in the Vim source.
>
> Using this generated code in regexp.c, I created this patch⁴, which
> successfully matches all expected characters from that testfile. It also
> adds equivalence classes for the 10 digits 0-9 (and adds some missing
> equivalence classes, e.g. for 'Q').
>
> However, some characters are now missing from the equivalence classes,
> most notably U+01E4 U+01E5 U+0149 U+0166 U+0167 U+01B5 U+01B6, since they
> are defined to have a different primary weight than their ASCII
> counterparts (G g n T t Z z), so I removed those chars from test44.
>
> regards,
> Christian
>
> ¹) ISO-14651:2012, available for free at
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> If you download the zipfile below "ISO/IEC 14651:2011/Amd 1:2012", it
> contains the full reference table ISO14651_2012_TABLE1_en.txt (and
> should be equivalent to the Unicode standard)
> ²) Attached file A.txt
> ³) Attached file parse_iso14651.vim
> ⁴) Attached file new_equivalent_class.diff

Thanks Christian

I have not tried the patch yet but it looks like a nice improvement.

When using an equivalence class [[=x=]], I realized that what I
generally want is to use it on full strings rather than on
single characters. Searching for "foobar" with...

/[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]

... works but is rather unpleasant.  I wish there was a flag
such as \q to switch on equivalence classes, which would
work like \c for case insensitivity. So instead of the above
regexp, I could search for:

/\qfoobar

As far as I know \q is unused in Vim regexp, so
that should not break compatibility.
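
Until something like \q exists, the long pattern can at least be generated mechanically. A minimal sketch in Python, assuming the search word consists of plain letters; the helper name `equiv_pattern` is made up for this example, and no escaping of regexp metacharacters is attempted.

```python
def equiv_pattern(word: str) -> str:
    """Wrap each letter of `word` in a [[=x=]] equivalence class.

    Non-letters are passed through unchanged (no escaping is done).
    """
    return "".join(f"[[={ch}=]]" if ch.isalpha() else ch for ch in word)


print(equiv_pattern("foobar"))
# [[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
```

The result can be pasted after / in Vim; it is the same pattern as the hand-written one above.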

Maybe there could also be a function normalize({expr})
(any better name?) that given a string with diacritics
"fòóbâr" returns "foobar" in a similar way to tolower({expr})
which returns a lowercase version of the string.

Before I spend time trying to do that, would it be useful
and accepted?
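
To make the normalize() idea concrete, here is a rough sketch of one possible behaviour, written in Python rather than Vim script. It assumes that "removing diacritics" means Unicode canonical decomposition (NFD) followed by dropping combining marks; that is only one interpretation and not necessarily what a Vim implementation would do. The name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("fòóbâr"))  # foobar
```

Letters that have no canonical decomposition, such as U+0141 (L WITH STROKE) or ß, pass through unchanged with this approach, so it would not cover everything an equivalence class can match.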

Regarding the few characters that are no longer equivalent,
I find it odd from a user point of view. For example U+01e4
(LATIN CAPITAL LETTER G WITH STROKE) was equivalent
to uppercase G but it is no longer equivalent to G.
Yet some other letters with stroke are still equivalent.
For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
is still equivalent to L. It seems inconsistent, even if that's
what the ISO standard says. Previous behavior made more
sense to me for U+01e4 at least.

Regards
-- Dominique
