Re: new syntax available

2015-12-31 Thread Dominique Pellé
2015-12-29 22:07 GMT+01:00 Dominique Pellé :

> Daniel Naber  wrote:
>
> > On 2015-10-14 14:01, Dominique Pellé wrote:
> ...
> >> It would also be useful if each group captured in the regexp
> >> could be re-used with \1 \2 \3 etc. (or <match no="1"/> ...) inside
> >> the <message> or <suggestion>.
> >
> > That's possible already.
>
> Hi
>
> Using mark="1" and \1 does not completely work with <regexp>.
>
> I recently added the French rule DENUE_DENUDE but I could
> not get the suggestion to work, so I commented out the
> suggestion for now.  The rule is...
>

[...snip...]


>
> However, suggestions do not work. If I uncomment the suggestion
> (the green part) and comment out the red part, I get this:
>
> $ echo "Un film dénudé de tout intérêt." | java -jar
> ./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar
> -l fr -v
> Expected text language: French
> Working on STDIN...
> 2648 rules activated for language French
> 2648 rules activated for language French
>  Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s]
> de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
> Disambiguator log:
>
> un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
> RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]
>
> N[3]: film[film/N m s] -> film[film/N m s]
> RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]
>
> RB-PREPOSITION[1]: de[de/P] -> de[de/P]
>
> N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]
>
> N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]
>
> 1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
> Message: Confusion probable entre « dénudé » et 'dénudé de tout intérêt'.
> Suggestion: dénudé de tout intérêt
> Un film dénudé de tout intérêt.
> ^^
>
> So the correct word is underlined, but the suggestion
> is incorrect.  I expected the suggestion "dénué"
> but instead I get "dénudé de tout intérêt".
>

I made a change to fix the suggestion in the French rule
DENUE_DENUDE in this commit:

https://github.com/languagetool-org/languagetool/commit/c5aafc71e1a753c8876f7c110a9ccba906aa49c6

However, it's probably a workaround, as I would normally
expect <match no="1"/> in the suggestion to be replaced
by the first capture group, just like \1. Instead, <match no="1"/>
in the suggestion is replaced by the full text that matches the
regexp.

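As an illustration of the difference, here is a minimal Java sketch (plain
java.util.regex, not LanguageTool code; class name made up) using a trimmed-down
version of the DENUE_DENUDE pattern — group(0) is the full regexp match, which is
what the suggestion currently shows, while group(1) is the first capture that \1
and <match no="1"/> would be expected to give:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class CaptureVsFullMatch {
    public static void main(String[] args) {
      // Trimmed-down version of the DENUE_DENUDE regexp; only "dénudé" is captured.
      Pattern p = Pattern.compile("\\b(dénudée?s?) (?:de |d['’])(?:toute?s? )?intérêt\\b",
                                  Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
      Matcher m = p.matcher("Un film dénudé de tout intérêt.");
      if (m.find()) {
        System.out.println(m.group(0)); // "dénudé de tout intérêt" -> what the suggestion shows today
        System.out.println(m.group(1)); // "dénudé" -> the captured word the suggestion should be built from
      }
    }
  }
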
Regards
Dominique
--
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: new syntax available

2015-12-29 Thread Dominique Pellé
Daniel Naber  wrote:

> On 2015-10-14 14:01, Dominique Pellé wrote:
...
>> It would also be useful if each group captured in the regexp
>> could be re-used with \1 \2 \3 etc. (or <match no="1"/> ...) inside
>> the <message> or <suggestion>.
>
> That's possible already.

Hi

Using mark="1" and \1 does not completely work with <regexp>.

I recently added the French rule DENUE_DENUDE but I could
not get the suggestion to work, so I commented out the
suggestion for now.  The rule is...


  \b(?:(dénudée?s?) (?:de |d['´‘’′])(?:toute?s?
)?(?:âme|apitoiement|ambigu[ïi]té|ambition|beauté|cause|charme|charisme|clarté|compassion|compétence|confort|connaissance|conscience|consistance|constance|contenu|contrepartie|crainte|créativité|culture|cynisme|difficulté|discrimination|envergure|intér[êe]t|émotion|esthétique|éthique|enjeux?|expertise|fantaisie|fondement|gentillesse|go[uû]t|grâce|haine|humanité|motif|imagination|inspiration|intention|inventivité|légitimité|logique|objectivité|maturité|méchanceté|mérite|paix|piété|plan|pertinence|peur|plaisir|politesse|préjugé|principe|professionnalisme|psychologie|qualité|raison|réalisme|remord|respect|revendication|rigueur|risque|sagesse|savoir|sens|sentiment|science|scrupule|soupçon|stress|sympathie|tact|talent|tendresse|toxicité|tromperie|valeur|vertu|violence|vision)s?)\b
  
  
  Confusion probable entre « dénudé » et « dénué ».
  http://bdl.oqlf.gouv.qc.ca/bdl/gabarit_bdl.asp?id=2542
  Un film dénudé de tout
intérêt.
  Une personne dénudée de toute
compassion.
  Une histoire dénudée
d’intérêt.
  Des hommes dénudés de
compassion.


The rule above appears to work; it highlights only
the expected captured word \1 (i.e. the word dénudé):

$ echo "Un film dénudé de tout intérêt." | java -jar
./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar
-l fr -v
Expected text language: French
Working on STDIN...
2648 rules activated for language French
2648 rules activated for language French
 Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s]
de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
Disambiguator log:

un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]

N[3]: film[film/N m s] -> film[film/N m s]
RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]

RB-PREPOSITION[1]: de[de/P] -> de[de/P]

N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]

N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]

1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
Message: Confusion probable entre « dénudé » et « dénué ».
Un film dénudé de tout intérêt.
^^

However, suggestions do not work. If I uncomment the suggestion
(the green part) and comment out the red part, I get this:

$ echo "Un film dénudé de tout intérêt." | java -jar
./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar
-l fr -v
Expected text language: French
Working on STDIN...
2648 rules activated for language French
2648 rules activated for language French
 Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s]
de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
Disambiguator log:

un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]

N[3]: film[film/N m s] -> film[film/N m s]
RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]

RB-PREPOSITION[1]: de[de/P] -> de[de/P]

N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]

N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]

1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
Message: Confusion probable entre « dénudé » et 'dénudé de tout intérêt'.
Suggestion: dénudé de tout intérêt
Un film dénudé de tout intérêt.
^^

So the correct word is underlined, but the suggestion
is incorrect.  I expected the suggestion "dénué"
but instead I get "dénudé de tout intérêt".

Regards
Dominique


Re: new syntax available

2015-11-17 Thread Daniel Naber
On 2015-10-07 06:41, Dominique Pellé wrote:

> How about something like this?
> 
>   <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol
> d['’´`‘]oiseau)</regexp>
> 
> ... where the marker="1" attribute indicates to underline the captured
> group #1 in the regexp, i.e. the word "a" in the above example?  What to
> underline could even possibly be a portion of a word.

This has now been implemented, except that the attribute is called "mark".

Regards
  Daniel




RE: new syntax available

2015-11-04 Thread Daniel Naber
On 2015-11-04 15:52, Mike Unwalla wrote:

> The rule gives me a message for the first example, but not for the 
> second
> example:
> 1. Firstly, do the first test and secondly, do the second test.
> 2. Firstly, do the first test. Secondly, do the second test.

<regexp> rules are sentence-based, so it's expected that the second case
doesn't match.

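To make that concrete, a small java.util.regex sketch (illustration only) of what
sentence-based matching means for this rule:

  import java.util.regex.Pattern;

  public class SentenceScope {
    public static void main(String[] args) {
      Pattern p = Pattern.compile("firstly.*secondly", Pattern.CASE_INSENSITIVE);
      // The regexp is applied to each sentence separately:
      System.out.println(p.matcher("Firstly, do the first test.").find());   // false
      System.out.println(p.matcher("Secondly, do the second test.").find()); // false
      // Only when both words occur in the same sentence does it match:
      System.out.println(p.matcher(
          "Firstly, do the first test and secondly, do the second test.").find()); // true
    }
  }
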
Regards
  Daniel




RE: new syntax available

2015-11-04 Thread Mike Unwalla
I have this rule:
  
  firstly.*secondly
This is a test of a regexp rule.
Regexp test
First, do the first test. Second, do the
test again.
Firstly, do the first test and
secondly, do the test again.
Firstly, do the first test.
Secondly, do the test again.
  

Testrules gives no error messages.

The rule gives me a message for the first example, but not for the second
example:
1. Firstly, do the first test and secondly, do the second test.
2. Firstly, do the first test. Secondly, do the second test.

Should the rule find the text in the second example, or is <regexp> designed
to match text only if all the text is within one sentence?

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-Original Message-
From: Daniel Naber [mailto:daniel.na...@languagetool.org] 


you can now use

<regexp>foo</regexp>

But be aware that this is a real regular expression that ignores tokens, 
so it matches anything with the substring 'foo'. 







Re: new syntax available

2015-10-14 Thread Daniel Naber
On 2015-10-14 14:01, Dominique Pellé wrote:

> Thanks for the (?:\s+) change!
> How about... (?:[\s\xA0]+) instead?

Done (written as \u00A0 in the regex).

> Being able to highlight part of the regexp would be useful with
> <regexp>...(...)</regexp>.  Most of the places
> where I'm thinking of using <regexp> would need it.

It's still on my TODO list, but time is very limited.

> It would also be useful if each group captured in the regexp
> could be re-used with \1 \2 \3 etc. (or <match no="1"/> ...) inside
> the <message> or <suggestion>.

That's possible already.

Regards
  Daniel




Re: new syntax available

2015-10-14 Thread Dominique Pellé
Daniel Naber  wrote:

> On 2015-10-11 12:31, Daniel Naber wrote:
>
> >> Use of "exact-meaning" would be very rare.
> >> Maybe a better name: 
> >
> > I think that's okay with me, but I need to think more about it. Maybe
> > the easiest implementation would be to just replace " " by "\s+" before
> > the regex is applied (but not in "[...]")?
>
> I've just committed a change so that <regexp> rules are now 'smart' by
> default, i.e. you can use a space in the regex and it will internally be
> converted to "\s+" (actually even to "(?:\s+)").
>
> I also wanted the smart type to add \b around the regex, but it's not
> that easy. For example, if you have Dr\., you'd get the
> expression "\bDr\.\b", which will not match when e.g. a space follows,
> as the dot is not a boundary character. I'll search for a better
> solution.

Thanks for the (?:\s+) change!
How about... (?:[\s\xA0]+) instead?

If the automatic \b is not easy, then we should not bother.
I can see in your example why it's not easy.
Adding \b manually is OK.

Being able to highlight part of the regexp would be useful with
<regexp>...(...)</regexp>.  Most of the places
where I'm thinking of using <regexp> would need it.

It would also be useful if each group captured in the regexp
could be re-used with \1 \2 \3 etc. (or <match no="1"/> ...) inside
the <message> or <suggestion>.

Thanks again
Dominique



Re: new syntax available

2015-10-14 Thread Daniel Naber
On 2015-10-11 12:31, Daniel Naber wrote:

>> Use of "exact-meaning" would be very rare.
>> Maybe a better name: 
> 
> I think that's okay with me, but I need to think more about it. Maybe
> the easiest implementation would be to just replace " " by "\s+" before
> the regex is applied (but not in "[...]")?

I've just committed a change so that <regexp> rules are now 'smart' by
default, i.e. you can use a space in the regex and it will internally be 
converted to "\s+" (actually even to "(?:\s+)").

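A rough Java sketch of that conversion idea (illustration only, not the actual LT
implementation): literal spaces outside character classes become "(?:\s+)":

  import java.util.regex.Pattern;

  public class SmartSpaces {
    // Replace literal spaces with "(?:\s+)", skipping spaces inside [...] classes.
    static String makeSmart(String regex) {
      StringBuilder sb = new StringBuilder();
      boolean inCharClass = false;
      for (int i = 0; i < regex.length(); i++) {
        char c = regex.charAt(i);
        boolean escaped = i > 0 && regex.charAt(i - 1) == '\\';
        if (c == '[' && !escaped) {
          inCharClass = true;
        } else if (c == ']' && !escaped) {
          inCharClass = false;
        }
        if (c == ' ' && !inCharClass) {
          sb.append("(?:\\s+)");
        } else {
          sb.append(c);
        }
      }
      return sb.toString();
    }

    public static void main(String[] args) {
      String smart = makeSmart("girls day");
      System.out.println(smart);  // girls(?:\s+)day
      System.out.println(Pattern.compile(smart).matcher("girls\n   day").find()); // true
    }
  }
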
I also wanted the smart type to add \b around the regex, but it's not 
that easy. For example, if you have Dr\., you'd get the 
expression "\bDr\.\b", which will not match when e.g. a space follows, 
as the dot is not a boundary character. I'll search for a better 
solution.

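The Dr\. problem can be reproduced directly with java.util.regex:

  import java.util.regex.Pattern;

  public class BoundaryPitfall {
    public static void main(String[] args) {
      String text = "Dr. Smith will see you.";
      // '.' and the following space are both non-word characters, so there is
      // no word boundary between them and the naive \b...\b wrapping fails:
      System.out.println(Pattern.compile("\\bDr\\.\\b").matcher(text).find()); // false
      // Without the trailing \b it matches as expected:
      System.out.println(Pattern.compile("\\bDr\\.").matcher(text).find());    // true
    }
  }
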
Regards
  Daniel




Re: new syntax available

2015-10-11 Thread Daniel Naber
On 2015-10-10 06:16, Dominique Pellé wrote:

> I'm not sure I understand how it would work for users.

My idea was that it would work automatically. But you're right that 
users might also paste text with line breaks, and my idea of having a 
parsing or normalization (when reading the input) outside of core might 
not work properly with that case.

> Use of "exact-meaning" would be very rare.
> Maybe a better name: 

I think that's okay with me, but I need to think more about it. Maybe 
the easiest implementation would be to just replace " " by "\s+" before 
the regex is applied (but not in "[...]")?

Regards
  Daniel




Re: new syntax available

2015-10-09 Thread Purodha Blissenbach
On 10.10.2015 06:16, Dominique Pellé wrote:
> Daniel Naber wrote:
>
>> On 2015-10-09 07:32, Dominique Pellé wrote:
>>
>>> I suppose that I care more than most because I only use LT to check
>>> text files where the situation is frequent.
>>
>> I think normalizing the text makes sense if:
>> 1) single line breaks get removed from plain text files (but not 
>> double
>> spaces)
>> 2) this normalization doesn't happen in LT core, but in the 
>> command-line
>> client
>>
>> My understanding is that's not enough for your use case as you use
>> spaces for indentation? For me, this sounds like a general input 
>> format
>> issue, just like people want to use LT to check LaTeX. We cannot 
>> support
>> that in the core, but if we find a way to do it outside that would 
>> be
>> okay for me. We just need to avoid becoming a parser for every 
>> format
>> out there.
>>
>> We already have the concept of annotated text[1], I think this could 
>> be
>> used to check plain text files. "\n" is then markup just like "" 
>> is
>> markup in XML. So we don't need normalization in that sense, but we 
>> need
>> to parse the input.
>>
>> [1]
>> 
>> https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html
>
> I'm not sure I understand how it would work for users.
> Would users have to give an option? Command line, or check box
> for the GUI? That seems unfortunate, since it worked well before
> without specifying an option, which users may not be aware of.
>
> I wonder how many users copy paste text in the web interface
> of LT. Those users will also have degraded experience.
>
> I seem to be the only one really bothered with the regression.
> I don't mean to be too negative about it. I like the new <regexp>
> feature, but I don't like the regression because the plain text format is
> ubiquitous and many text files use multiple double spaces as
> well as line breaks in sentences.
>
> I could instead use \s+ in regexp for fr, eo, br that I maintain.
> But it's not nice if only those 3 languages work.
> And yes, it would clutter regexps, but I'd still find it acceptable.
>
> Mike Unwalla wrote:
>
>> I understand why you want to preprocess text. Sometimes, I have a 
>> similar
>> problem. Sometimes, I want to ignore multiple spaces, line breaks, 
>> and tab
>> characters.
>>
>> However, automatically ignoring such text could cause problems. For 
>> example,
>> not all double spaces are errors. For the Netherlands, "there should 
>> be a
>> double space between the postcode and the post town"
>> 
>> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).
>
> That's true.  It's a rare case, but it's good to be able to detect
> such errors.
>
> Ironically, the example given in your link does not respect
> the rule it preaches for the Dutch address, since I see only one space
> between the postcode and the post town in "2312 BK LEIDEN".
> The address in Luxembourg is also misspelled (Longway -> Longwy)
> but that's off-topic.
>
> Your link gives me the idea of writing semantic rules to check
> address formatting in various countries. Examples of rules for
> checking addresses in France:
> - house number should be before street name
> - postal code should be before city name
> - postal code should be 5 digits without space (29200 is ok, 29 200 
> is wrong)
> - etc.
>
> Good example:
> 23 Rue de l’église
> 29200 BREST
> FRANCE
>
> Bad example (postal code after city name):
>23 Rue de l’église
>BREST 29200
>FRANCE
>
> The <regexp> feature will be great for such rules.
> Something like this may work (not tested):
>
> 
> \b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b
>
>
>> I did not mean that you should not preprocess text. I meant that you 
>> should
>> not mess with the meaning of a regexp.
>>
>> Possibly, we can solve the conflict by having 2 types of <regexp>:
>> 
>> 
>
> That would be ideal in my opinion.
> Use of "exact-meaning" would be very rare.
> Maybe a better name: 
>
> Regards
> Dominique

How about making preprocessing explicit in the rule set like this:


   foo bar
   ...

foo  bar

Purodha



Re: new syntax available

2015-10-09 Thread Dominique Pellé
Daniel Naber wrote:

> On 2015-10-09 07:32, Dominique Pellé wrote:
>
>> I suppose that I care more than most because I only use LT to check
>> text files where the situation is frequent.
>
> I think normalizing the text makes sense if:
> 1) single line breaks get removed from plain text files (but not double
> spaces)
> 2) this normalization doesn't happen in LT core, but in the command-line
> client
>
> My understanding is that's not enough for your use case as you use
> spaces for indentation? For me, this sounds like a general input format
> issue, just like people want to use LT to check LaTeX. We cannot support
> that in the core, but if we find a way to do it outside that would be
> okay for me. We just need to avoid becoming a parser for every format
> out there.
>
> We already have the concept of annotated text[1], I think this could be
> used to check plain text files. "\n" is then markup just like "" is
> markup in XML. So we don't need normalization in that sense, but we need
> to parse the input.
>
> [1]
> https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html

I'm not sure I understand how it would work for users.
Would users have to give an option? Command line, or check box
for the GUI? That seems unfortunate, since it worked well before
without specifying an option, which users may not be aware of.

I wonder how many users copy paste text in the web interface
of LT. Those users will also have degraded experience.

I seem to be the only one really bothered with the regression.
I don't mean to be too negative about it. I like the new <regexp>
feature, but I don't like the regression because the plain text format is
ubiquitous and many text files use multiple double spaces as
well as line breaks in sentences.

I could instead use \s+ in the regexps for fr, eo and br, which I maintain.
But it's not nice if only those 3 languages work.
And yes, it would clutter the regexps, but I'd still find it acceptable.

Mike Unwalla wrote:

> I understand why you want to preprocess text. Sometimes, I have a similar
> problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
> characters.
>
> However, automatically ignoring such text could cause problems. For example,
> not all double spaces are errors. For the Netherlands, "there should be a
> double space between the postcode and the post town"
> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

That's true.  It's a rare case, but it's good to be able to detect
such errors.

Ironically, the example given in your link does not respect
the rule it preaches for the Dutch address, since I see only one space
between the postcode and the post town in "2312 BK LEIDEN".
The address in Luxembourg is also misspelled (Longway -> Longwy)
but that's off-topic.

Your link gives me the idea of writing semantic rules to check
address formatting in various countries. Examples of rules for
checking addresses in France:
- house number should be before street name
- postal code should be before city name
- postal code should be 5 digits without space (29200 is ok, 29 200 is wrong)
- etc.

Good example:
23 Rue de l’église
29200 BREST
FRANCE

Bad example (postal code after city name):
   23 Rue de l’église
   BREST 29200
   FRANCE

The <regexp> feature will be great for such rules.
Something like this may work (not tested):

\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b

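For what it's worth, a quick Java check of a slightly adjusted version of that
sketch (Boulevard spelled out, and \s* used before the postal code and FRANCE so
that unindented addresses also match); illustration only, not a proposed rule:

  import java.util.regex.Pattern;

  public class FrenchAddressCheck {
    public static void main(String[] args) {
      Pattern goodAddress = Pattern.compile(
          "\\b(Rue|Avenue|Av\\.|Place|Pl\\.|Boulevard|Boul\\.)\\b.*\\n\\s*\\d{5}\\s+\\p{Lu}.*\\n\\s*FRANCE\\b");
      String ok  = "23 Rue de l’église\n29200 BREST\nFRANCE";
      String bad = "23 Rue de l’église\nBREST 29200\nFRANCE";
      System.out.println(goodAddress.matcher(ok).find());  // true  - expected order
      System.out.println(goodAddress.matcher(bad).find()); // false - postal code after the city name
    }
  }
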

> I did not mean that you should not preprocess text. I meant that you should
> not mess with the meaning of a regexp.
>
> Possibly, we can solve the conflict by having 2 types of <regexp>:
> 
> 

That would be ideal in my opinion.
Use of "exact-meaning" would be very rare.
Maybe a better name: 

Regards
Dominique



Re: new syntax available

2015-10-09 Thread Daniel Naber
On 2015-10-09 07:32, Dominique Pellé wrote:

> I suppose that I care more than most because I only use LT to check
> text files where the situation is frequent.

I think normalizing the text makes sense if:
1) single line breaks get removed from plain text files (but not double 
spaces)
2) this normalization doesn't happen in LT core, but in the command-line 
client

My understanding is that's not enough for your use case as you use 
spaces for indentation? For me, this sounds like a general input format 
issue, just like people want to use LT to check LaTeX. We cannot support 
that in the core, but if we find a way to do it outside that would be 
okay for me. We just need to avoid becoming a parser for every format 
out there.

We already have the concept of annotated text[1], I think this could be 
used to check plain text files. "\n" is then markup just like "" is 
markup in XML. So we don't need normalization in that sense, but we need 
to parse the input.

[1] 
https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html

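A rough sketch of how that could look with the AnnotatedText API from [1],
assuming the builder's addText/addMarkup/build methods and
JLanguageTool.check(AnnotatedText) work as the Javadoc suggests (untested,
illustration only):

  import org.languagetool.JLanguageTool;
  import org.languagetool.language.French;
  import org.languagetool.markup.AnnotatedText;
  import org.languagetool.markup.AnnotatedTextBuilder;
  import org.languagetool.rules.RuleMatch;

  public class PlainTextAsMarkup {
    public static void main(String[] args) throws Exception {
      // The line break is treated as markup, so rules see one logical sentence
      // while reported offsets still refer to the original (wrapped) input.
      AnnotatedText text = new AnnotatedTextBuilder()
          .addText("Un film dénudé de tout ")
          .addMarkup("\n")
          .addText("intérêt.")
          .build();
      JLanguageTool lt = new JLanguageTool(new French());
      for (RuleMatch match : lt.check(text)) {
        System.out.println(match.getRule().getId()
            + " at " + match.getFromPos() + "-" + match.getToPos());
      }
    }
  }
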
Regards
  Daniel




RE: new syntax available

2015-10-09 Thread Mike Unwalla
I understand why you want to preprocess text. Sometimes, I have a similar
problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
characters.

However, automatically ignoring such text could cause problems. For example,
not all double spaces are errors. For the Netherlands, "there should be a
double space between the postcode and the post town"
(http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

I did not mean that you should not preprocess text. I meant that you should
not mess with the meaning of a regexp.

Possibly, we can solve the conflict by having 2 types of <regexp>:



Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-Original Message-
From: Dominique Pellé [mailto:dominique.pe...@gmail.com] 
Sent: 09 October 2015 06:33

Mike Unwalla  wrote:
> I agree with Purodha. Do not be 'smart'. Do not change the meaning of a
regexp.
>
> Regards,
>
> Mike Unwalla


OK. It looks like the majority does not want to pre-process the sentence
to remove consecutive spaces (including tabs, DOS/Unix newlines, form
feeds, vertical spaces, non-breaking spaces) before matching the regexp.
So I will go with that.

On the other hand, nobody has indicated how to avoid the regression. A
line break between words, for example, typically doesn't happen in
LibreOffice documents or in our tests, but often happens in text files. In
emails, line breaks are used to avoid lines longer than ~80 characters. Taking the
German rule GIRLS_DAY for example, it will now fail to match when "girl's
day" is split across a line break, as in this sentence. I see this as a severe
regression.



Regards
Dominique




Re: new syntax available

2015-10-08 Thread Dominique Pellé
Mike Unwalla  wrote:
> I agree with Purodha. Do not be 'smart'. Do not change the meaning of a 
> regexp.
>
> Regards,
>
> Mike Unwalla


OK. It looks like the majority does not want to pre-process the sentence
to remove consecutive spaces (including tabs, DOS/Unix newlines, form
feeds, vertical spaces, non-breaking spaces) before matching the regexp.
So I will go with that.

On the other hand, nobody has indicated how to avoid the regression. A
line break between words, for example, typically doesn't happen in
LibreOffice documents or in our tests, but often happens in text files. In
emails, line breaks are used to avoid lines longer than ~80 characters. Taking the
German rule GIRLS_DAY for example, it will now fail to match when "girl's
day" is split across a line break, as in this sentence. I see this as a severe
regression.

I suppose that I care more than most because I only use LT to check
text files where the situation is frequent.

For grammar.xml files that I maintain (br, eo, fr), I will use \s+ or even
[\s\xA0]+ in the regexp to make it work.  But I can change later if
another solution is decided.

Regards
Dominique



RE: new syntax available

2015-10-08 Thread Mike Unwalla
I agree with Purodha. Do not be 'smart'. Do not change the meaning of a regexp.

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-Original Message-
From: Purodha Blissenbach [mailto:puro...@blissenbach.org] 

>
> 1) should we write regex like foo\s+bar
>
> 2) or should <regexp> be smart and automatically treat
> all sequences of spaces/tabs/newlines/unbreakable spaces
> as if it was one space?

I suggest version 1, since 2 would alter the usual meaning of regular 
expressions which I believe is a bad idea.

Purodha






Re: new syntax available

2015-10-08 Thread Olivier (Grammalecte)
On 07/10/2015 19:39, Dominique Pellé wrote:

> Perhaps Olivier R. in CC (the author of Grammalecte) can comment on
> whether there is an implicit \b at beginning and end of regexps.
> Is the format of Grammalecte rules documented?

In Grammalecte, word boundaries are explicit.

The tags [Word], [word], [Char], [char] are commands to describe the
behaviour of following rules.

[Word] and [word] mean that word boundaries will be added to every regex
of the following rules. [Char] and [char] mean that no word boundaries are
added to the following rules.

[Word] and [Char] mean that rules are case insensitive.
[word] and [char] mean that rules are case sensitive.

But that’s the old way.

In the new beta of Grammalecte (0.5.0b), word boundaries are still
explicit, but it’s easier to set parameters for casing and word boundaries.

At the beginning of each rule, there are tags for parameters and options.

__[i]__  Word boundaries on both sides. Case insensitive.
  No word boundaries. Case sensitive.
__[u>__  Word boundary on left side only. Uppercase if you can.

For instance, the old rule

  __typo__  \betc([.][.][.]|…) -> etc. # Un seul point après « etc. »

is written now:

  __[i>/typo__ etc([.][.][.]|…) -> etc. # Un seul point après « etc. »


and

  [Word]
  __tu__  science fiction -> science-fiction   # Il manque…

is written now:

  __[i]/tu__  science fiction -> science-fiction   # Il manque…


HTH.

Regards,
Olivier




Re: new syntax available

2015-10-08 Thread Dominique Pellé
Daniel Naber wrote:

> On 2015-10-08 06:59, Dominique Pellé wrote:

>> ... then the regexp rule does not detect all the errors
>> that the  rule detected. It does not detect errors
>> in "foo  bar"  (2 spaces or more, or tabs) or when there is a
>> new line as in:
>>
>>   foo
>>   bar
>>
>> How to fix it?
>
> I don't think it should be fixed, as two consecutive spaces is usually
> an error that should be fixed first.

The double space is meant to be caught by another rule.
I have text files that are indented with spaces or sometimes
justified, and so may use several spaces; they also contain
new lines in the middle of sentences. For such files, I disable
the rule WHITESPACE_RULE, but I still want to catch other errors
like "foo  bar" as in my example, which were caught before when
using a <pattern> with separate <token> elements...


> Using \s+ for all spaces makes the
> regex very difficult to read.

I agree: it clutters the regexp, especially if it's needed in
many places, which is the opposite of the goal
of <regexp>, which was precisely to make it easier to maintain.

That's why I also proposed solution 2).


Purodha Blissenbach wrote:

> I suggest version 1, since 2 would alter the usual
> meaning of regular expressions which I believe is
> a bad idea.

Not necessarily.  The regexp could still be the unmodified regex.
It's the sentence that can be pre-processed before matching,
replacing all sequences of consecutive whitespace (spaces,
tabs, newlines and even other Unicode spaces) with a
single space.  So the regexp ends up being matched against
"foo bar" (1 space) instead of "foo  bar" (2 or more spaces).
Thinking further about it, this would be my preference.

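A minimal Java sketch of that pre-processing idea (illustration only; note that
collapsing whitespace shifts character offsets, which LT would have to map back
for underlining):

  import java.util.regex.Pattern;

  public class NormalizeBeforeMatch {
    public static void main(String[] args) {
      String sentence = "foo \u00A0\t\n  bar";   // spaces, no-break space, tab, newline
      // Collapse every run of whitespace (including U+00A0) to a single space:
      String normalized = sentence.replaceAll("[\\s\\u00A0]+", " ");
      System.out.println("[" + normalized + "]");                                // [foo bar]
      System.out.println(Pattern.compile("foo bar").matcher(normalized).find()); // true
    }
  }
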
Regards
Dominique



Re: new syntax available

2015-10-08 Thread Daniel Naber
On 2015-10-08 06:59, Dominique Pellé wrote:

> ... then the regexp rule does not detect all the errors
> that the  rule detected. It does not detect errors
> in "foo  bar"  (2 spaces or more, or tabs) or when there is a
> new line as in:
> 
>   foo
>   bar
> 
> How to fix it?

I don't think it should be fixed, as two consecutive spaces is usually
an error that should be fixed first. Using \s+ for all spaces makes the
regex very difficult to read.

Regards
  Daniel




Re: new syntax available

2015-10-07 Thread Purodha Blissenbach


On 08.10.2015 06:59, Dominique Pellé wrote:
> Daniel Naber wrote:
>
>> On 2015-10-07 06:41, Dominique Pellé wrote:
>>
>> Hi Dominique,
>>
>> thanks for your feedback.
>
> One more remark:
>
> If I replace a rule like...
>
> <pattern>
>   <token>foo</token>
>   <token>bar</token>
> </pattern>
>
> ... with ...
>
> <regexp>foo bar</regexp>
>
> ... then the regexp rule does not detect all the errors
> that the  rule detected. It does not detect errors
> in "foo  bar"  (2 spaces or more, or tabs) or when there is a
> new line as in:
>
>   foo
>   bar
>
> How to fix it?
>
> 1) should we write regex like foo\s+bar
>
> 2) or should <regexp> be smart and automatically treat
> all sequences of spaces/tabs/newlines/unbreakable spaces
> as if it was one space?

I suggest version 1, since 2 would alter the usual meaning of regular 
expressions which I believe is a bad idea.

Purodha



Re: new syntax available

2015-10-07 Thread Dominique Pellé
Daniel Naber wrote:

> On 2015-10-07 06:41, Dominique Pellé wrote:
>
> Hi Dominique,
>
> thanks for your feedback.

One more remark:

If I replace a rule like...

<pattern>
  <token>foo</token>
  <token>bar</token>
</pattern>

... with ...

<regexp>foo bar</regexp>

... then the regexp rule does not detect all the errors
that the  rule detected. It does not detect errors
in "foo  bar"  (2 spaces or more, or tabs) or when there is a
new line as in:

  foo
  bar

How to fix it?

1) should we write regex like foo\s+bar

2) or should <regexp> be smart and automatically treat
all sequences of spaces/tabs/newlines/unbreakable spaces
as if it was one space?

Regards
Dominique



Re: new syntax available

2015-10-07 Thread Dominique Pellé
Daniel Naber  wrote:

> On 2015-10-07 06:41, Dominique Pellé wrote:
>
> Hi Dominique,
>
> thanks for your feedback.
>
>> 1) How do I highlight only a subset of the match?   Trying the above
>> rule, I see this:
>
> That's not yet possible, but I like the idea of a 'marker' attribute.
> I'll add that to my TODO list.

Good.

>> 2) Is there always an implicit word boundary at the beginning or end
>> of <regexp>?
>
> There's no implicit boundary. How does Grammalecte deal with this?

I'm not sure.

I see rules like this with explicit \b:

__typo__  \betc([.][.][.]|…) -> etc. # Un seul point après « etc. »

On the other hand, most rules are without \b like this:

__tu__  science fiction -> science-fiction   # Il manque un trait d’union.

Perhaps Olivier R. in CC (the author of Grammalecte) can comment on
whether there is an implicit \b at beginning and end of regexps.
Is the format of Grammalecte rules documented?

I think that the best for LT would be to add \b implicitly at the
beginning and end of the regexp, but to have a way to disable it,
which will rarely be needed.
I can't think of a good short name for that option.

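One possible heuristic, sketched in Java (illustration only, not a proposal for
the actual implementation): add \b only where the expression starts or ends with
a word character, which avoids breaking patterns like Dr\.:

  import java.util.regex.Pattern;

  public class ImplicitBoundaries {
    // Add \b only where the expression starts/ends with a word character.
    // (Simplistic: an expression starting with '(' or '[' would not get a \b.)
    static String addBoundaries(String regex) {
      String result = regex;
      if (Character.isLetterOrDigit(regex.charAt(0))) {
        result = "\\b" + result;
      }
      if (Character.isLetterOrDigit(regex.charAt(regex.length() - 1))) {
        result = result + "\\b";
      }
      return result;
    }

    public static void main(String[] args) {
      System.out.println(addBoundaries("girls day")); // \bgirls day\b
      System.out.println(addBoundaries("Dr\\."));     // \bDr\.  (no \b after the dot)
      System.out.println(Pattern.compile(addBoundaries("Dr\\."))
          .matcher("Dr. Smith").find());              // true
    }
  }
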
>> I wonder whether there is a performance impact.
>
> I just ran a performance test and changing 320 German rules to regex
> makes checking ~10% slower. For me, 10% is not a value I care about,
> especially as other languages like English are much slower anyway.

OK. 10% isn't that small in my opinion.  I'll probably end
up using <regexp> only when it helps to reduce >= 2 rules
into 1 rule, mostly because it makes grammar.xml more
maintainable. Maybe having fewer rules will then even
compensate for the slowdown due to regexp matching on
sentences.

The slowdown could depend on the text you check: I'd
expect it to be worse on very long sentences if Java's regexp
engine is worse than O(n) for some regexps, where n is the number
of characters in the matched sentence. But if the regexps are simple enough,
I suspect they will not trigger complexity worse than O(n).

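A small Java illustration of the worst case with java.util.regex (a backtracking,
NFA-style engine): a harmless-looking pattern with nested quantifiers can take
exponential time on a non-matching input:

  import java.util.regex.Pattern;

  public class BacktrackingDemo {
    public static void main(String[] args) {
      Pattern pathological = Pattern.compile("(a+)+b");
      String input = "a".repeat(26);   // 26 'a's and no 'b': no match is possible
      long start = System.nanoTime();
      boolean found = pathological.matcher(input).find();
      long millis = (System.nanoTime() - start) / 1_000_000;
      // java.util.regex tries roughly 2^n ways to split the run of 'a's before
      // giving up; a DFA-style engine such as RE2 answers this in linear time.
      System.out.println("found=" + found + " after " + millis + " ms");
    }
  }
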
Regards
Dominique



Re: new syntax available

2015-10-07 Thread Daniel Naber
On 2015-10-07 06:41, Dominique Pellé wrote:

Hi Dominique,

thanks for your feedback.

> 1) How do I highlight only a subset of the match?   Trying the above
> rule, I see this:

That's not yet possible, but I like the idea of a 'marker' attribute. 
I'll add that to my TODO list.

> 2) Is there always an implicit word boundary at the beginning or end
> of <regexp>?

There's no implicit boundary. How does Grammalecte deal with this?

> I wonder whether there is a performance impact.

I just ran a performance test and changing 320 German rules to regex 
makes checking ~10% slower. For me, 10% is not a value I care about, 
especially as other languages like English are much slower anyway.

Regards
  Daniel




Re: new syntax available

2015-10-06 Thread Dominique Pellé
Daniel Naber  wrote:

> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
> <token>foo</token>
>
> you can now use
>
> <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack, the new rules need to be tested and adapted manually. It also only
> touches rules without '' elements. There's no <marker> for
> regexp; it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp etc.
>
> Please let me know how this works for you.
>
> Regards
>   Daniel


Hi Daniel

First of all, thanks for implementing it.  But I have questions or remarks :-)

To me, the idea of <regexp> is useful when we can merge many pattern
rules into a single one, which helps to reduce the number of rules and
improve the maintainability of the rules. I see this example in German:


&eigenname;

(girl|boy)['’`´‘]s day
Meinten Sie den Aktionstag \1s’
Day?
Der Girl's
Day findet einmal im Jahr statt.
Der Boy`s
Day ist ein Aktionstag.


(girls|boys)[´`‘] day
Meinten Sie den Aktionstag \1’
Day?
&eigenname;
Der Boys’ Day findet einmal
im Jahr statt.
Der Boys`
Day ist ein Aktionstag.


(girls|boys) day
Meinten Sie den Aktionstag \1’
Day?
Der Girls
Day findet einmal im Jahr statt.
Der Boys
Day ist ein Aktionstag.


(girls|boys)['’`´‘]day
Meinten Sie den Aktionstag \1’
Day?
Der
Girls'Day findet einmal im Jahr statt.
Der
Boys`Day ist ein Aktionstag.



The idea of <regexp> is that it should now be possible to have a
single rule instead of many rules, using something more or less like this:

<rule>
  <regexp>(girl|boy)s?[´`‘]?s? day</regexp>
  ...
</rule>

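A quick sanity check of the consolidated pattern in plain Java (with the
apostrophe class widened to include ' and ’ as in the original rules, and the
space made optional so that the Boys`Day case is covered too); illustration only:

  import java.util.regex.Pattern;

  public class GirlsDayPattern {
    public static void main(String[] args) {
      Pattern p = Pattern.compile("(girl|boy)s?['’`´‘]?s? ?day",
                                  Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
      String[] examples = {"Girl's Day", "Boys’ Day", "Girls Day", "Boys`Day"};
      for (String example : examples) {
        System.out.println(example + " -> " + p.matcher(example).find()); // true for all four
      }
    }
  }
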
I have not used <regexp> yet, and I have some questions before I use it.

1) How do I highlight only a subset of the match?   Trying the above
rule, I see this:

  Line 1, column 8, Rule ID: GIRLS_DAY[1]
  Message: Meinten Sie den Aktionstag 'girls’ Day'?
  Suggestion: girls’ Day
  It's a girl's day.
 ^^

But what if I wanted to highlight only the word girl? Maybe
highlighting the full pattern is OK in the above example, but where I'd
like to use <regexp>, I do not want to highlight the full pattern but
only part of it, possibly a single word.  For example, in these
French expressions...

a nouveau -> à nouveau
a plein temps -> à plein temps
a rude épreuve -> à rude épreuve
a vol d'oiseau -> à vol d'oiseau
... (etc, more cases in reality...)

I'm thinking of creating such a rule:

  a (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)

... but how do I say to highlight/underline only the word "a"?  I don't
see the equivalent of <marker>.

How about something like this?

  <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol
d['’´`‘]oiseau)</regexp>

... where the marker="1" attribute indicates to underline the captured
group #1 in the regexp, i.e. the word "a" in the above example?  What to underline
could even possibly be a portion of a word.

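In plain java.util.regex terms (illustration only), underlining just group #1
would amount to using that group's offsets instead of the full match's:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class MarkGroupOne {
    public static void main(String[] args) {
      Pattern p = Pattern.compile("(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)");
      Matcher m = p.matcher("Il travaille a plein temps.");
      if (m.find()) {
        // Full match span vs. the span of capture group 1 (the word to underline):
        System.out.println("full match: " + m.start() + "-" + m.end() + " '" + m.group() + "'");
        System.out.println("group 1   : " + m.start(1) + "-" + m.end(1) + " '" + m.group(1) + "'");
      }
    }
  }
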

2) Is there always an implicit word boundary at the beginning or end
of <regexp>?

In the German grammar, I see this for example:
  <regexp>Elisabeth (Selber|Selberth)\b</regexp>

Since I see \b at the end, it suggests that there is no implicit \b.
But I don't see it at the beginning, so I wonder whether \b is needed or not.
Having an attribute on <regexp> to disable the implicit \b could
be useful sometimes. Enabling it by default would be best.

3) I see that the German grammar now uses <regexp> in many rules, even for
very simple patterns like:

  <regexp>zu letzt</regexp>

I wonder whether there is a performance impact.  Here, the older way
of using <token> still seemed acceptable to me, and possibly faster
(no regexp). Keep in mind that regexp matching of long phrases can
be slow for some regexps.  This depends on the regexp
engine.  A DFA regexp engine should be O(n), where n is the length of
the line I think, whereas NFA engines can be much, much slower.
But DFA engines are typically slower to compile regexps than NFA ones,
use more memory and have more limitations (often, no back references).

See https://swtch.com/~rsc/regexp/regexp1.html

Java uses an NFA regexp engine I think.  So this means that regexp matching
of long phrases could run into 

Re: new syntax available

2015-10-06 Thread Daniel Naber
On 2015-10-06 22:02, Jaume Ortolà i Font wrote:

> Thanks, Daniel. It is very useful.
> 
> Do you suggest converting all simple rules to this new syntax? Do you
> expect some improvement in performance?

I leave that up to the maintainer of each language. The reason for this 
feature is that some rules can be expressed more clearly and/or more compactly 
with the new syntax. I don't expect performance improvements.

Regards
  Daniel




Re: new syntax available

2015-10-06 Thread Jaume Ortolà i Font
Thanks, Daniel. It is very useful.

Do you suggest converting all simple rules to this new syntax? Do you
expect some improvement in performance?

Regards,
Jaume


2015-10-05 14:59 GMT+02:00 Daniel Naber :

> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
> <token>foo</token>
>
> you can now use
>
> <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack, the new rules need to be tested and adapted manually. It also only
> touches rules without '' elements. There's no <marker> for
> regexp; it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp etc.
>
> Please let me know how this works for you.
>
> Regards
>   Daniel
>
>
>


Re: new syntax available

2015-10-06 Thread Jan Schreiber
That's great news, Daniel! A very powerful new feature.

On 05.10.2015 14:59, Daniel Naber wrote:
> Hi,
> 
> there's now a first and limited implementation of the <regexp> syntax in 
> master. Instead of
> 
> <token>foo</token>
> 
> you can now use
> 
> <regexp>foo</regexp>
> 



new syntax available

2015-10-05 Thread Daniel Naber
Hi,

there's now a first and limited implementation of the <regexp> syntax in 
master. Instead of

<token>foo</token>

you can now use

<regexp>foo</regexp>

But be aware that this is a real regular expression that ignores tokens, 
so it matches anything with the substring 'foo'. Also, the regular 
expression is case-insensitive by default. You can have a look at the 
German grammar.xml for many examples.

To make use of these, you can adapt and run RuleSimplifier in the dev 
package. It tries to convert simple rules automatically, but it's just a 
hack, the new rules need to be tested and adapted manually. It also only 
touches rules without '' elements. There's no <marker> for 
regexp; it's always the complete match that will be underlined. You 
obviously cannot use the regex to access the part-of-speech tags of the 
match. But replacements are also limited, e.g. changing case currently 
doesn't work. By using \1 you can access the first matching group, i.e. 
the first parenthesis group of the regexp etc.

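Two quick java.util.regex snippets that mirror the behaviour described above
(illustration only): case-insensitive substring matching, and \1 as the first
parenthesis group:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class RegexpElementSemantics {
    public static void main(String[] args) {
      // Case-insensitive substring matching over the sentence:
      System.out.println(Pattern.compile("foo", Pattern.CASE_INSENSITIVE)
          .matcher("Foobar is here.").find());          // true - matches inside "Foobar"

      // \1 in a suggestion corresponds to the first parenthesis group:
      Matcher m = Pattern.compile("(girls|boys) day", Pattern.CASE_INSENSITIVE)
          .matcher("The Girls Day happens once a year.");
      if (m.find()) {
        System.out.println(m.group(1) + "’ Day");       // "Girls’ Day"
      }
    }
  }
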
Please let me know how this works for you.

Regards
  Daniel

