Re: new <regexp> syntax available

2015-12-29 22:07 GMT+01:00 Dominique Pellé:
> Daniel Naber wrote:
>
>> On 2015-10-14 14:01, Dominique Pellé wrote:
>> ...
>>> It would also be useful if each group captured in the regexp
>>> could be re-used with \1 \2 \3 etc. (or <match> ...) inside
>>> the <message> or <suggestion>.
>>
>> That's possible already.
>
> Hi
>
> Using mark="1" and \1 does not completely work with <regexp>.
>
> I recently added the French rule DENUE_DENUDE but I could
> not get the suggestion to work, so I commented out the
> suggestion for now. The rule is...
> [...snip...]
>
> However, suggestions do not work. If I uncomment the
> green part and comment out the red part, I get this:
>
> $ echo "Un film dénudé de tout intérêt." | java -jar ./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar -l fr -v
> Expected text language: French
> Working on STDIN...
> 2648 rules activated for language French
> 2648 rules activated for language French
> Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s] de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
> Disambiguator log:
>
> un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
> RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]
>
> N[3]: film[film/N m s] -> film[film/N m s]
> RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]
>
> RB-PREPOSITION[1]: de[de/P] -> de[de/P]
>
> N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]
>
> N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]
>
> 1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
> Message: Confusion probable entre « dénudé » et 'dénudé de tout intérêt'.
> Suggestion: dénudé de tout intérêt
> Un film dénudé de tout intérêt.
>         ^^
>
> So the correct word is underlined, but the suggestion
> is incorrect. I expected the suggestion "dénué"
> but instead I get "dénudé de tout intérêt".

I made a change to fix the suggestion in the French rule DENUE_DENUDE in this checkin:
https://github.com/languagetool-org/languagetool/commit/c5aafc71e1a753c8876f7c110a9ccba906aa49c6

However, it's probably a workaround, as I would normally expect <match> in the suggestion to be replaced by the first capture, just like \1. Instead, <match> in the suggestion is replaced by the full part that matches the regexp.

Regards
Dominique

--
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
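The behaviour Dominique describes is consistent with how java.util.regex exposes a match: the whole match is group 0, and the first capture is group 1. A standalone sketch (class name and the much-simplified pattern are mine, not LanguageTool code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureDemo {
    // Returns {full match, first capture} for a simplified DENUE_DENUDE pattern.
    static String[] groups(String sentence) {
        Pattern p = Pattern.compile("\\b(dénudée?s?) de tout intérêt\\b");
        Matcher m = p.matcher(sentence);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Un film dénudé de tout intérêt.");
        System.out.println("group(0) = " + g[0]); // the full part that matches the regexp
        System.out.println("group(1) = " + g[1]); // the first capture, i.e. "dénudé"
    }
}
```

Replacing the suggestion with group(1) instead of group(0) is exactly the behaviour the commit above works around.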
Re: new <regexp> syntax available

Daniel Naber wrote:
> On 2015-10-14 14:01, Dominique Pellé wrote:
> ...
>> It would also be useful if each group captured in the regexp
>> could be re-used with \1 \2 \3 etc. (or <match> ...) inside
>> the <message> or <suggestion>.
>
> That's possible already.

Hi

Using mark="1" and \1 does not completely work with <regexp>.

I recently added the French rule DENUE_DENUDE but I could not get the suggestion to work, so I commented out the suggestion for now. The rule is...

<regexp mark="1">\b(?:(dénudée?s?) (?:de |d['´‘’′])(?:toute?s? )?(?:âme|apitoiement|ambigu[ïi]té|ambition|beauté|cause|charme|charisme|clarté|compassion|compétence|confort|connaissance|conscience|consistance|constance|contenu|contrepartie|crainte|créativité|culture|cynisme|difficulté|discrimination|envergure|intér[êe]t|émotion|esthétique|éthique|enjeux?|expertise|fantaisie|fondement|gentillesse|go[uû]t|grâce|haine|humanité|motif|imagination|inspiration|intention|inventivité|légitimité|logique|objectivité|maturité|méchanceté|mérite|paix|piété|plan|pertinence|peur|plaisir|politesse|préjugé|principe|professionnalisme|psychologie|qualité|raison|réalisme|remord|respect|revendication|rigueur|risque|sagesse|savoir|sens|sentiment|science|scrupule|soupçon|stress|sympathie|tact|talent|tendresse|toxicité|tromperie|valeur|vertu|violence|vision)s?)\b</regexp>

with the message "Confusion probable entre « dénudé » et « dénué ».", the URL http://bdl.oqlf.gouv.qc.ca/bdl/gabarit_bdl.asp?id=2542 and these example sentences:

Un film dénudé de tout intérêt.
Une personne dénudée de toute compassion.
Une histoire dénudée d’intérêt.
Des hommes dénudés de compassion.

The rule as above appears to work, it highlights only the expected captured word \1 (i.e. the word dénudé):

$ echo "Un film dénudé de tout intérêt." | java -jar ./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar -l fr -v
Expected text language: French
Working on STDIN...
2648 rules activated for language French
2648 rules activated for language French
Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s] de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
Disambiguator log:

un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]

N[3]: film[film/N m s] -> film[film/N m s]
RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]

RB-PREPOSITION[1]: de[de/P] -> de[de/P]

N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]

N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]

1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
Message: Confusion probable entre « dénudé » et « dénué ».
Un film dénudé de tout intérêt.
        ^^

However, suggestions do not work. If I uncomment the green part and comment out the red part, I get this:

$ echo "Un film dénudé de tout intérêt." | java -jar ./languagetool-standalone/target/LanguageTool-3.3-SNAPSHOT/LanguageTool-3.3-SNAPSHOT/languagetool-commandline.jar -l fr -v
Expected text language: French
Working on STDIN...
2648 rules activated for language French
2648 rules activated for language French
Un[un/D m s] film[film/N m s] dénudé[dénuder/V ppa m s,dénuder/J m s] de[de/P] tout[tout/N m s] intérêt[intérêt/N m s].[./M fin,]
Disambiguator log:

un[1]: Un[un/N m s*,un/D m s*] -> Un[un/D m s*]
RP-D_N_AMBIG[1]: Un[un/D m s*] -> Un[un/D m s*]

N[3]: film[film/N m s] -> film[film/N m s]
RP-D_N_AMBIG[1]: film[film/N m s] -> film[film/N m s]

RB-PREPOSITION[1]: de[de/P] -> de[de/P]

N[4]: tout[tout/N m s,tout/D m s,tout/A] -> tout[tout/N m s]

N[3]: intérêt[intérêt/N m s] -> intérêt[intérêt/N m s]

1.) Line 1, column 9, Rule ID: DENUE_DENUDE[1]
Message: Confusion probable entre « dénudé » et 'dénudé de tout intérêt'.
Suggestion: dénudé de tout intérêt
Un film dénudé de tout intérêt.
        ^^

So the correct word is underlined, but the suggestion is incorrect. I expected the suggestion "dénué" but instead I get "dénudé de tout intérêt".

Regards
Dominique
Re: new <regexp> syntax available

On 2015-10-07 06:41, Dominique Pellé wrote:
> How about something like this?
>
> <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>
>
> ... where the marker="1" attribute indicates to underline the captured
> group #1 in the regexp, i.e. the word "a" in the above example? What to
> underline could even possibly be a portion of a word.

This has now been implemented, only that the attribute is called "mark".

Regards
Daniel
RE: new <regexp> syntax available

On 2015-11-04 15:52, Mike Unwalla wrote:
> The rule gives me a message for the first example, but not for the second example:
> 1. Firstly, do the first test and secondly, do the second test.
> 2. Firstly, do the first test. Secondly, do the second test.

<regexp> rules are sentence-based, so it's expected that the second case doesn't match.

Regards
Daniel
RE: new <regexp> syntax available

I have this rule:

<regexp>firstly.*secondly</regexp>

with the message "This is a test of a regexp rule.", a short message "Regexp test", and these example sentences:

First, do the first test. Second, do the test again.
Firstly, do the first test and secondly, do the test again.
Firstly, do the first test. Secondly, do the test again.

Testrules gives no error messages.

The rule gives me a message for the first example, but not for the second example:
1. Firstly, do the first test and secondly, do the second test.
2. Firstly, do the first test. Secondly, do the second test.

Should the rule find the text in the second example, or is it designed to match text only if all the text is within a sentence?

Regards,
Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Daniel Naber [mailto:daniel.na...@languagetool.org]

you can now use

<regexp>foo</regexp>

But be aware that this is a real regular expression that ignores tokens, so it matches anything with the substring 'foo'.
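Daniel's answer elsewhere in this thread is that rules are applied sentence by sentence, so a regexp can never span two sentences. A sketch of that behaviour, assuming a naive sentence splitter (LanguageTool's real sentence tokenizer is more elaborate; class name is mine):

```java
import java.util.regex.Pattern;

public class SentenceScopeDemo {
    static final Pattern RULE = Pattern.compile("firstly.*secondly", Pattern.CASE_INSENSITIVE);

    // Applies the rule per sentence, as LanguageTool does (naive split after ". ").
    static boolean anySentenceMatches(String text) {
        for (String sentence : text.split("(?<=\\.)\\s+")) {
            if (RULE.matcher(sentence).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anySentenceMatches(
            "Firstly, do the first test and secondly, do the second test.")); // true
        System.out.println(anySentenceMatches(
            "Firstly, do the first test. Secondly, do the second test."));    // false
    }
}
```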
Re: new <regexp> syntax available

On 2015-10-14 14:01, Dominique Pellé wrote:

> Thanks for the (?:\s+) change!
> How about... (?:[\s\xA0]+) instead?

Done (written as \u00A0 in the regex).

> Being able to highlight part of the regexp would be useful with
> <regexp marker="1">...(...)</regexp>. Most of the places
> where I'm thinking of using <regexp> would need it.

It's still on my TODO list, but time is very limited.

> It would also be useful if each group captured in the regexp
> could be re-used with \1 \2 \3 etc. (or <match> ...) inside
> the <message> or <suggestion>.

That's possible already.

Regards
Daniel
Re: new <regexp> syntax available

Daniel Naber wrote:
> On 2015-10-11 12:31, Daniel Naber wrote:
>
>>> Use of "exact-meaning" would be very rare.
>>> Maybe a better name:
>>
>> I think that's okay with me, but I need to think more about it. Maybe
>> the easiest implementation would be to just replace " " by "\s+" before
>> the regex is applied (but not in "[...]")?
>
> I've just committed a change so that <regexp> are now 'smart' by
> default, i.e. you can use a space in the regex and it will internally be
> converted to "\s+" (actually even to "(?:\s+)").
>
> I also wanted the smart type to add \b around the regex, but it's not
> that easy. For example, if you have <regexp>Dr\.</regexp>, you'd get the
> expression "\bDr\.\b", which will not match when e.g. a space follows,
> as the dot is not a boundary character. I'll search for a better
> solution.

Thanks for the (?:\s+) change!
How about... (?:[\s\xA0]+) instead?

If the automatic \b is not easy, then we should not bother. I can see in your example why it's not easy. Adding \b manually is OK.

Being able to highlight part of the regexp would be useful with <regexp marker="1">...(...)</regexp>. Most of the places where I'm thinking of using <regexp> would need it.

It would also be useful if each group captured in the regexp could be re-used with \1 \2 \3 etc. (or <match> ...) inside the <message> or <suggestion>.

Thanks again
Dominique
Re: new <regexp> syntax available

On 2015-10-11 12:31, Daniel Naber wrote:
>> Use of "exact-meaning" would be very rare.
>> Maybe a better name:
>
> I think that's okay with me, but I need to think more about it. Maybe
> the easiest implementation would be to just replace " " by "\s+" before
> the regex is applied (but not in "[...]")?

I've just committed a change so that <regexp> are now 'smart' by default, i.e. you can use a space in the regex and it will internally be converted to "\s+" (actually even to "(?:\s+)").

I also wanted the smart type to add \b around the regex, but it's not that easy. For example, if you have <regexp>Dr\.</regexp>, you'd get the expression "\bDr\.\b", which will not match when e.g. a space follows, as the dot is not a boundary character. I'll search for a better solution.

Regards
Daniel
Re: new <regexp> syntax available

On 2015-10-10 06:16, Dominique Pellé wrote:
> I'm not sure I understand how it would work for users.

My idea was that it would work automatically. But you're right that users might also paste text with line breaks, and my idea of having a parsing or normalization (when reading the input) outside of core might not work properly with that case.

> Use of "exact-meaning" would be very rare.
> Maybe a better name:

I think that's okay with me, but I need to think more about it. Maybe the easiest implementation would be to just replace " " by "\s+" before the regex is applied (but not in "[...]")?

Regards
Daniel
Re: new <regexp> syntax available

On 10.10.2015 06:16, Dominique Pellé wrote:
> Daniel Naber wrote:
>
>> On 2015-10-09 07:32, Dominique Pellé wrote:
>>
>>> I suppose that I care more than most because I only use LT to check
>>> text files where the situation is frequent.
>>
>> I think normalizing the text makes sense if:
>> 1) single line breaks get removed from plain text files (but not double spaces)
>> 2) this normalization doesn't happen in LT core, but in the command-line client
>>
>> My understanding is that's not enough for your use case as you use
>> spaces for indentation? For me, this sounds like a general input format
>> issue, just like people want to use LT to check LaTeX. We cannot support
>> that in the core, but if we find a way to do it outside that would be
>> okay for me. We just need to avoid becoming a parser for every format
>> out there.
>>
>> We already have the concept of annotated text[1], I think this could be
>> used to check plain text files. "\n" is then markup just like "<b>" is
>> markup in XML. So we don't need normalization in that sense, but we need
>> to parse the input.
>>
>> [1] https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html
>
> I'm not sure I understand how it would work for users.
> Would users have to give an option? Command line, or a check box
> for the GUI? That seems unfortunate, since it worked well before
> without specifying an option, which users may not be aware of.
>
> I wonder how many users copy-paste text in the web interface
> of LT. Those users will also have a degraded experience.
>
> I seem to be the only one really bothered by the regression.
> I don't mean to be too negative about it. I like the new
> feature, but I don't like the regression, because the text format is
> ubiquitous and many text files use multiple double spaces as
> well as line breaks in sentences.
>
> I could instead use \s+ in regexps for fr, eo, br that I maintain.
> But it's not nice if only those 3 languages work.
> And yes, it would clutter regexps, but I'd still find it acceptable.
>
> Mike Unwalla wrote:
>
>> I understand why you want to preprocess text. Sometimes, I have a similar
>> problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
>> characters.
>>
>> However, automatically ignoring such text could cause problems. For example,
>> not all double spaces are errors. For the Netherlands, "there should be a
>> double space between the postcode and the post town"
>> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).
>
> That's true. It's a rare case, but it's good to be able to detect
> such errors.
>
> Ironically, the example given in your link does not respect
> the rule it preaches for the Dutch address, since I see only one space
> between the postcode and the post town in "2312 BK LEIDEN".
> The address in Luxembourg is also misspelled (Longway -> Longwy),
> but that's off-topic.
>
> Your link gives me the idea of writing semantic rules to check
> address formatting in various countries. Examples of rules for
> checking addresses in France:
> - house number should be before street name
> - postal code should be before city name
> - postal code should be 5 digits without space (29200 is ok, 29 200 is wrong)
> - etc.
>
> Good example:
>   23 Rue de l’église
>   29200 BREST
>   FRANCE
>
> Bad example (postal code after city name):
>   23 Rue de l’église
>   BREST 29200
>   FRANCE
>
> The <regexp> feature will be great for such rules.
> Something like this may work (not tested):
>
> \b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b
>
>> I did not mean that you should not preprocess text. I meant that you should
>> not mess with the meaning of a regexp.
>>
>> Possibly, we can solve the conflict by having 2 types of <regexp>:
>
> That would be ideal in my opinion.
> Use of "exact-meaning" would be very rare.
> Maybe a better name:

How about making preprocessing explicit in the rule set like this:

<regexp …>foo bar</regexp>
...
<regexp …>foo bar</regexp>

Purodha
Re: new <regexp> syntax available

Daniel Naber wrote:
> On 2015-10-09 07:32, Dominique Pellé wrote:
>
>> I suppose that I care more than most because I only use LT to check
>> text files where the situation is frequent.
>
> I think normalizing the text makes sense if:
> 1) single line breaks get removed from plain text files (but not double spaces)
> 2) this normalization doesn't happen in LT core, but in the command-line client
>
> My understanding is that's not enough for your use case as you use
> spaces for indentation? For me, this sounds like a general input format
> issue, just like people want to use LT to check LaTeX. We cannot support
> that in the core, but if we find a way to do it outside that would be
> okay for me. We just need to avoid becoming a parser for every format
> out there.
>
> We already have the concept of annotated text[1], I think this could be
> used to check plain text files. "\n" is then markup just like "<b>" is
> markup in XML. So we don't need normalization in that sense, but we need
> to parse the input.
>
> [1] https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html

I'm not sure I understand how it would work for users. Would users have to give an option? Command line, or a check box for the GUI? That seems unfortunate, since it worked well before without specifying an option, which users may not be aware of.

I wonder how many users copy-paste text in the web interface of LT. Those users will also have a degraded experience.

I seem to be the only one really bothered by the regression. I don't mean to be too negative about it. I like the new feature, but I don't like the regression, because the text format is ubiquitous and many text files use multiple double spaces as well as line breaks in sentences.

I could instead use \s+ in regexps for fr, eo, br that I maintain. But it's not nice if only those 3 languages work. And yes, it would clutter regexps, but I'd still find it acceptable.

Mike Unwalla wrote:
> I understand why you want to preprocess text. Sometimes, I have a similar
> problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab
> characters.
>
> However, automatically ignoring such text could cause problems. For example,
> not all double spaces are errors. For the Netherlands, "there should be a
> double space between the postcode and the post town"
> (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

That's true. It's a rare case, but it's good to be able to detect such errors.

Ironically, the example given in your link does not respect the rule it preaches for the Dutch address, since I see only one space between the postcode and the post town in "2312 BK LEIDEN". The address in Luxembourg is also misspelled (Longway -> Longwy), but that's off-topic.

Your link gives me the idea of writing semantic rules to check address formatting in various countries. Examples of rules for checking addresses in France:
- house number should be before street name
- postal code should be before city name
- postal code should be 5 digits without space (29200 is ok, 29 200 is wrong)
- etc.

Good example:
  23 Rue de l’église
  29200 BREST
  FRANCE

Bad example (postal code after city name):
  23 Rue de l’église
  BREST 29200
  FRANCE

The <regexp> feature will be great for such rules. Something like this may work (not tested):

\b(Rue|Avenue|Av\.|Place|Pl\.|Boulevard|Boul\.)\s.*\n\s+\d{5}\s+\p{Lu}.*\n\s+FRANCE\b

> I did not mean that you should not preprocess text. I meant that you should
> not mess with the meaning of a regexp.
>
> Possibly, we can solve the conflict by having 2 types of <regexp>:

That would be ideal in my opinion. Use of "exact-meaning" would be very rare. Maybe a better name:

Regards
Dominique
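The address idea can be tried directly with java.util.regex. A hypothetical sketch based on the untested regexp above (class and field names are mine; the expression is simplified to the "postal code before city" check, uses \R for line breaks, which needs Java 8+, and deliberately avoids a trailing \b after the abbreviations for the reason discussed earlier in the thread):

```java
import java.util.regex.Pattern;

public class AddressDemo {
    // Hypothetical check: a 5-digit French postal code must start the line
    // that follows the street line, before the uppercase city name.
    static final Pattern FR_ADDRESS = Pattern.compile(
        "\\b(Rue|Avenue|Av\\.|Place|Pl\\.|Boulevard|Boul\\.)\\s.*\\R\\s*\\d{5}\\s+\\p{Lu}");

    public static void main(String[] args) {
        String good = "23 Rue de l’église\n29200 BREST\nFRANCE";
        String bad  = "23 Rue de l’église\nBREST 29200\nFRANCE";
        System.out.println(FR_ADDRESS.matcher(good).find()); // true
        System.out.println(FR_ADDRESS.matcher(bad).find());  // false
    }
}
```

Since rules are sentence-based, a real rule like this would also depend on the \n handling discussed in this thread.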
Re: new <regexp> syntax available

On 2015-10-09 07:32, Dominique Pellé wrote:
> I suppose that I care more than most because I only use LT to check
> text files where the situation is frequent.

I think normalizing the text makes sense if:
1) single line breaks get removed from plain text files (but not double spaces)
2) this normalization doesn't happen in LT core, but in the command-line client

My understanding is that's not enough for your use case as you use spaces for indentation? For me, this sounds like a general input format issue, just like people want to use LT to check LaTeX. We cannot support that in the core, but if we find a way to do it outside that would be okay for me. We just need to avoid becoming a parser for every format out there.

We already have the concept of annotated text[1], I think this could be used to check plain text files. "\n" is then markup just like "<b>" is markup in XML. So we don't need normalization in that sense, but we need to parse the input.

[1] https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html

Regards
Daniel
RE: new <regexp> syntax available

I understand why you want to preprocess text. Sometimes, I have a similar problem. Sometimes, I want to ignore multiple spaces, line breaks, and tab characters.

However, automatically ignoring such text could cause problems. For example, not all double spaces are errors. For the Netherlands, "there should be a double space between the postcode and the post town" (http://www.royalmail.com/personal/help-and-support/Addressing-your-items-Western-Europe).

I did not mean that you should not preprocess text. I meant that you should not mess with the meaning of a regexp.

Possibly, we can solve the conflict by having 2 types of <regexp>:

Regards,
Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Dominique Pellé [mailto:dominique.pe...@gmail.com]
Sent: 09 October 2015 06:33

Mike Unwalla wrote:
> I agree with Purodha. Do not be 'smart'. Do not change the meaning of a regexp.

OK. It looks like the majority does not want to pre-process the sentence to remove consecutive spaces (including tabs, dos/unix newlines, form feeds, vertical spaces, non-breaking spaces) before matching the regexp. So I will go with that.

On the other hand, nobody indicates how to avoid the regression. A line break between words typically doesn't happen in LibreOffice documents or in our tests, but often happens in text files. In emails, line breaks are used to avoid lines longer than ~80 chars. Taking the German rule GIRLS_DAY for example, it will now fail to match when "girl's day" is on a broken line as in this sentence. I see this as a severe regression.

Regards
Dominique
Re: new <regexp> syntax available

Mike Unwalla wrote:
> I agree with Purodha. Do not be 'smart'. Do not change the meaning of a regexp.
>
> Regards,
>
> Mike Unwalla

OK. It looks like the majority does not want to pre-process the sentence to remove consecutive spaces (including tabs, dos/unix newlines, form feeds, vertical spaces, non-breaking spaces) before matching the regexp. So I will go with that.

On the other hand, nobody indicates how to avoid the regression. A line break between words typically doesn't happen in LibreOffice documents or in our tests, but often happens in text files. In emails, line breaks are used to avoid lines longer than ~80 chars. Taking the German rule GIRLS_DAY for example, it will now fail to match when "girl's day" is on a broken line as in this sentence. I see this as a severe regression.

I suppose that I care more than most because I only use LT to check text files where the situation is frequent.

For the grammar.xml files that I maintain (br, eo, fr), I will use \s+ or even [\s\xA0]+ in the regexp to make it work. But I can change later if another solution is decided.

Regards
Dominique
RE: new <regexp> syntax available

I agree with Purodha. Do not be 'smart'. Do not change the meaning of a regexp.

Regards,
Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Purodha Blissenbach [mailto:puro...@blissenbach.org]

>> 1) should we write regexes like <regexp>foo\s+bar</regexp>
>>
>> 2) or should <regexp> be smart and automatically treat
>> all sequences of spaces/tabs/newlines/unbreakable spaces
>> as if it was one space?

I suggest version 1, since 2 would alter the usual meaning of regular expressions, which I believe is a bad idea.

Purodha
Re: new <regexp> syntax available

On 07/10/2015 19:39, Dominique Pellé wrote:
> Perhaps Olivier R. in CC (author of Grammalecte) can comment on
> whether there is an implicit \b at beginning and end of regexps.
> Is the format of Grammalecte rules documented?

In Grammalecte, word boundaries are explicit.

The tags [Word], [word], [Char], [char] are commands to describe the behaviour of the following rules. [Word] and [word] mean that word boundaries will be added to every regex of the following rules. [Char] and [char] mean that no word boundaries are added to the following rules. [Word] and [Char] mean that rules are case insensitive. [word] and [char] mean that rules are case sensitive.

But that’s the old way. In the new beta of Grammalecte (0.5.0b), word boundaries are still explicit, but it’s easier to set parameters for casing and word boundaries. At the beginning of each rule, there are tags for parameters and options:

__[i]__  word boundaries on both sides, case insensitive
__<s>__  no word boundaries, case sensitive
__[u>__  word boundary on the left side only, uppercase if you can
etc.

So

__typo__ \betc([.][.][.]|…) -> etc. # Un seul point après « etc. »

is written now:

__[i>/typo__ etc([.][.][.]|…) -> etc. # Un seul point après « etc. »

and

[Word]
__tu__ science fiction -> science-fiction # Il manque…

is written now:

__[i]/tu__ science fiction -> science-fiction # Il manque…

HTH.

Regards,
Olivier
Re: new <regexp> syntax available

Daniel Naber wrote:
> On 2015-10-08 06:59, Dominique Pellé wrote:
>> ... then the regexp rule does not detect all the errors
>> that the <pattern> rule detected. It does not detect errors
>> in "foo  bar" (2 spaces or more, or tabs) or when there is a
>> new line as in:
>>
>> foo
>> bar
>>
>> How to fix it?
>
> I don't think it should be fixed, as two consecutive spaces is usually
> an error that should be fixed first.

The double space is meant to be caught by another rule. I have text files that are indented with spaces or sometimes justified and so may use several spaces, and they also contain newlines in the middle of sentences. For such files, I disable rule WHITESPACE_RULE, but I still want to catch other errors like "foo  bar" as in my example, which were caught before when using token-based <pattern> rules.

> Using \s+ for all spaces makes the regex very difficult to read.

I agree: it clutters the regexp, especially if it's in many places, which is the opposite of the goal of <regexp>, which was precisely to make rules easier to maintain. That's why I also proposed solution 2).

Purodha Blissenbach wrote:
> I suggest version 1, since 2 would alter the usual
> meaning of regular expressions which I believe is
> a bad idea.

Not necessarily. The regexp could still be the unmodified regex. It's the sentence that can be pre-processed before matching, to replace all sequences of consecutive spaces (spaces, tabs, newlines and even other Unicode spaces) with a single space. So the regexp ends up being matched against "foo bar" (1 space) instead of "foo  bar" (2 or more spaces). Thinking further about it, this would be my preference.

Regards
Dominique
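The preprocessing alternative Dominique prefers, collapsing whitespace runs in the sentence while leaving the regexps untouched, can be sketched as (class and method names are mine):

```java
import java.util.regex.Pattern;

public class NormalizeDemo {
    // Collapses any run of whitespace (including no-break spaces) into a
    // single space before the rule regexps are applied.
    static String normalize(String sentence) {
        return sentence.replaceAll("[\\s\\u00A0]+", " ");
    }

    public static void main(String[] args) {
        String input = "foo \t\n bar";
        System.out.println(normalize(input)); // "foo bar"
        // An unmodified rule regexp with a plain space now matches:
        System.out.println(Pattern.compile("foo bar").matcher(normalize(input)).find()); // true
    }
}
```

The trade-off raised by Mike still applies: after normalization, a rule can no longer distinguish a double space from a single one.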
Re: new <regexp> syntax available

On 2015-10-08 06:59, Dominique Pellé wrote:
> ... then the regexp rule does not detect all the errors
> that the <pattern> rule detected. It does not detect errors
> in "foo  bar" (2 spaces or more, or tabs) or when there is a
> new line as in:
>
> foo
> bar
>
> How to fix it?

I don't think it should be fixed, as two consecutive spaces is usually an error that should be fixed first. Using \s+ for all spaces makes the regex very difficult to read.

Regards
Daniel
Re: new <regexp> syntax available

On 08.10.2015 06:59, Dominique Pellé wrote:
> Daniel Naber wrote:
>
>> On 2015-10-07 06:41, Dominique Pellé wrote:
>>
>> Hi Dominique,
>>
>> thanks for your feedback.
>
> One more remark:
>
> If I replace a rule like...
>
> <token>foo</token>
> <token>bar</token>
>
> ... into ...
>
> <regexp>foo bar</regexp>
>
> ... then the regexp rule does not detect all the errors
> that the <pattern> rule detected. It does not detect errors
> in "foo  bar" (2 spaces or more, or tabs) or when there is a
> new line as in:
>
> foo
> bar
>
> How to fix it?
>
> 1) should we write regexes like <regexp>foo\s+bar</regexp>
>
> 2) or should <regexp> be smart and automatically treat
> all sequences of spaces/tabs/newlines/unbreakable spaces
> as if it was one space?

I suggest version 1, since 2 would alter the usual meaning of regular expressions, which I believe is a bad idea.

Purodha
Re: new <regexp> syntax available

Daniel Naber wrote:
> On 2015-10-07 06:41, Dominique Pellé wrote:
>
> Hi Dominique,
>
> thanks for your feedback.

One more remark:

If I replace a rule like...

<token>foo</token>
<token>bar</token>

... into ...

<regexp>foo bar</regexp>

... then the regexp rule does not detect all the errors that the <pattern> rule detected. It does not detect errors in "foo  bar" (2 spaces or more, or tabs) or when there is a new line as in:

foo
bar

How to fix it?

1) should we write regexes like <regexp>foo\s+bar</regexp>

2) or should <regexp> be smart and automatically treat all sequences of spaces/tabs/newlines/unbreakable spaces as if it was one space?

Regards
Dominique
Re: new syntax available
Daniel Naber wrote: > On 2015-10-07 06:41, Dominique Pellé wrote: > > Hi Dominique, > > thanks for your feedback. > >> 1) How do I highlight only a subset of the match? Trying the above >> rule, I see this: > > That's not yet possible, but I like the idea of a 'marker' attribute. > I'll add that to my TODO list. Good. >> 2) Is there always an implicit word boundary at the beginning or end >> of ? > > There's no implicit boundary. How does Grammalecte deal with this? I'm not sure. I see rules like this with explicit \b: __typo__ \betc([.][.][.]|…) -> etc. # Un seul point après « etc. » On the other hand, most rules are without \b like this: __tu__ science fiction -> science-fiction # Il manque un trait d’union. Perhaps Oliver R. in CC (author of Grammalecte) can comment on whether there is an implicit \b at beginning and end of regexps. Is the format of Grammalecte rules documented? I think that the best for LT would be to use \b implicitly at beginning and end of the regexp, but have a option to disable it with which will rarely need to be used. I can't think of a good short name for that option. >> I wonder whether there is a performance impact. > > I just ran a performance test and changing 320 German rules to regex > makes checking ~10% slower. For me, 10% is not a value I care about, > especially as other languages like English are much slower anyway. OK. 10% isn't that small in my opinion. I'll probably end up using only when it helps to reduce >= 2 rules into 1 rule, mostly because it makes grammar.xml more maintainable. Maybe having less rules will then even compensate the slowdown due to regexp matching on sentences. The slow down could depend on the text you check: I'd expect it to be worse on very long phrases if Java DFA regexp engine is worse than O(n) for some regexp, where n is the number of char in the matched sentence. But if regexp are simple enough, they will not trigger complexity worse than O(n) I suspect. 
Regards
Dominique
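[Editor's note: a small sketch of why the explicit `\b` in the Grammalecte rule quoted above matters, using Java's `java.util.regex`. The pattern is adapted from the `__typo__` rule; the class name is made up for the demo.]

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Without \b the regex would also fire when "etc" is embedded
    // inside a longer word; with \b it only matches at a word start.
    static final Pattern NO_BOUNDARY = Pattern.compile("etc\\.\\.\\.");
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\betc\\.\\.\\.");

    public static void main(String[] args) {
        String insideWord = "caetc...";          // "etc" embedded in another token
        String standalone = "On continue, etc...";
        System.out.println(NO_BOUNDARY.matcher(insideWord).find());   // true: spurious match
        System.out.println(WITH_BOUNDARY.matcher(insideWord).find()); // false
        System.out.println(WITH_BOUNDARY.matcher(standalone).find()); // true
    }
}
```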
Re: new <regexp> syntax available
On 2015-10-07 06:41, Dominique Pellé wrote:

Hi Dominique,

thanks for your feedback.

> 1) How do I highlight only a subset of the match? Trying the above
> rule, I see this:

That's not yet possible, but I like the idea of a 'marker' attribute. I'll add that to my TODO list.

> 2) Is there always an implicit word boundary at the beginning or end
> of <regexp>?

There's no implicit boundary. How does Grammalecte deal with this?

> I wonder whether there is a performance impact.

I just ran a performance test and changing 320 German rules to regex makes checking ~10% slower. For me, 10% is not a value I care about, especially as other languages like English are much slower anyway.

Regards
Daniel
Re: new <regexp> syntax available
Daniel Naber wrote:
> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
>   <token>foo</token>
>
> you can now use
>
>   <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack; the new rules need to be tested and adapted manually. It also only
> touches rules without <marker> elements. There's no <marker> for
> regexp; it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp, etc.
>
> Please let me know how this works for you.
>
> Regards
> Daniel

Hi Daniel

First of all, thanks for implementing it. But I have questions and remarks :-)

To me, the idea of <regexp> is useful when we can merge many pattern rules into a single one, which helps to reduce the number of rules and improves the maintainability of the rules. I see this example in German (message: "Meinten Sie den Aktionstag … Day?" = "Did you mean the action day … Day?"):

  &eigenname;
  (girl|boy)['’`´‘]s day
  Meinten Sie den Aktionstag \1s’ Day?
  Der Girl's Day findet einmal im Jahr statt. / Der Boy`s Day ist ein Aktionstag.

  (girls|boys)[´`‘] day
  Meinten Sie den Aktionstag \1’ Day?
  &eigenname;
  Der Boys’ Day findet einmal im Jahr statt. / Der Boys` Day ist ein Aktionstag.

  (girls|boys) day
  Meinten Sie den Aktionstag \1’ Day?
  Der Girls Day findet einmal im Jahr statt. / Der Boys Day ist ein Aktionstag.

  (girls|boys)['’`´‘]day
  Meinten Sie den Aktionstag \1’ Day?
  Der Girls'Day findet einmal im Jahr statt. / Der Boys`Day ist ein Aktionstag.
The idea of <regexp> is that it should now be possible to have a single rule instead of many, using something more or less like this:

  <regexp>(girl|boy)s?[´`‘]?s? day</regexp>

I have not used <regexp> yet and I have questions before I use it.

1) How do I highlight only a subset of the match? Trying the above rule, I see this:

  Line 1, column 8, Rule ID: GIRLS_DAY[1]
  Message: Meinten Sie den Aktionstag 'girls’ Day'?
  Suggestion: girls’ Day
  It's a girl's day.
         ^^

But what if I wanted to highlight only the word "girl"? Maybe highlighting the full pattern is OK in the above example, but in other places where I'd like to use <regexp>, I do not want to highlight the full pattern but only part of it, possibly a single word. For example, in those French expressions...

  a nouveau      -> à nouveau
  a plein temps  -> à plein temps
  a rude épreuve -> à rude épreuve
  a vol d'oiseau -> à vol d'oiseau
  ... (etc., more cases in reality...)

... I'm thinking of creating a rule like this:

  <regexp>a (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... but how do I say to highlight/underline only the word "a"? I don't see the equivalent of <marker>. How about something like this?

  <regexp marker="1">(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)</regexp>

... where the marker="1" attribute indicates that the captured group #1 in the regexp should be underlined, i.e. the word "a" in the above example? What to underline could then even be a portion of a word.

2) Is there always an implicit word boundary at the beginning and end of <regexp>? In the German grammar, I see this, for example:

  Elisabeth (Selber|Selberth)\b

Since I see \b at the end, it suggests that there is no implicit \b. But I don't see it at the beginning, so I wonder whether \b is needed there or not. Having an attribute on <regexp> to disable the implicit \b could be useful sometimes; enabling it by default would be best.

3) I see that the German grammar now uses <regexp> in many rules, even for very simple patterns like:

  zu letzt

I wonder whether there is a performance impact. Here, the older way of using <token> still seemed acceptable to me, and is possibly faster (no regexp).
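[Editor's note: a sketch of the proposed marker="1" semantics in Java, a hypothetical attribute that does not exist in LanguageTool at the time of this thread. `Matcher.start(1)`/`end(1)` give exactly the offsets such an attribute would need; the class and method names are made up.]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerGroupDemo {
    // Match the whole expression, but report only the span of capture
    // group 1, so that only the word "a" gets underlined.
    static final Pattern RULE =
        Pattern.compile("\\b(a) (nouveau|plein temps|rude épreuve|vol d['’´`‘]oiseau)");

    public static String underlineGroup1(String sentence) {
        Matcher m = RULE.matcher(sentence);
        if (!m.find()) return sentence;
        // Offsets of group 1 only, not of the full match.
        return sentence.substring(0, m.start(1)) + "«" + m.group(1) + "»"
             + sentence.substring(m.end(1));
    }

    public static void main(String[] args) {
        System.out.println(underlineGroup1("Il revient a nouveau demain."));
        // → Il revient «a» nouveau demain.
    }
}
```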
Keep in mind that regexp matching of long phrases can be slow for some regexps. This depends on the regexp engine. A DFA regexp engine should be O(n), where n is the length of the line, I think, whereas NFA engines can be much, much slower. But DFA engines are typically slower to compile regexps than NFA engines, use more memory, and have more limitations (often, no back references). See https://swtch.com/~rsc/regexp/regexp1.html

Java uses an NFA regexp engine, I think. So this means that regexp matching of long phrases could run into pathological slowdowns for some regexps.
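[Editor's note: a small illustration of the backtracking behaviour described above, using Java's `java.util.regex`, which is indeed a backtracking (NFA-style) engine. Possessive quantifiers (`a++`) are a real Java regex feature and one way to keep a pattern from backtracking.]

```java
import java.util.regex.Pattern;

public class BacktrackingDemo {
    public static void main(String[] args) {
        // "a+a" matches "aaaa" only because the engine backtracks:
        // a+ first consumes all four a's, then gives one back.
        System.out.println(Pattern.matches("a+a", "aaaa"));  // true

        // The possessive quantifier a++ never gives characters back, so the
        // trailing 'a' cannot match and no backtracking happens. On patterns
        // prone to pathological backtracking, this keeps matching fast.
        System.out.println(Pattern.matches("a++a", "aaaa")); // false
    }
}
```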
Re: new <regexp> syntax available
On 2015-10-06 22:02, Jaume Ortolà i Font wrote:
> Thanks, Daniel. It is very useful.
>
> Do you suggest converting all simple rules to this new syntax? Do you
> expect some improvement in performance?

I leave that up to the maintainer of each language. The reason for this feature is that some rules can be expressed more clearly and/or more compactly with the new syntax. I don't expect performance improvements.

Regards
Daniel
Re: new <regexp> syntax available
Thanks, Daniel. It is very useful.

Do you suggest converting all simple rules to this new syntax? Do you expect some improvement in performance?

Regards,
Jaume

2015-10-05 14:59 GMT+02:00 Daniel Naber:
> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
>   <token>foo</token>
>
> you can now use
>
>   <regexp>foo</regexp>
>
> But be aware that this is a real regular expression that ignores tokens,
> so it matches anything with the substring 'foo'. Also, the regular
> expression is case-insensitive by default. You can have a look at the
> German grammar.xml for many examples.
>
> To make use of these, you can adapt and run RuleSimplifier in the dev
> package. It tries to convert simple rules automatically, but it's just a
> hack; the new rules need to be tested and adapted manually. It also only
> touches rules without <marker> elements. There's no <marker> for
> regexp; it's always the complete match that will be underlined. You
> obviously cannot use the regex to access the part-of-speech tags of the
> match. But replacements are also limited, e.g. changing case currently
> doesn't work. By using \1 you can access the first matching group, i.e.
> the first parenthesis group of the regexp, etc.
>
> Please let me know how this works for you.
>
> Regards
> Daniel
Re: new <regexp> syntax available
That's great news, Daniel! A very powerful new feature.

On 05.10.2015 14:59, Daniel Naber wrote:
> Hi,
>
> there's now a first and limited implementation of the <regexp> syntax in
> master. Instead of
>
>   <token>foo</token>
>
> you can now use
>
>   <regexp>foo</regexp>
new <regexp> syntax available
Hi,

there's now a first and limited implementation of the <regexp> syntax in master. Instead of

  <token>foo</token>

you can now use

  <regexp>foo</regexp>

But be aware that this is a real regular expression that ignores tokens, so it matches anything with the substring 'foo'. Also, the regular expression is case-insensitive by default. You can have a look at the German grammar.xml for many examples.

To make use of these, you can adapt and run RuleSimplifier in the dev package. It tries to convert simple rules automatically, but it's just a hack; the new rules need to be tested and adapted manually. It also only touches rules without <marker> elements. There's no <marker> for regexp; it's always the complete match that will be underlined. You obviously cannot use the regex to access the part-of-speech tags of the match. But replacements are also limited, e.g. changing case currently doesn't work. By using \1 you can access the first matching group, i.e. the first parenthesis group of the regexp, etc.

Please let me know how this works for you.

Regards
Daniel
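[Editor's note: a minimal sketch of the two behaviours the announcement describes — case-insensitive matching by default, and \1 accessing the first parenthesis group — expressed with Java's `java.util.regex` (where the replacement syntax writes the group as `$1`). The patterns and class name are illustrative, not LanguageTool internals.]

```java
import java.util.regex.Pattern;

public class RegexpRuleDemo {
    public static void main(String[] args) {
        // "case-insensitive by default": compile the rule's regex with
        // CASE_INSENSITIVE, so 'Foo' and 'FOO' also match.
        Pattern p = Pattern.compile("foo", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        System.out.println(p.matcher("FOO").find()); // true

        // "\1 accesses the first matching group": Java replacement
        // strings write the same group reference as $1.
        Pattern rule = Pattern.compile("(girl|boy)s? day", Pattern.CASE_INSENSITIVE);
        System.out.println(rule.matcher("girls day").replaceAll("$1s’ Day"));
        // → girls’ Day
    }
}
```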