Re: MultiThreadedJLanguageTool

2015-02-16 Thread R.J. Baars
Great performance achievement! > I've pushed a new branch "multithreading" into git. There are 3 > changes right now: > 1) Don't recreate thread pool > 2) Analyze sentences in threads > 3) Optimize some code on main thread (as all coordination goes through > a main thread it is a bottleneck and an

Re: Help with unmunch and Icelandic + Galician

2015-01-13 Thread R.J. Baars
By the way, don't trust Google too much. There are words that are valid, but too infrequent for Google to absorb in their indexes. For Dutch, I found lots of words in documents found using Google, containing words that will not result in Google showing the same document when searching with the word

Re: added.txt activated for most languages

2014-12-23 Thread R.J. Baars
Two options: override or add. Those could be in one file using an indicator, or in two files. It does not matter much. Ruud > On 2014-12-22 22:51, Jaume Ortolà i Font wrote: > >> I use the manual-tagger not only as a way to add new words and tags, >> but also as a means of fixing tags temporarily until

Re: Fwd: [GWA:568] Release of open Dutch Wordnet

2014-12-01 Thread R.J. Baars
Thanks. I was promised to be the first to be informed ;-) I have been waiting for this for about 5 years. Ruud > > Maybe not directly relevant for LT, but interesting... > > Original Message > Subject: [GWA:568] Release of open Dutch Wordnet > Date: 2014-12-01 20:56 > From

Re: Proofing Tool GUI -> Icelandic + Galician

2014-11-15 Thread R.J. Baars
A (difficult) example for the flag long could be the Dutch file, by OpenTaal (largely my work): http://www.opentaal.org/bestanden/doc_download/19-woordenlijst-v-210g-voor-mozilla-producten Ruud > In the AFFIX file, default flag is just 1 char. > When the clause FLAG num is in the file, the flag

Re: Proofing Tool GUI -> Icelandic + Galician

2014-11-15 Thread R.J. Baars
In the AFFIX file, the default flag is just 1 char. When the clause FLAG num is in the file, the flags are numbers in the 2-byte range, from 1 up to 65535, separated by commas (1,2,3,4,555). When the clause FLAG long is in the file, the flags are two chars long, which also translates into 2 bytes int
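The three flag styles described in this message can be illustrated with a small sketch. It is not Hunspell's own code; the function name split_flags and the type labels "short", "num" and "long" are made up for this example, and only the rules stated above (one char per flag, comma-separated numbers from 1 to 65535, two chars per flag) are assumed.

```python
def split_flags(flag_field, flag_type="short"):
    """Split the flag part of a .dic entry (the text after '/') into flags.

    flag_type is a hypothetical label for this sketch:
      "short" - default, one character per flag
      "num"   - FLAG num, comma-separated numbers in the range 1..65535
      "long"  - FLAG long, two characters per flag
    """
    if flag_type == "short":
        return list(flag_field)
    if flag_type == "num":
        flags = [part.strip() for part in flag_field.split(",") if part.strip()]
        for flag in flags:
            # int() raises ValueError on non-numeric input as well
            if not 1 <= int(flag) <= 65535:
                raise ValueError("flag out of 2-byte range: " + flag)
        return flags
    if flag_type == "long":
        if len(flag_field) % 2 != 0:
            raise ValueError("FLAG long needs an even number of characters")
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    raise ValueError("unknown flag type: " + flag_type)


if __name__ == "__main__":
    print(split_flags("AB"))                  # one char per flag
    print(split_flags("1,2,3,4,555", "num"))  # numbers separated by commas
    print(split_flags("AaBb", "long"))        # two chars per flag
```

A tool that always reads one char per flag would misread the "num" and "long" fields, which is consistent with the junk reported elsewhere in this thread.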

Re: Help with unmunch and Icelandic + Galician

2014-11-14 Thread R.J. Baars
Continuation flags can also be used for 'compounding' and have the same issue of possibly having an endless loop. I guess that is why Hunspell is time-limited for every lookup. Ruud > 2014-11-05 10:49 GMT+01:00 R.J. Baars : > >> There will never be a new unmunch that sup

Re: Help with unmunch and Icelandic + Galician

2014-11-05 Thread R.J. Baars
Like I said, Tatoeba is much too small. There will never be a new unmunch that supports all new Hunspell functions, since compounding (or continuation, which is much the same) makes the list unlimited in size. Ruud > On 2014-11-04 13:29, R.J. Baars wrote: > >> I put a scrip

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
I got 2.7 Mb, 229699 lines. Try to download again and give it another try. Ruud > On 2014-11-04 14:10, Adrián Chaves Fernández wrote: > >> I have not read the whole conversation, but for Galician I recently >> needed to unmunch the Hunspell files to generate a Morfologik >> dictionary, and I m

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
>> https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh >> >> A script which I found at: >> >> https://github.com/kscanne/hunspell-gd/blob/master/unmunch.sh >> >> 2014-11-04 13:29 GMT+01:00 R.J. Baars : >> >> Daniel, >>

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
Daniel, I put a script generating Icelandic and the data here: www.taaltik.nl/daniel/ice.zip Read the script ice.sh to see how it works. I might give Galician a try as well. Ruud

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
That suggestion does not work for Icelandic. > > I could upload a result, but you needed it to come from sources like > Tatoeba and Wikipedia. I have no export routines for those, and currently > no time to make them. > > Maybe in a few weeks. > Ruud > >> On 2014

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
I could upload a result, but you needed it to come from sources like Tatoeba and Wikipedia. I have no export routines for those, and currently no time to make them. Maybe in a few weeks. Ruud > On 2014-11-02 11:30, R.J. Baars wrote: > >> The most effective way to generate Icelandic

Re: Question about Spanish language

2014-11-03 Thread R.J. Baars
> Note: > There isn't rr- at the beginning of any word but there are words with l- > or ll- (legar/llegar) > > Is this what you are asking? > > > > 2014-11-03 14:40 GMT+01:00 R.J. Baars : >> It may be a bit off-topic, but does anyone here know the answer to this >

Re: Question about Spanish language

2014-11-03 Thread R.J. Baars
l > (imperative of "salir") + le (pronoun) = salle (!). That's a doubtful > spelling, as "salle" is also a form of verb "sallar". > > LL and CH (but not RR) used to be considered "letters" and appeared as > such > in the Spanish alphabet. T

Re: Question about Spanish language

2014-11-03 Thread R.J. Baars
's a doubtful > spelling, as "salle" is also a form of verb "sallar". > > LL and CH (but not RR) used to be considered "letters" and appeared as > such > in the Spanish alphabet. This practice was abandoned in 1992. > > Regards, > Jaume

Question about Spanish language

2014-11-03 Thread R.J. Baars
It may be a bit off-topic, but does anyone here know the answer to this question? Spanish has the double letters LL and RR. Does that mean that every LL and RR is a double letter, or is it possible these are 2 single characters languagewise? Ruud

Re: Help with unmunch and Icelandic + Galician

2014-11-02 Thread R.J. Baars
Daniel, The most effective way to generate Icelandic is to throw a large word list at Hunspell, since the dictionary supports compounding. Just applying the bag of tricks results in 0.8 MB of words; using a large word list gives 2.8 MB. Quite a difference. Ruud

Re: Applying matched token's POS tag to another matched token

2014-10-31 Thread R.J. Baars
I think this could be done in the disambiguator. Ruud > Currently it's not possible. I have need it too sometimes. > > Regards, > Jaume Ortolà > > > 2014-10-30 17:37 GMT+01:00 Linas Valiukas : > >> Hi there, >> >> LanguageTool seems to provide an ability to apply POS tag of a match to >> a >> w

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread R.J. Baars
part maintained? (By the way, I tried to convert the affix file to single char flags, but there are not enough chars available to convert all flags.) Ruud > On 2014-10-30 15:08, R.J. Baars wrote: > >> My bag of tricks is still running. So there might still be a good res

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread R.J. Baars
Daniel, My bag of tricks is still running. So there might still be a good result after some time. I estimate it to take another week. I noticed Icelandic seems to be a compounding language, at least parts of it. The word list is not at all encoded like that. I am tempted to rearrange the spellch

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
Yes, and FLAG num means any valid number. FLAG long makes it possible to use longer (string) flags. Ruud > On 2014-10-28 12:58, Marco A.G.Pinto wrote: > >> I believe that if I change the code of Proofing Tool GUI to have >> numbers with more than one character I would break other dictionaries >> :

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
numbers into a letter or such. O:-) > > In this case, there isn't much I can do. > > Kind regards from your friend, > >Marco A.G.Pinto >-- > > > On 28/10/2014 11:13, R.J. Baars wrote: >> I edited the .aff so that is

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
I edited the .aff so that it at least no longer crashes. It looks like it has been edited with an editor inserting tabs all over the place. Since tab is a special char to Hunspell, it causes the dump when unmunching. The new aff does not dump, but still adds / to words. Looks like unmunch is not able to proce

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
ong result you get in the extracted list, you can check here > if it is a rule or a tool bug. > > Thanks! > > Kind regards, > >Marco A.G.Pinto > -- > > > > On 27/10/2014 20:46, R.J. Baars wrote: >> The first thing I notic

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
e free time in January. > > Thanks! > > Kind regards from your friend, > >Marco A.G.Pinto > --- > > On 27/10/2014 20:02, R.J. Baars wrote: >> In the output of the tool are also unmunch errors. >> >> Ab0 as the derivative of Abel e

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
The output of the tool also contains unmunch errors, e.g. Ab0 as the derivative of Abel. After exporting and processing into a word list, out of the 2.7 Mb, 2.3 Mb was accepted as a correct word by the same spellchecker. So the 'bag of tricks' might still be useful after unmunching using this tool,

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
The tool seems to work. I will check if it is better than the bag of tricks. Looks very promising. Requires further processing though. Ruud > You have to use V3.0 build 64. From the menu "Dictionary Tools", choose > "Extract wordlist". It worked for me. > > On 27.10.2014 16:38, Daniel N wrote

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Below is the full bag of tricks:
#!/bin/bash
# set the language id (name of hunspell dic without extension)
if [ ! $1 ] ; then
  echo "ENTER THE NAME OF THE DICTIONARY FILE WITHOUT .DIC AS A PARAMETER"
else
  if [ -f $1.dic ] ; then
    if [ -f $1.aff ] ; then
      LANG=$1
      # try to unmunch

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Apart from the trick I am applying now, a good option for more valid output could be to use the words from Wikipedia and Tatoeba as an extra input, if the language is in those databases. Galician grew to > 3 Mb fast enough when Spanish and Portuguese were used as input. These could also be found i

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
n the hunspell stuff, for both Icelandic and Galician. It will take time however, since generating suggestions is slow. I created a simple Bash file to do the entire process as well. If that one generates a workable list, I will supply that as well. Ruud > On 2014-10-27 11:37, R.J. Baa

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
> > I think you will not have to maintain those lists at all. You could just > try to get the sources if they are still being maintained. If it is no > longer maintained, a new maintainer will have a good start by having a > words list and word frequencies, not just Hunspell codings.

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
is no longer maintained, a new maintainer will have a good start by having a words list and word frequencies, not just Hunspell codings. Ruud > On 2014-10-27 10:53, R.J. Baars wrote: > >> I first changed it into utf-8; >> I removed the po: flags >> I changed the tab cha

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
dictionaries editable online, and generate the words lists from that database? Ruud > On 2014-10-27 10:26, R.J. Baars wrote: > >> I was able to make a file though. It is 3 Mb uncompressed. >> >> You can download it from dev.taaltik.nl/is.okay.zip > > Thanks, what was the ex

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
> On 2014-10-27 10:26, R.J. Baars wrote: > >> I was able to make a file though. It is 3 Mb uncompressed. >> >> You can download it from dev.taaltik.nl/is.okay.zip > > Thanks, what was the exact command you used to create this list? Multiple. And manual editing. I

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Icelandic really creates a lot of junk using unmunch, even after removing some newer attributes from the .dic. Looks like unmunch is not capable of using the number flags either. I was able to make a file though. It is 3 Mb uncompressed. You can download it from dev.taaltik.nl/is.okay.zip Ruud

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Unmunch does not support the newer functionalities of Hunspell. It might even generate rubbish. There are ways to do this, more or less. Generating the list using unmunch is still an option, even when it generates rubbish. Add a list of found Icelandic words to that list. Then use hunspell with th

Re: Case sensitivity in MultiWordChunker

2014-10-26 Thread R.J. Baars
What does MultiWordChunker do? > 2014-10-24 20:27 GMT+02:00 Andriy Rysin : > >> Was it by design that MultiWordChunker is case sensitive and we need >> to duplicate most of the lines for lower and upper cases? >> Would it make sense to add a flag setCaseSensitive() to make it >> automatic? >> > >

Re: British English: -IZE/-ISE

2014-10-17 Thread R.J. Baars
The same kind of issues exist in French (reform), Dutch ('green' and 'white' spelling) and Portuguese. Ruud > Hello! > > Since some British people complain that they prefer -ISE, others prefer > -IZE and others both, I was wondering if we could add a setting to LT > regarding that. > > This way, one

Re: svn issue (solved)

2014-10-16 Thread R.J. Baars
Somehow, the issue disappeared spontaneously. Ruud > Somehow I suddenly could not commit anything to LT anymore. > > I tried to check it out again, multiple times, but with no result, except > for this message: > > Filling log cache in background > The PROPFIND response contains invalid XML (207

svn issue

2014-10-16 Thread R.J. Baars
Somehow I suddenly could not commit anything to LT anymore. I tried to check it out again, multiple times, but with no result, except for this message: Filling log cache in background The PROPFIND response contains invalid XML (207 Multi-Status) Filling log cache in background finished. It makes

morfologik speller

2014-10-16 Thread R.J. Baars
In my test environment, I get reports of words not known by the morfologik speller that are apparently quite normal words. Is that because there could be a non-visible character in those words, like a soft hyphen? Ruud
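One way to test the soft-hyphen suspicion is to scan the reported words for invisible format characters. The sketch below is independent of LanguageTool and morfologik; it only assumes that the offending characters fall in Unicode category Cf (format), which covers the soft hyphen (U+00AD) and the zero-width space (U+200B). The function name invisible_chars is made up for this example.

```python
import unicodedata


def invisible_chars(word):
    """Return (index, code point, name) for every format character in word."""
    found = []
    for index, char in enumerate(word):
        # Category "Cf" holds invisible format characters such as SOFT HYPHEN.
        if unicodedata.category(char) == "Cf":
            found.append((index, "U+%04X" % ord(char),
                          unicodedata.name(char, "UNKNOWN")))
    return found


if __name__ == "__main__":
    for word in ["woorden", "woor\u00adden"]:
        print(repr(word), invisible_chars(word))
```

If the reported words show hits here, stripping those characters before the spelling check would be one possible workaround.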

translation

2014-10-16 Thread R.J. Baars
Daniel, when there is no error using the checking form on the Dutch section of the site, the result is: No errors were found. The translation is : Geen aandachtspunten gevonden. Ruud -- Comprehensive Server Monitoring

frequency lists

2014-10-14 Thread R.J. Baars
After conferring a bit more with Daniel, I decided to have my company publish the top 30% of the frequency lists free and open using CC-BY. This should be enough for LT. If you want to add frequencies to the morfologik speller, the frequency list for your language could be in the complete set o

Re: words frequencies

2014-10-13 Thread R.J. Baars
ut written consent of the owner (me). In fact, I would object to any use except for open and free purposes. Is there a license that fits that? Ruud > On 2014-10-14 08:26, R.J. Baars wrote: > >> the 'gaia' format: >> www.spellonit.com/downloads/frequencies/_gaia.xml

Re: words frequencies

2014-10-13 Thread R.J. Baars
The frequency lists are now available. You can find yours here: the 'gaia' format: www.spellonit.com/downloads/frequencies/_gaia.xml.zip the plain csv: www.spellonit.com/downloads/frequencies/_wordfreqs.csv.zip Ruud -

words frequencies

2014-10-13 Thread R.J. Baars
I am currently exporting word frequencies for all languages I have collected over the years. These frequency lists are 'dirty', which means no check has been done on whether the words are correct. That will be handled by the speller anyway. Spell checker maintainers could also use it for input. Th

Re: API now always up-to-date

2014-10-13 Thread R.J. Baars
Great! > Hi, > > I've modified our snapshot creation script so that it automatically > deploys the snapshot as our HTTP API server. This API is also used by > the check on www.languagetool.org, so the website now always uses the > latest snapshot of LT. Updates happen once a day. If tests fail, th

Re: switching from Hunspell to Morfologik

2014-10-11 Thread R.J. Baars
For most of those languages, the frequency files I made are a lot more extensive than those on the gaia site. If you need them, just tell me. I can easily convert my frequency list to the gaia format. Ruud > Hi, > > to provide LT as a 100% pure Java software, I'd like to switch from > Hunspell (

Re: Allow maximum of one match within rule group

2014-10-09 Thread R.J. Baars
You could use rule a as an antipattern for rule b and vice versa. Ruud > Hi, > > I have the following rule group containing two rules to catch the error in > two different situations. > > The problem is sometimes both rules will match, meaning there will be two > error messages which say the same

edited english date rule

2014-10-08 Thread R.J. Baars
I edited the English date rule cluster, uncommenting a rule and adding an antipattern to remove ambiguous cases. Ruud -- Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer Achieve PCI DSS 3.0 Compliant Status w

Re: date checks

2014-10-08 Thread R.J. Baars
> On 2014-10-08 14:34, R.J. Baars wrote: > >> I don't get the drift of this piece of code: >> Apparently it translates the month string into a number. But is it used >> for 'mar' as well as 'march'? > > It's used for anything your rule

date checks

2014-10-08 Thread R.J. Baars
I don't get the drift of this piece of code: Apparently it translates the month string into a number. But is it used for 'mar' as well as 'march'? Why is not the full month and/or abbreviation used? (In Dutch, March is often abbreviated as mar, or mrt.) Wouldn't it be better to use a regexp (maa

Re: improving LT coverage

2014-10-08 Thread R.J. Baars
I am trying to pursue 2 different approaches: 1) getting all valid sentence patterns by using 'explosion' algorithms to replace all kinds of phrases with others; 2) detecting the most used sentence patterns from the corpus by replacing words with just 1 postag by the postag and counting sentence occur
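The second approach can be sketched in a few lines. This is only an illustration, not the code used for the experiment: the lexicon is a hypothetical dict mapping a lowercased word to its set of postags, the tag names are invented, and the {TAG} notation is an assumption for readability.

```python
from collections import Counter


def sentence_pattern(tokens, lexicon):
    """Replace every token that has exactly one postag by that postag."""
    parts = []
    for token in tokens:
        tags = lexicon.get(token.lower(), set())
        if len(tags) == 1:
            parts.append("{%s}" % next(iter(tags)))
        else:
            # Ambiguous or unknown words stay as they are.
            parts.append(token)
    return " ".join(parts)


def count_patterns(sentences, lexicon):
    """Count how often each sentence pattern occurs in a tokenized corpus."""
    return Counter(sentence_pattern(tokens, lexicon) for tokens in sentences)


if __name__ == "__main__":
    # Hypothetical mini lexicon; "is" is given two tags so it stays a literal word.
    lexicon = {"de": {"DTd"}, "hond": {"NN1d"}, "kat": {"NN1d"},
               "groot": {"AJn"}, "is": {"VB3", "XX"}}
    corpus = [["De", "hond", "is", "groot"], ["De", "kat", "is", "groot"]]
    for pattern, count in count_patterns(corpus, lexicon).most_common():
        print(count, pattern)
```

Keeping ambiguous words literal is a design choice of this sketch; a disambiguator could be run first to reduce the number of literal words in the patterns.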

Re: And an issue, for Dutch ...

2014-10-08 Thread R.J. Baars
You mean the one below. That one uses a different class, DMYDateCheckFilter. The only change needed to get it into Dutch too is changing fr into nl in just one place. Could you please do that, Dominique? Ruud

Re: date checks

2014-10-08 Thread R.J. Baars
I don't feel comfortable doing that yet. Ruud > On 2014-10-07 16:56, R.J. Baars wrote: > >> Daniel, one of the date checks was commented out. >> >> I think it could still be of use, if the ambiguous items were removed, >> e.g. using antipattern. > >

And an issue, for Dutch ...

2014-10-07 Thread R.J. Baars
A long time ago, I chose to have the - as a word char, not separating word parts that really belong together. That is now in the way for the date rules, since a normal date in Dutch can also be 15-1-1958. Is there a solution for this issue? Like tokenizing when the dash is within a number? Or get
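The idea of splitting on the dash only when it sits inside a number can be sketched with a regular expression. This is not the Dutch tokenizer of LanguageTool; the pattern and the function name tokenize are assumptions for illustration, and cases such as a trailing dash or mixed letter-digit words are deliberately left out.

```python
import re

# Order matters: runs of digits first, then words that may contain inner
# dashes, then any single remaining non-space symbol (including a dash).
TOKEN_PATTERN = re.compile(r"\d+|[^\W\d]+(?:-[^\W\d]+)*|[^\w\s]")


def tokenize(text):
    """Keep dashes inside words, but split them off between digits."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Geboren op 15-1-1958 in een ouder-kindrelatie."))
```

With this split the date 15-1-1958 becomes five tokens (day, dash, month, dash, year), while a compound like ouder-kindrelatie remains a single token.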

Solution

2014-10-07 Thread R.J. Baars
The ambiguous date rule can indeed be resolved using antipattern: &weekdays; 0{0,1}[1-9]|1[012] / 0{0,1}[1-9]|1[012] / \d\d\d\d

Re: special date stuff in english grammar file

2014-10-07 Thread R.J. Baars
So this is a kind of 'macro'. Good to know that exists. Might come in handy for some types of exceptions. Ruud > On 2014-10-07 18:39, R.J. Baars wrote: > >> Will the entries below be necessary to have the datechecker working? >> >> Ruud >> >>

special date stuff in english grammar file

2014-10-07 Thread R.J. Baars
Will the entries below be necessary to have the datechecker working? Ruud ]>

date checks

2014-10-07 Thread R.J. Baars
Daniel, one of the date checks was commented out. I think it could still be of use, if the ambiguous items were removed, e.g. using antipattern. Somewhat like this? (Was not able to test it yet..) Ruud -

Re: looking for more semantic rules

2014-10-07 Thread R.J. Baars
I know. I will check the existing rules for EN and DE. Probably re-usable. Will get the updated build tomorrow. Ruud > On 2014-10-07 13:51, R.J. Baars wrote: > >> It was not too difficult to translate that. I attached my proposal. >> I am not able however to add it to the

wikicheck Dutch

2014-10-07 Thread R.J. Baars
It is better to disable the rule: OT_EINDE_ZIN_ONVERWACHT [1] And there is still quite a bit of output showing wiki markup. Would it be an idea to: - make a javascript component for the wiki page, interpreting the page layout (which is html then) and checking the texts from that point? Ruud

Re: looking for more semantic rules

2014-10-07 Thread R.J. Baars
About more semantic rules, what about time consistency? About the date check, I have been looking at the code, wanting to make a Dutch version, but there is no locale that fits Netherlands and Belgium; in fact there is none. Is there a way to work around that? Ruud > Hi, > > our new rule that c

Re: Duplicate entries in compounds.txt in ru, nl

2014-10-06 Thread R.J. Baars
Done for Dutch. It is not really a problem, is it? Ruud > Hi > > I've noticed that the Russian and Dutch > "compounds.txt" files contain duplicate entries. > Either the dupes should be removed, or maybe > some of the dupes were meant to be the plural > form or some other flexions. Can the languag

Re: Wikicheck issue

2014-10-06 Thread R.J. Baars
Thanks. I will keep the process in mind for next releases. (There were 60 rules added just this week, which have to be tested for some time to be able to check the false positives...) Ruud > On 2014-10-06 20:04, R.J. Baars wrote: > >> After that 2.7 was released, but as far as I

Re: Wikicheck issue

2014-10-06 Thread R.J. Baars
stuff; I would like to see it corrected. Ruud > On 2014-10-06 18:17, R.J. Baars wrote: > >> I guess the version of LT that was deployed to Wikipedia for Dutch >> contains all rules, including those not planned for release yet. >> >> I think it is better to replace the grammar.xml t

Wikicheck issue

2014-10-06 Thread R.J. Baars
I guess the version of LT that was deployed to Wikipedia for Dutch contains all rules, including those not planned for release yet. I think it is better to replace the grammar.xml there asap with the one from the current production version. Ruud

Morfologik speller

2014-10-03 Thread R.J. Baars
Marcin, would it be possible to use the morfologik speller as a separate program, to throw a list of words at, and get the alternatives? Is there an example program that does that? Ruud

2 tokens in 1 sentence

2014-10-03 Thread R.J. Baars
Is there a more efficient way to detect 2 tokens in one sentence or maybe in a range of tokens? The only way I know now is to make 2 rules, one for worda ... wordb and one for wordb ... worda. Ruud
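Outside the rule syntax, the order-independent check asked about here is simple to express. The sketch below is plain Python, not a LanguageTool feature; the function name both_present and the optional max_gap parameter (maximum distance in tokens) are invented for this illustration.

```python
def both_present(tokens, word_a, word_b, max_gap=None):
    """True if word_a and word_b both occur, in either order.

    max_gap limits how many positions the two tokens may be apart;
    None means anywhere in the sentence.
    """
    positions_a = [i for i, token in enumerate(tokens) if token == word_a]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b]
    return any(
        i != j and (max_gap is None or abs(i - j) <= max_gap)
        for i in positions_a
        for j in positions_b
    )


if __name__ == "__main__":
    sentence = ["x", "worda", "y", "wordb"]
    print(both_present(sentence, "worda", "wordb"))
    print(both_present(sentence, "wordb", "worda"))
    print(both_present(sentence, "worda", "wordb", max_gap=1))
```

One function covers both orders, which is the part that currently takes two pattern rules.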

Compliments!

2014-10-03 Thread R.J. Baars
The suggestion mechanism of the Morfologik speller using word frequencies is WAY better than the suggestion mechanism for Hunspell. In fact, the first suggestion is almost all the time the right one. Well done! Ruud --

Re: unexpected ending of a sentence

2014-10-02 Thread R.J. Baars
I could only make the assumption about cells and headers being rather short... It is worth trying. Thanks. Ruud > On 2014-10-02 at 08:25, R.J. Baars wrote: >> I produced a rule, signaling an unexpected end of a sentence, like a >> sentence not ending with a char like . ! or ?

unexpected ending of a sentence

2014-10-01 Thread R.J. Baars
I produced a rule, signaling an unexpected end of a sentence, like a sentence not ending with a char like . ! or ? But this is quite common to happen inside table cells or in headings. LT is not aware of these things, is it? Has anyone found a way to prevent false alarms in these header or cell c

Phrases

2014-10-01 Thread R.J. Baars
Are phrases still supported and planned to be so for a long time to come? It might be a good way to have (incorrect and correct) phrases to build error-catching sentences from. Ruud

Re: tokenizing numbers

2014-09-30 Thread R.J. Baars
It appears you are thinking of rules quite different from the ones I am thinking of. We will see in time ... Ruud > On 2014-09-24 at 21:03, R.J. Baars wrote: >> Maybe we agree to disagree.. >> >> Having them as one token makes detecting patterns easy using regular >

Re: improved language overview

2014-09-29 Thread R.J. Baars
I would like to have TaalTik added to the contributors for Dutch, www.taaltik.nl Ruud > Hi, > > I made an improvement to our language overview page at > https://languagetool.org/languages: it now displays an activity bar, > based on the number of commits for that language in the last 6 months > (

the experiment

2014-09-28 Thread R.J. Baars
Some time ago I informed you on my experiment getting sentence patterns from the corpus. The current status is that I was able to pinpoint the most common patterns: 8462 {NN1d}. 7316 {DTd} {NN1d} {VB3} {AJn}. 5830 {DTd} {NN1d} is {AJn}. 5710 {AJe} {NN1d}. 5641 De {NN1d} {VB3} {AJn}.

Re: morfologik speller

2014-09-28 Thread R.J. Baars
Maybe an additional idea is to use the edit distance relative to the word size as well (when no frequencies are available). A 2-letter distance in a 4-letter word is very bad, while it is of less significance in a 10-letter word. (I don't know which algorithms are used right now, so I could be sugge
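A relative edit distance of this kind could look as follows. This is a sketch of the idea only, not the algorithm used by the morfologik speller; it assumes plain Levenshtein distance and divides by the length of the longer word, and both function names are made up here.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def relative_distance(word, candidate):
    """Edit distance divided by the length of the longer word (0.0 to 1.0)."""
    longest = max(len(word), len(candidate))
    if longest == 0:
        return 0.0
    return levenshtein(word, candidate) / longest


if __name__ == "__main__":
    # Two edits weigh heavily in a 4-letter word, much less in a 10-letter one.
    print(relative_distance("abcd", "abxy"))
    print(relative_distance("abcdefghij", "abcdefghxy"))
```

Candidates could then be ranked by this ratio, or cut off above a threshold, when no frequency information is available.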

morfologik speller

2014-09-28 Thread R.J. Baars
For the word 'sex', (the most common mistake in Dutch), are suggested: seks; AEX; Bex; Mex; Pex; SEB; SEM; SEN; SEP; SER; Seb; Sef; Sem; Sen; Sep; ex; sax; sec; sekse; set; sexy; Dex; LEX; Lex; PEX; REX; Rex; SEF; SIX; Six; TeX; Tex; seks-; sekst; Şen Context: Dit is sex. The first one is the per

Re: Large amount of rules ...

2014-09-27 Thread R.J. Baars
since words are too different.) The other rules, multi-word ones, will take more time; some words might be acceptable as Dutch, some might not. This will make things a bit more complex. Ruud > 2014-09-27 11:06 GMT+02:00 R.J. Baars : > >> >> It is all about suggesting a Dutch w

Re: Large amount of rules ...

2014-09-27 Thread R.J. Baars
he people wanting to be strict. Ruud > 2014-09-27 11:06 GMT+02:00 R.J. Baars : > >> >> It is all about suggesting a Dutch word for a loanword. >> > > Then why don't you use a simple replace rule (in Java)? You can use the > existing o

languagetool-devel@lists.sourceforge.net

2014-09-27 Thread R.J. Baars
Okay. It is the only char so far encoded that way then. Ruud > On 2014-09-27 10:13, R.J. Baars wrote: > >> How do I get an & as token? It generates an error: >> >> &

Re: Large amount of rules ...

2014-09-27 Thread R.J. Baars
Okay. I will first have to check the results for frequency of getting triggered. They have to be different rules, maybe in a rulegroup. It is all about suggesting a Dutch word for a loanword. Ruud > On 2014-09-26 21:53, R.J. Baars wrote: > >> Will adding 5000 rules lead to prob

languagetool-devel@lists.sourceforge.net

2014-09-27 Thread R.J. Baars
How do I get an & as token? It generates an error: & Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/nl/grammar.xml' at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:130) at org.languagetool.rules.patterns

daily build

2014-09-26 Thread R.J. Baars
Tonight's build works like a charm; no issues with Dutch. So I will not update any file until the release has been done. Ruud

Large amount of rules ...

2014-09-26 Thread R.J. Baars
I received permission, a long time ago, to use a list of loan words for rules. It is a list of almost 5000 loan words from English; it is possible to generate most rules from the file directly. Will adding 5000 rules lead to problems? (Of course I will have to check them for amount of positives;

compoundrule, once again

2014-09-26 Thread R.J. Baars
Daniel, would you consider changing the separating char in compoundrule from - into any non-word char, like ~, = ? It would help me a lot fighting the English disease, writing words apart that should be written together. Since the replacerule does not support spaces before the = (as far as I see)

Re: Just warning rule

2014-09-26 Thread R.J. Baars
Okay, I will make rules and exceptions for all of those words. (When the wrongwordincontext is not effective that is. But it is much easier to detect context with that...) Ruud > On 2014-09-26 10:57, R.J. Baars wrote: > >> Some examples: gent / Gent (bird, city) > > Are there

Just warning rule

2014-09-26 Thread R.J. Baars
There are word confusions where there is no context to go on. I have been checking some words in the wrongwordsincontext, by actually getting the sentences with those words, and comparing their frequencies from my corpus. Some confusions are simply without significant context differences.

Re: problem in ignore.txt ?

2014-09-26 Thread R.J. Baars
Great! > On 2014-09-25 07:54, R.J. Baars wrote: > >> I get the feeling ignore.txt might not be working correctly for the >> Dutch >> Morfologik speller. > > That's right, there was a bug because I renamed the "hunspell" directory > to "

CompoundRule

2014-09-25 Thread R.J. Baars
How do I make the compoundrule suggest ouder-kindrelatie ouder-kind-relatie but not ouderkind-relatie nor ouderkindrelatie ? There are some special rules about the - in Dutch. I can also use the simplereplacerule for cases like this, but I think that is less 'elegant'. Ruud

Re: Bug in generating spelling dictionary?

2014-09-24 Thread R.J. Baars
I might have found the solution for this; a different process was killing languagetool. Sorry to have bothered you. Ruud > Generating the spelling dictionary often goes wrong when run from bash. > It appears to just stop, not giving an error, not giving 'Done'. > (It runs okay from the command

Bug in generating spelling dictionary?

2014-09-24 Thread R.J. Baars
Generating the spelling dictionary often goes wrong when run from bash. It appears to just stop, not giving an error, not giving 'Done'. (It runs okay from the command line, but that is inconvenient because of the strange names and location of the generated files..) the command: java -cp LanguageT

problem in ignore.txt ?

2014-09-24 Thread R.J. Baars
I get the feeling ignore.txt might not be working correctly for the Dutch Morfologik speller. In the list is 'ipv'; still it gets reported as a spelling error by the Morfologik speller. How come? Ruud Start controle in Nederlands... This is the morfologik spelling rule: 1. Regel 1, kolom 1 Melding:

Match in url

2014-09-24 Thread R.J. Baars
Would it be possible to use the matched tokens in the url? I could use that to direct users directly to more info about the word for errors in 'de' and 'het' on a website showing what it should be: http://woordenlijst.org/zoek/?q=molton Ruud --

Re: tokenizing numbers

2014-09-24 Thread R.J. Baars
> form and it'll be different for whole and fractional number endings... >> >> And if many documents treat dot as comma would not it make sense to >> create a rule that catches that and proposes correct format? >> >> Andriy >> >> 2014-09-24 10:53 GMT-04

Re: tokenizing numbers

2014-09-24 Thread R.J. Baars
the > language). > > Andriy > > > 2014-09-24 8:03 GMT-04:00 R.J. Baars : >> Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while >> it >> is one number. >> >> What do you think about changing the tokenizer to treat them as one >

tokenizing numbers

2014-09-24 Thread R.J. Baars
Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while it is one number. What do you think about changing the tokenizer to treat them as one number? This would maybe affect all languages having rules concerning numbers, so this is not the right time, but maybe after releasing 2.7?
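What "one number, one token" could mean in practice is shown in the sketch below. It is not the LanguageTool tokenizer; the regular expression and the function name tokenize are assumptions for illustration. A dot or comma is kept inside a number only when a digit follows, so sentence-final punctuation still becomes its own token.

```python
import re

# A number is a run of digits, optionally followed by groups of
# "dot or comma + digits"; everything else is a word or a single symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping formatted numbers together."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("It costs 1,000.00 or 1.234 euros."))
    print(tokenize("That was in 2014."))
```

The sketch does not decide whether the dot is a decimal or a thousands separator; that stays language dependent and would be up to the rules.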

quoted sentence causing false alarms

2014-09-23 Thread R.J. Baars
Sometimes an entire sentence is quoted inside another sentence, like: ‘Wat is ie groot!’ is een gevleugelde uitspraak. In these cases, there is currently a false alarm flagging 'is' as a sentence start without a capital. Is Dutch the only language having this? There is no real standard for quoting,

Re: reminder: feature freeze for 2.7

2014-09-22 Thread R.J. Baars
I see. You are probably thinking of all those proper names, type numbers etc. ? > On 2014-09-22 16:47, R.J. Baars wrote: > >> Is the intention to activate the spellchecker for Wikipedia now? > > No, that won't happen until we have a clear idea how to avoid false > ala
