Great performance achievement!
> I've pushed a new branch "multithreading" into git. There are 3
> changes right now:
> 1) Don't recreate thread pool
> 2) Analyze sentences in threads
> 3) Optimize some code on main thread (as all coordination goes through
> a main thread it is a bottleneck and an
By the way, don't trust Google too much. There are words that are valid,
but too infrequent for Google to absorb in their indexes.
For Dutch, I found lots of documents through Google that contain words
for which a Google search does not return that same document
when searching with the word
Two options: override or add. Those could be in one file using an indicator,
or in two files.
Does not matter much.
Ruud
> On 2014-12-22 22:51, Jaume Ortolà i Font wrote:
>
>> I use the manual-tagger not only as a way to add new words and tags,
>> but also as a means of fixing tags temporarily until
Thanks. I was promised to be the first to be informed ;-)
I have been waiting for this for about 5 years.
Ruud
>
> Maybe not directly relevant for LT, but interesting...
>
> Original Message
> Subject: [GWA:568] Release of open Dutch Wordnet
> Date: 2014-12-01 20:56
> From
A (difficult) example for the flag long could be the Dutch file, by
OpenTaal (largely my work):
http://www.opentaal.org/bestanden/doc_download/19-woordenlijst-v-210g-voor-mozilla-producten
Ruud
> In the AFFIX file, default flag is just 1 char.
> When the clause FLAG num is in the file, the flag
In the AFFIX file, default flag is just 1 char.
When the clause FLAG num is in the file, the flags are numbers in the
2-byte range, from 1 up to 65535, separated by commas (1,2,3,4,555).
When the clause FLAG long is in the file, the flags are two chars long,
which also translates into 2 bytes int
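To make the three flag notations concrete, here is a minimal Python sketch that splits the flag part of a .dic entry (the text after the slash) according to the FLAG setting. The function name split_flags and the type labels "char", "num" and "long" are only illustrative; this is not Hunspell's own code.

```python
def split_flags(flag_field: str, flag_type: str = "char") -> list[str]:
    """Split the flag field of a .dic entry into individual flags.

    flag_type is a hypothetical label for the FLAG setting in the .aff file:
    "char" (default, one character per flag), "num" or "long".
    """
    if flag_type == "num":
        # FLAG num: comma-separated decimal numbers in the range 1..65535
        flags = flag_field.split(",")
        for flag in flags:
            if not (flag.isascii() and flag.isdigit() and 1 <= int(flag) <= 65535):
                raise ValueError(f"invalid numeric flag: {flag!r}")
        return flags
    if flag_type == "long":
        # FLAG long: every flag is exactly two characters
        if len(flag_field) % 2 != 0:
            raise ValueError("FLAG long needs an even number of characters")
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    if flag_type == "char":
        # default: every single character is a flag
        return list(flag_field)
    raise ValueError(f"unknown flag type: {flag_type!r}")


print(split_flags("1,2,3,4,555", "num"))
print(split_flags("AbCd", "long"))
print(split_flags("ABC"))
```

A converter to single-char flags would need a mapping from each of these flags to a free character, which is exactly where a large affix file can run out of characters.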
Continuation flags can also be used for 'compounding' and have the same
issue of possibly having an endless loop.
I guess that is why Hunspell is time-limited for every lookup.
Ruud
> 2014-11-05 10:49 GMT+01:00 R.J. Baars :
>
>> There will never be a new unmunch that sup
Like I said, Tatoeba is much too small.
There will never be a new unmunch that supports all new Hunspell
functions, since compounding (or continuation, which is much the same)
makes the list unlimited in size.
Ruud
> On 2014-11-04 13:29, R.J. Baars wrote:
>
>> I put a scrip
I got 2.7 Mb, 229699 lines.
Try to download again and give it another try.
Ruud
> On 2014-11-04 14:10, Adrián Chaves Fernández wrote:
>
>> I have not read the whole conversation, but for Galician I recently
>> needed to unmunch the Hunspell files to generate a Morfologik
>> dictionary, and I m
>> https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh
>>
>> A script which I found at:
>>
>> https://github.com/kscanne/hunspell-gd/blob/master/unmunch.sh
>>
>> 2014-11-04 13:29 GMT+01:00 R.J. Baars :
>>
>> Daniel,
>>
Daniel,
I put a script generating Icelandic and the data here:
www.taaltik.nl/daniel/ice.zip
Read the script ice.sh to see how it works.
I might give a try for Galician as well.
Ruud
That suggestion does not work for Icelandic.
>
> I could upload a result, but you needed it to come from sources like
> Tatoeba and Wikipedia. I have no export routines for those, and currently
> no time to make them.
>
> Maybe in a few weeks.
> Ruud
>
>> On 2014
I could upload a result, but you needed it to come from sources like
Tatoeba and Wikipedia. I have no export routines for those, and currently
no time to make them.
Maybe in a few weeks.
Ruud
> On 2014-11-02 11:30, R.J. Baars wrote:
>
>> The most effective way to generate Icelandic
> Note:
> There isn't rr- at the beginning of any word but there is word with l-
> or ll- (legar/llegar)
>
> Is this what are you questioning?
>
>
>
> 2014-11-03 14:40 GMT+01:00 R.J. Baars :
>> It may be a bit off-topic, but does anyone here know the answer to this
>>
l
> (imperative of "salir") + le (pronoun) = salle (!). That's a doubtful
> spelling, as "salle" is also a form of verb "sallar".
>
> LL and CH (but not RR) used to be considered "letters" and appeared as
> such
> in the Spanish alphabet. T
's a doubtful
> spelling, as "salle" is also a form of verb "sallar".
>
> LL and CH (but not RR) used to be considered "letters" and appeared as
> such
> in the Spanish alphabet. This practice was abandoned in 1992.
>
> Regards,
> Jaume
It may be a bit off-topic, but does anyone here know the answer to this
question?
Spanish has the double letters LL and RR. Does that mean that every LL and
RR is a double letter, or is it possible these are 2 single characters
languagewise?
Ruud
---
Daniel,
The most effective way to generate Icelandic is to feed a large word
list to Hunspell, since the dictionary supports compounding.
Just applying the bag of tricks results in 0.8 MB of words; using a large
word list gives 2.8 MB. Quite a difference.
Ruud
I think this could be done in the disambiguator.
Ruud
> Currently it's not possible. I have needed it too sometimes.
>
> Regards,
> Jaume Ortolà
>
>
> 2014-10-30 17:37 GMT+01:00 Linas Valiukas :
>
>> Hi there,
>>
>> LanguageTool seems to provide an ability to apply POS tag of a match to
>> a
>> w
part maintained?
(By the way, I tried to convert the affix file to single char flags,
but there are not enough chars available to convert all flags.)
Ruud
> On 2014-10-30 15:08, R.J. Baars wrote:
>
>> My bag of trick is still running. So there might still be a good res
Daniel,
My bag of tricks is still running. So there might still be a good result
after some time. I estimate it to take another week.
I noticed Icelandic seems to be a compounding language, at least parts of
it. The words list is not at all encoded like that.
I am tempted to rearrange the spellch
Yes, and FLAG num means any valid number.
FLAG long makes it possible to use longer (string) flags.
Ruud
> On 2014-10-28 12:58, Marco A.G.Pinto wrote:
>
>> I believe that if I change the code of Proofing Tool GUI to have
>> numbers with more than one character I would break other dictionaries
>> :
numbers into a letter or such. O:-)
>
> In this case, there isn't much I can do.
>
> Kind regards from your friend,
> >Marco A.G.Pinto
>--
>
>
> On 28/10/2014 11:13, R.J. Baars wrote:
>> I edited the .aff so that is
I edited the .aff so that it at least no longer crashes.
Looks like it has been edited with an editor inserting tabs everywhere. Since
tab is a special char to Hunspell, it causes the dump when unmunching.
The new aff does not dump, but still adds / to words.
Looks like unmunch is not able to proce
ong result you get in the extracted list, you can check here
> if it is a rule or a tool bug.
>
> Thanks!
>
> Kind regards,
> >Marco A.G.Pinto
> --
>
>
>
> On 27/10/2014 20:46, R.J. Baars wrote:
>> The first thing I notic
e free time in January.
>
> Thanks!
>
> Kind regards from your friend,
> >Marco A.G.Pinto
> ---
>
> On 27/10/2014 20:02, R.J. Baars wrote:
>> In the output of the tool are also unmunch errors.
>>
>> Ab0 as the derivative if Abel e
In the output of the tool are also unmunch errors.
E.g. Ab0 as the derivative of Abel.
After exporting and processing into a word list, out of the 2.7 MB, 2.3
MB was accepted as correct words by the same spellchecker.
So the 'bag of tricks' might still be useful after unmunching using this
tool.
The tool seems to work.
I will check if it is better than the bag of tricks. Looks very promising.
Requires further processing though.
Ruud
> You have to use V3.0 build 64. From the menu "Dictionary Tools", choose
> "Extract wordlist". It worked for me.
>
> Am 27.10.2014 16:38, schrieb Daniel N
Below is the full bag of tricks:
#!/bin/bash
# set the language id (name of hunspell dic without extension)
if [ -z "$1" ] ; then
  echo "ENTER THE NAME OF THE DICTIONARY FILE WITHOUT .DIC AS A PARAMETER"
else
  # both the .dic and the .aff file must exist
  if [ -f "$1.dic" ] ; then
    if [ -f "$1.aff" ] ; then
      LANG=$1
      # try to unmunch
Apart from the trick I am applying now, a good option for more valid
output could be to use the words from Wikipedia and Tatoeba as an extra
input, if the language is in those databases.
Galician grew to > 3 Mb fast enough when Spanish and Portuguese were used
as input. These could also be found i
n the hunspell stuff,
for both Icelandic and Galician.
It will take time however, since generating suggestions is slow.
I create a simple Bash file to do the entire process as well.
If that one generates a workable list, I will supply that as well.
Ruud
> On 2014-10-27 11:37, R.J. Baa
>
> I think you will not have to maintain those lists at all. You could just
> try to get the sources if they are still being maintained. If it is no
> longer maintained, a new maintainer will have a good start by having a
> words list and word frequencies, not just Hunspell codings.
is no
longer maintained, a new maintainer will have a good start by having a
words list and word frequencies, not just Hunspell codings.
Ruud
> On 2014-10-27 10:53, R.J. Baars wrote:
>
>> I first changed it into utf-8;
>> I removed the po: flags
>> I changed the tab cha
dictionaries
editable online, and generate the words lists from that database?
Ruud
> On 2014-10-27 10:26, R.J. Baars wrote:
>
>> I was able to make a file though. It is 3 Mb uncompressed.
>>
>> You can download it from dev.taaltik.nl/is.okay.zip
>
> Thanks, what was the ex
> On 2014-10-27 10:26, R.J. Baars wrote:
>
>> I was able to make a file though. It is 3 Mb uncompressed.
>>
>> You can download it from dev.taaltik.nl/is.okay.zip
>
> Thanks, what was the exact command you used to create this list?
Multiple. And manual editing.
I
Icelandic really creates a lot of junk using unmunch, even after removing
some newer attributes from the .dic.
Looks like unmunch is not capable of using the number flags either.
I was able to make a file though. It is 3 Mb uncompressed.
You can download it from dev.taaltik.nl/is.okay.zip
Ruud
Unmunch does not support the newer functionalities of Hunspell. It might
even generate rubbish.
There are ways to do this, more or less.
Generating the list using unmunch is still an option, even when it
generates rubbish. Add a list of found Icelandic words to that list.
Then use hunspell with th
What does MultiWordChunker do?
> 2014-10-24 20:27 GMT+02:00 Andriy Rysin :
>
>> Was it by design that MultiWordChunker is case sensitive and we need
>> to duplicate most of the lines for lower and upper cases?
>> Would it make sense to add a flag setCaseSensitive() to make it
>> automatic?
>>
>
>
The same kind of issues exist in French (reform), Dutch ('green' and 'white'
spelling) and Portuguese.
Ruud
> Hello!
>
> Since some British people complain that they prefer -ISE, others prefer
> -IZE and others both, I was wondering if we could add a setting to LT
> regarding that.
>
> This way, one
Somehow, the issue disappeared spontaneously.
Ruud
> Somehow I suddenly could not commit anything to LT anymore.
>
> I tried to check it out again, multiple times, but with no result, except
> for this message:
>
> Filling log cache in background
> The PROPFIND response contains invalid XML (207
Somehow I suddenly could not commit anything to LT anymore.
I tried to check it out again, multiple times, but with no result, except
for this message:
Filling log cache in background
The PROPFIND response contains invalid XML (207 Multi-Status)
Filling log cache in background finished.
It makes
In my test environment, I get reports of words not known by the morfologik
speller that are apparently quite normal words.
Is that because there could be a non-visible character in those words,
like a soft hyphen?
Ruud
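One way to test the soft-hyphen theory is to scan the reported words for invisible format characters. The sketch below is only an illustration in Python using the standard unicodedata module; the function names are made up, and it assumes that "non-visible" means Unicode category Cf (which covers the soft hyphen U+00AD and the zero width space U+200B).

```python
import unicodedata


def invisible_chars(word: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for every format character (category Cf)."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(word)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(word: str) -> str:
    """Remove all format characters so the word can be looked up again."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Cf")


reported = "woor\u00adden"          # looks like 'woorden' on screen
print(invisible_chars(reported))    # shows where the hidden character sits
print(strip_invisible(reported))
```

If the stripped word is accepted by the speller while the original is not, the hidden character is the cause.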
---
Daniel, when there is no error using the checking form on the Dutch
section of the site, the result is:
No errors were found.
The translation is :
Geen aandachtspunten gevonden.
Ruud
After conferring a bit more with Daniel, I decided to have my company
publish the top 30% of the frequency lists freely and openly under CC-BY.
This should be enough for LT.
If you want to add frequencies to the morfologik speller, the frequency
list for your language could be in the complete set o
ut written consent of the
owner (me). In fact, I would object to any use except for open and free
purposes.
Is there a license that fits that?
Ruud
> On 2014-10-14 08:26, R.J. Baars wrote:
>
>> the 'gaia' format:
>> www.spellonit.com/downloads/frequencies/_gaia.xml
The frequency lists are now available.
You can find yours here:
the 'gaia' format:
www.spellonit.com/downloads/frequencies/_gaia.xml.zip
the plain csv:
www.spellonit.com/downloads/frequencies/_wordfreqs.csv.zip
Ruud
-
I am currently exporting word frequencies for all languages I have
collected over the years.
These frequency lists are 'dirty', which means no check has been done
on whether the words are correct.
That will be handled by the speller anyway. Spell checker maintainers
could also use them as input.
Th
Great!
> Hi,
>
> I've modified our snapshot creation script so that it automatically
> deploys the snapshot as our HTTP API server. This API is also used by
> the check on www.languagetool.org, so the website now always uses the
> latest snapshot of LT. Updates happen once a day. If tests fail, th
For most of those languages, the frequency files I made are a lot more
extensive than those on the gaia site.
If you need them, just tell me. I can easily convert my frequency list to
the gaia format.
Ruud
> Hi,
>
> to provide LT as a 100% pure Java software, I'd like to switch from
> Hunspell (
You could use rule a as an antipattern for rule b and vice versa.
Ruud
> Hi,
>
> I have the following rule group containing two rules to catch the error in
> two different situations.
>
> The problem is sometimes both rules will match, meaning there will be two
> error messages which say the same
I edited the English date rule cluster, uncommenting a rule and adding
an antipattern to remove ambiguous cases.
Ruud
> On 2014-10-08 14:34, R.J. Baars wrote:
>
>> I don't get the drift of this piece of code:
>> Apparently it translates the month string into a number. But is it used
>> for 'mar' as well as 'march'?
>
> It's used for anything your rule
I don't get the drift of this piece of code:
Apparently it translates the month string into a number. But is it used
for 'mar' as well as 'march'?
Why are the full month name and/or its abbreviation not used?
(In Dutch, March is often abbreviated as mar, or mrt.)
Wouldn't it be better to use a regexp (maa
I am trying to pursue 2 different approaches:
1) getting all valid sentence patterns by using 'explosion' algorithms to
replace all kinds of phrases with another;
2) detecting most used sentence patterns from the corpus by replacing
words with just 1 postag by the postag and counting sentence occur
You mean the one below. That one uses a different class DMYDateCheckFilter..
The only change needed to get it into Dutch too is changing fr into nl in
just one place.
Could you please do that, Dominique?
Ruud
I don't feel comfortable doing that yet.
Ruud
> On 2014-10-07 16:56, R.J. Baars wrote:
>
>> Daniel, one of the date checks was commented out.
>>
>> I think it could still be of use, if the ambiguous items were removed,
>> e.g. using antipattern.
>
>
A long time ago, I chose to have the - as a word char, not separating word
parts that really belong together.
That is now in the way for the date rules, since a normal date in Dutch
can also be 15-1-1958.
Is there a solution for this issue? Like tokenizing when the dash is
within a number? Or get
The ambiguous date rule can indeed be resolved using antipattern:
&weekdays;
0{0,1}[1-9]|1[012]
/
0{0,1}[1-9]|1[012]
/
\d\d\d\d
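To illustrate what those token regexes select, here is a small Python sketch. It assumes that each regex has to match a whole token, and it leaves out the leading weekday token for brevity. The function name is hypothetical and this is not LanguageTool code; it only shows that the pattern catches exactly the dates in which both numbers could be a month, so day and month cannot be told apart.

```python
import re

# The same expressions as in the antipattern tokens above
DAY_OR_MONTH = re.compile(r"0{0,1}[1-9]|1[012]")
YEAR = re.compile(r"\d\d\d\d")


def is_ambiguous_date(tokens: list[str]) -> bool:
    """True if tokens look like n/n/yyyy with both n in the range 1..12."""
    if len(tokens) != 5:
        return False
    first, sep1, second, sep2, year = tokens
    return (
        sep1 == "/"
        and sep2 == "/"
        and DAY_OR_MONTH.fullmatch(first) is not None
        and DAY_OR_MONTH.fullmatch(second) is not None
        and YEAR.fullmatch(year) is not None
    )


print(is_ambiguous_date(["03", "/", "04", "/", "2014"]))  # both could be a month
print(is_ambiguous_date(["13", "/", "04", "/", "2014"]))  # 13 can only be a day
```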
So this is a kind of 'macro'..
Good to know that exists. Might come in handy for some type of exceptions..
Ruud
> On 2014-10-07 18:39, R.J. Baars wrote:
>
>> Will the entries below be necessary to have the datechecker working?
>>
>> Ruud
>>
>>
Will the entries below be necessary to have the datechecker working?
Ruud
]>
Daniel, one of the date checks was commented out.
I think it could still be of use, if the ambiguous items were removed,
e.g. using antipattern.
Somewhat like this?
(Was not able to test it yet..)
Ruud
-
I know. I will check the existing rules for EN and DE. Probably re-usable.
Will get the updated build tomorrow.
Ruud
> On 2014-10-07 13:51, R.J. Baars wrote:
>
>> It was not too difficult to translate that. I attached my proposal.
>> I am not able however to add it to the
It is better to disable the rule:
OT_EINDE_ZIN_ONVERWACHT [1]
And there is still quite a bit of output showing wiki markup.
Would it be an idea to:
- make a javascript component for the wiki page, interpreting the page
layout (which is html then) and checking the texts from that point?
Ruud
About more semantic rule, what about time consistency?
About the date check, I have been looking at the code, wanting to make a
Dutch version, but there is no locale that fits Netherlands and Belgium;
in fact there is none.
Is there a way to work around that?
Ruud
> Hi,
>
> our new rule that c
Done for Dutch.
It is not really a problem, is it?
Ruud
> Hi
>
> I've noticed that the Russian and Dutch
> "compounds.txt" files contain duplicate entries.
> Either the dupes should be removed, or maybe
> some of the dupe were meant to be the plural
> form or some other flexions. Can the languag
Thanks. I will keep the process in mind for next releases.
(There were 60 rules added just this week, which have to be tested for
some time to be able to check the false positives...)
Ruud
> On 2014-10-06 20:04, R.J. Baars wrote:
>
>> After that 2.7 was released, but as far as I
stuff; I
would like to see it corrected.
Ruud
> On 2014-10-06 18:17, R.J. Baars wrote:
>
>> I guess the version of LT that was deployed to Wikipedia for Dutch
>> contains all rules, including those not yet planned for release.
>>
>> I think it is better to replace the grammar.xml t
I guess the version of LT that was deployed to Wikipedia for Dutch
contains all rules, including those not yet planned for release.
I think it is better to replace the grammar.xml there asap, from the
current production version.
Ruud
-
Marcin,
would it be possible to use the morfologik speller as a separate program,
to throw a list of words at, and get the alternatives?
Is there an example program that does that?
Ruud
Is there a more efficient way to detect 2 tokens in one sentence or maybe
in a range of tokens?
The only way I know now is to make 2 rules, one for worda ... wordb and
one for wordb ... worda.
Ruud
The suggestion mechanism of the Morfologik speller using word frequencies
is WAY better than the suggestion mechanism for Hunspell.
In fact, the first suggestion is almost all the time the right one.
Well done!
Ruud
--
I could only make the assumption about cells and headers being rather
short...
It is worth trying.
Thanks.
Ruud
> W dniu 2014-10-02 o 08:25, R.J. Baars pisze:
>> I produced a rule, signaling an unexpected end of a sentence, like a
>> sentence not ending with a char like . ! or ?
I produced a rule, signaling an unexpected end of a sentence, like a
sentence not ending with a char like . ! or ?
But this is quite common to happen inside table cells or in headings.
LT is not aware of these things, is it? Has anyone found a way to prevent
false alarms in these header or cell c
Are phrases still supported and planned to be so for a long time to come?
It might be a good way to have (incorrect and correct) phrases to build
error-catching sentences from.
Ruud
It appears you are thinking of rules, quite different than the ones I am
thinking of.
We will see in time ...
Ruud
> W dniu 2014-09-24 o 21:03, R.J. Baars pisze:
>> Maybe we agree to disagree..
>>
>> Having them as one token makes detecting patterns easy using regular
>
I would like to have TaalTik added to the contributors for Dutch,
www.taaltik.nl
Ruud
> Hi,
>
> I made an improvement to our language overview page at
> https://languagetool.org/languages: it now displays an activity bar,
> based on the number of commits for that language in the last 6 months
> (
Some time ago I informed you on my experiment getting sentence patterns
from the corpus.
The current status is that I was able to pinpoint the most common patterns:
8462 {NN1d}.
7316 {DTd} {NN1d} {VB3} {AJn}.
5830 {DTd} {NN1d} is {AJn}.
5710 {AJe} {NN1d}.
5641 De {NN1d} {VB3} {AJn}.
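For readers who want to try the counting step themselves, here is a minimal Python sketch of the idea: every word that has exactly one possible postag is replaced by that tag, ambiguous or unknown words are kept as they are, and the resulting patterns are counted. The toy lexicon, the second tag given to 'is' and the function names are all made up for the illustration; the real lists come from a tagged corpus.

```python
from collections import Counter


def sentence_pattern(tokens: list[str], lexicon: dict[str, set[str]]) -> str:
    """Replace unambiguous words by their postag, keep all other words."""
    parts = []
    for token in tokens:
        tags = lexicon.get(token.lower(), set())
        if len(tags) == 1:
            parts.append("{%s}" % next(iter(tags)))
        else:
            parts.append(token)  # ambiguous or unknown: keep the word itself
    return " ".join(parts)


def count_patterns(sentences, lexicon) -> Counter:
    """Count how often each sentence pattern occurs."""
    return Counter(sentence_pattern(s, lexicon) for s in sentences)


# Toy lexicon (hypothetical); 'is' gets two tags here, so it stays a word
lexicon = {
    "de": {"DTd"},
    "hond": {"NN1d"},
    "kat": {"NN1d"},
    "groot": {"AJn"},
    "is": {"VB3", "XX"},
}
sentences = [["De", "hond", "is", "groot"], ["De", "kat", "is", "groot"]]
for pattern, freq in count_patterns(sentences, lexicon).most_common():
    print(freq, pattern)
```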
Maybe an additional idea is to use the edit distance relative to the word
size as well (when no frequencies are available).
A 2 letter distance in a 4 letter word is very bad, while it is of less
significance in a 10-letter word.
(I don't know which algorithms are in use right now, so I could be sugge
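A minimal Python sketch of the relative edit distance idea, assuming plain Levenshtein distance divided by the length of the longer word. This is just an illustration and says nothing about what the Morfologik speller does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def relative_distance(word: str, candidate: str) -> float:
    """Edit distance divided by the length of the longer word (0.0 .. 1.0)."""
    longest = max(len(word), len(candidate))
    return levenshtein(word, candidate) / longest if longest else 0.0


# Two edits weigh much more in a 4-letter word than in a 10-letter word
print(relative_distance("abcd", "abxy"))              # 0.5
print(relative_distance("abcdefghij", "abcdefghxy"))  # 0.2
```

Suggestions could then be sorted on this relative value, so that a short word with two changed letters ends up lower in the list than a long word with two changed letters.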
For the word 'sex' (the most common mistake in Dutch), the suggestions are:
seks; AEX; Bex; Mex; Pex; SEB; SEM; SEN; SEP; SER; Seb; Sef; Sem; Sen;
Sep; ex; sax; sec; sekse; set; sexy; Dex; LEX; Lex; PEX; REX; Rex; SEF;
SIX; Six; TeX; Tex; seks-; sekst; Şen
Context: Dit is sex.
The first one is the per
since words are too different.)
The other rules, multi-word ones, will take more time; some words might
be acceptable as Dutch, some might not. This will make things a bit more
complex.
Ruud
> 2014-09-27 11:06 GMT+02:00 R.J. Baars :
>
>>
>> It is all about suggesting a Dutch w
he people
wanting to be strict.
Ruud
> 2014-09-27 11:06 GMT+02:00 R.J. Baars :
>
>>
>> It is all about suggesting a Dutch word for a loanword.
>>
>
> Then why don't you use a simple replace rule (in Java)? You can use the
> existing o
Okay. It is the only char so far encoded that way then.
Ruud
> On 2014-09-27 10:13, R.J. Baars wrote:
>
>> How do I get an & as token? It generates an error:
>>
>> &
Okay. I will first have to check the results for frequency of getting
triggered. They have to be different rules, maybe in a rulegroup.
It is all about suggesting a Dutch word for a loanword.
Ruud
> On 2014-09-26 21:53, R.J. Baars wrote:
>
>> Will adding 5000 rules lead to prob
How do I get an & as token? It generates an error:
&
Exception in thread "main" java.io.IOException: Cannot load or parse
'/org/languagetool/rules/nl/grammar.xml'
at
org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:130)
at
org.languagetool.rules.patterns
Tonight's build works like a charm; no issues with Dutch. So I will not update
any file until the release has been done.
Ruud
I received permission, a long time ago, to use a list of loan words for
rules.
It is a list of almost 5000 loan words from English; it is possible to
generate most rules from the file directly.
Will adding 5000 rules lead to problems?
(Of course I will have to check them for amount of positives;
Daniel, would you consider changing the separating char in compoundrule
from - into any non-word char, like ~, = ?
It would help me a lot in fighting the English disease, writing words apart
that should be written together.
Since the replacerule does not support spaces before the = (as far as I
see)
Okay, I will make rules and exceptions for all of those words.
(When the wrongwordincontext is not effective that is. But it is much
easier to detect context with that...)
Ruud
> On 2014-09-26 10:57, R.J. Baars wrote:
>
>> Some examples: gent / Gent (bird, city)
>
> Are there
There are word confusions where there is no context to go on.
I have been checking some words in the wrongwordsincontext, by actually
getting the words in sentences with those words, and comparing their
frequencies from my corpus.
Some confusions are simply without significant context differences.
Great!
> On 2014-09-25 07:54, R.J. Baars wrote:
>
>> I get the feeling ignore.txt might not be working correctly for the
>> Dutch
>> Morfologik speller.
>
> That's right, there was a bug because I renamed the "hunspell" directory
> to "
How do I make the compoundrule suggest
ouder-kindrelatie
ouder-kind-relatie
but not
ouderkind-relatie
nor
ouderkindrelatie
?
There are some special rules about the - in Dutch.
I can also use the simplereplacerule for cases like this, but I think that
is less 'elegant'.
Ruud
--
I might have found the solution for this; a different process was killing
languagetool.
Sorry to have bothered you.
Ruud
> Generating the spelling dictionary often goes wrong when run from bash.
> It appears to just stop, not giving an error, not giving 'Done'.
> (It runs okay from the command
Generating the spelling dictionary often goes wrong when run from bash.
It appears to just stop, not giving an error, not giving 'Done'.
(It runs okay from the command line, but that is inconvenient because of
the strange names and location of the generated files..)
the command:
java -cp LanguageT
I get the feeling ignore.txt might not be working correctly for the Dutch
Morfologik speller.
In the list is 'ipv'; still it gets reported as a spelling error by the
Morfologik speller.
How come?
Ruud
Start controle in Nederlands...
This is the morfologik spelling rule:
1. Regel 1, kolom 1
Melding:
Would it be possible to use the matched tokens in the url?
I could use that to direct users directly to more info about the word for
errors in 'de' and 'het' on a website showing what it should be:
http://woordenlijst.org/zoek/?q=molton
Ruud
--
>> form and it'll be different for whole and fractional number endings...
>>
>> And if many documents treat dot as comma would not it make sense to
>> create a rule that catches that and proposes correct format?
>>
>> Andriy
>>
>> 2014-09-24 10:53 GMT-04
the
> language).
>
> Andriy
>
>
> 2014-09-24 8:03 GMT-04:00 R.J. Baars :
>> Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while
>> it
>> is one number.
>>
>> What do you think about changing the tokenizer to treat them as one
>>
Numbers like 1.234 or 1,000.00 are tokenized into several tokens, while it
is one number.
What do you think about changing the tokenizer to treat them as one
number? This would maybe affect all languages having rules concerning
numbers, so this is not the right time, but maybe after releasing 2.7?
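As an illustration of what "one token per number" could look like, here is a small regex-based sketch in Python. It is not the LanguageTool tokenizer (which is written in Java); it assumes that a number is a run of digits with single dots or commas between digit groups, and the names are made up.

```python
import re

# A number: digits, optionally followed by groups of '.' or ',' plus digits
NUMBER = r"\d+(?:[.,]\d+)*"
# Try the number first, then ordinary words, then single punctuation marks
TOKEN = re.compile(rf"{NUMBER}|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping formatted numbers in one piece."""
    return TOKEN.findall(text)


print(tokenize("Het kost 1,000.00 euro."))
print(tokenize("Er zijn 1.234 woorden, toch?"))
```

Because a separator only counts when a digit follows it, the full stop at the end of a sentence that ends in a number still becomes a token of its own.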
Sometimes an entire sentence is quoted inside another sentence, like:
Wat is ie groot! is een gevleugelde uitspraak.
In these cases, there is currently a false alarm on 'is' being a sentence
start without a capital.
Is Dutch the only language having this?
There is no real standard for quoting,
I see. You are probably thinking of all those proper names, type numbers
etc. ?
> On 2014-09-22 16:47, R.J. Baars wrote:
>
>> Is the intention to activate the spellchecker for Wikipedia now?
>
> No, that won't happen until we have a clear idea how to avoid false
> ala