Re: new language

2015-01-28 Thread R.Baars
An artificial language. Okay.

That explains why it is not in Ethnologue.

Ruud

On 28-01-15 at 15:02, Буковинець wrote:
 Hi R.Baars,

 Wednesday, January 28, 2015, 2:07:14 PM, you wrote:

What is the name of that language in English, and what is its ISO
language code?
Sensar has no ISO code, because it is a new language and
is still in the process of being created.



--
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: new language

2015-01-28 Thread R.Baars
What is the name of that language in English, and what is its ISO
language code?

On 28-01-15 at 12:45, Буковинець wrote:
 Hi Daniel,

 Wednesday, January 28, 2015, 12:59:29 AM, you wrote:

 great, welcome to LanguageTool! Which new language would you like to
 write rules for? You can actually just start writing rules by adding
 them to the grammar.xml file of a different language, as described at
 http://wiki.languagetool.org/development-overview. You won't be able to
 use advanced features like part-of-speech tagging, but it should be good
 enough to get started.
 Regards
Daniel

 I would like to write rules for Сенсар: 
 http://ar25.org/article/sensar-magichna-mova-novoyi-rasy.html
 OK, I will try to add rules to the grammar.xml file of Ukrainian.






Remark on spell check quality

2015-01-21 Thread R.Baars
I have been looking into a lot of spellcheck files (hunspell) lately.

Most of those don't care about case, so most will accept a word like
'th', because the chemical element 'Th' exists (the same goes for all
other element symbols).
When converting a spellcheck file, it could be worth checking whether
those errors carry over to the LT checker as well.
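The effect can be illustrated with a toy lookup (a sketch only; real hunspell checking involves affix rules and more elaborate case handling than this):

```java
import java.util.Set;

// Toy illustration (a sketch, not hunspell itself): a case-insensitive
// lookup accepts "th" merely because the element symbol "Th" (thorium)
// is in the word list, while a case-sensitive lookup rejects it.
public class CaseDemo {
    static final Set<String> DICT = Set.of("Th", "the", "thorium");

    // How a case-ignoring dictionary effectively behaves
    static boolean acceptsIgnoringCase(String word) {
        return DICT.stream().anyMatch(entry -> entry.equalsIgnoreCase(word));
    }

    // Exact-case lookup: only literal entries pass
    static boolean acceptsExact(String word) {
        return DICT.contains(word);
    }
}
```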

Ruud




Re: spell checker enhancement

2014-09-16 Thread R.Baars
I know it will be simple to generate an ignore rule like this, and I
will probably do that, as soon as they pop up in the frequency table.

Ruud

On 16-09-14 at 12:01, Marcin Miłkowski wrote:
 On 2014-09-16 at 11:21, R.J. Baars wrote:
 Marcin,

 We don't agree. There is a spellchecker, but also a single word ignore
 list for it.
 Yes, but for multi-words, we'd have to use the disambiguator code
 internally anyway. You ask for yet another notation of the same thing.

 Notice also that no spell checker will propose Tel Aviv for Aviv.
 You need to have an XML rule for that. A simple one, to be sure, but
 still an XML rule. I think it's pretty trivial to go through a list of
 such words and create parallel lists of ignore-spelling rules for
 disambiguation and missing-part grammar rules.

 Regards,
 Marcin

 There are XML rules, but also a Simplereplace rule, a compounding rule.

 So apart from the hammer and the screwdriver, there are more tools.

 But anyway, adding the most frequent ones to the disambiguator works.

 Getting rid of wrong postags and 10% reported possible spelling errors on
 the entire corpus is a higher priority.
 And fixing false positives. Having almost doubled the number of rules is
 enough for this month.

 Ruud



 On 2014-09-16 at 09:03, R.J. Baars wrote:
 A word like 'Aviv' is not correct unless 'Tel' is before it.
 So it is best to leave Tel and Aviv out of the spell checker.
 That results in spell checking reporting errors for Aviv.

 In the disambiguator, there is the option to block that, by making an
 immunizing rule:

  <!-- Tel Aviv -->
  <rule id="TEL_AVIV" name="Tel Aviv">
    <pattern>
      <token>Tel</token>
      <token>Aviv</token>
    </pattern>
    <disambig action="ignore_spelling"/>
  </rule>

 That works perfectly. But then, there are a lot of these word
 combinations. Wouldn't it be better to have a multi-word ignore list for
 the spell checker?

 (Or even a multi-word spell checker, not just knowing 'correct' and 'not
 in list', but 'correct', 'incorrect' and 'not in list')
 It would not be an enhancement, as this would not give new functionality
 but cripple the existing one. Also, the ability to use all XML syntax is
 extremely important to me (I use POS tags and regular expressions), so I
 wouldn't make use of the multi-word spell checker anyway. So we'd have
 to introduce a crippled syntax that would look a little bit different
 for a human being but with no meaningful functional change. I don't
 think it's worth our time.

 The spell checker is best for checking individual words. Just like a
 hammer, it's good for nails, and not for screws. For screws, we have a
 screwdriver. For multi-word entities, we have more refined tools, like
 tagging and disambiguation and special attributes.

 Best,
 Marcin



Re: spell checker enhancement

2014-09-16 Thread R.Baars
I see. This is probably of no use for spellchecking, but it is for
POS tagging.



Does
Abu Dhabi NPCNG00
cause both words to be tagged with that tag, or are they considered one
token with that POS tag?


(Might come in handy just for this tagging ...)

Ruud

On 16-09-14 at 12:56, Jaume Ortolà i Font wrote:

Hi, Ruud.

I can't find any documentation. It is used in Polish, French, Catalan, 
Russian, Ukrainian and Spanish.


Implementation:

Enable it (Java).
Create a multiwords.txt in your resources folder like these [1]. The 
tokens are separated by white space and the tag is separated by a tab.


Result:

The first token of the multiword is tagged with POSTAG and the 
last token is tagged with /POSTAG.


The MultiwordChunker is case-insensitive. I would like to make it 
configurable, especially for first-letter uppercase.
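For illustration, an entry and its effect might look like this (the NPCNG00 tag is borrowed from the Abu Dhabi example elsewhere in this thread; the comment lines are annotation only, not necessarily valid file syntax):

```text
# Entry: tokens separated by spaces, then a TAB, then the POS tag
Abu Dhabi	NPCNG00

# Effect after disambiguation (tokenization unchanged):
#   "Abu"   gets an added reading with tag  NPCNG00
#   "Dhabi" gets an added reading with tag  /NPCNG00
```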


Regards,
Jaume


[1] 
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/multiwords.txt


https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/ca/src/main/resources/org/languagetool/resource/ca/multiwords.txt

2014-09-16 12:33 GMT+02:00 R.Baars <baar...@xs4all.nl>:


Jaume, thanks, but I am not sure.

Depends on its implementation I think.

Where can I find more info?

Ruud

Op 16-09-14 at 12:26, Jaume Ortolà i Font wrote:

2014-09-16 11:21 GMT+02:00 R.J. Baars <r.j.ba...@xs4all.nl>:

We don't agree. There is a spellchecker, but also a single
word ignore
list for it.
There are XML rules, but also a Simplereplace rule, a
compounding rule.

So apart from the hammer and the screwdriver, there are more
tools.


There is indeed another tool for multi-words. It seems that Ruud
doesn't know it.

We can enable a HybridDisambiguator and add a MultiwordChunker to
the disambiguation. With this you can write a list of multi-words
with their corresponding tags in a plain text file (multiwords.txt).

I use the MultiwordChunker with two objectives: improve
disambiguation and avoid spelling matches in multiwords.

Would it be useful for you, Ruud?

Regards,
Jaume




Re: spell checker enhancement

2014-09-16 Thread R.Baars

How is that done?

Ruud


On 16-09-14 at 13:23, Jaume Ortolà i Font wrote:
2014-09-16 13:03 GMT+02:00 R.Baars <baar...@xs4all.nl>:


I see. This is probably of no use for spellchecking, but it is for
postagging.


It gives no suggestions, but it can be used to avoid false 
positives in spellchecking, if you configure the speller to ignore 
tagged words.



Does
Abu Dhabi NPCNG00
cause both words to be tagged with that tag, or are they
considered 1 token with that postag?


Tokenization is not changed. In this case:

<token postag="NPCNG00">Abu</token>
<token postag="/NPCNG00">Dhabi</token>

If there are more than two tokens, the inside tokens are not tagged. 
Perhaps this should be optionally changed (i.e., tag the inside tokens too).


Regards,
Jaume



Re: spell checker enhancement

2014-09-16 Thread R.Baars

Okay, thanks. Good to know this.
This is, however, not the time to do that; currently, there is a lot more 
to do to improve the quality of what is already in the Dutch LT.
I will keep it in mind for later, when I have a clearer view of the 
remaining spelling issues.
(Currently, 10% of the sentences collected from internet sources have at 
least one spelling error.)


Ruud

On 16-09-14 at 15:31, Jaume Ortolà i Font wrote:
2014-09-16 14:43 GMT+02:00 R.Baars <baar...@xs4all.nl>:


How is that done?

Ruud


Do you mean ignoring tagged words in spellchecking (even if they are 
not in the dictionary)? It's a configurable option of the speller (at 
least in the Morfologik speller rule). A line of Java code.
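The idea behind that speller option can be sketched in isolation like this (class and method names are invented for illustration, not LT's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the concept: skip any token that the tagger (e.g. via
// multiwords.txt) has already given a POS tag, so parts of multiwords
// like "Aviv" stop producing false spelling alarms.
public class TaggedWordSkippingSpeller {
    static final Set<String> DICT = Set.of("we", "visit", "often");

    // posTags maps a token to its POS tag; absent/null means untagged
    static List<String> findMisspellings(List<String> tokens,
                                         Map<String, String> posTags) {
        List<String> errors = new ArrayList<>();
        for (String t : tokens) {
            if (posTags.get(t) != null) {
                continue;                      // tagged word -> ignore
            }
            if (!DICT.contains(t.toLowerCase())) {
                errors.add(t);                 // unknown and untagged
            }
        }
        return errors;
    }
}
```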


Jaume




Re: simplify GenericUnpairedBracketsRule implementation

2014-09-15 Thread R.Baars
I think it would be great to be able to make multi-sentence rules in 
XML ...
I know that is not what is suggested here, but nevertheless ...

Ruud

On 15-09-14 at 13:00, Daniel Naber wrote:
 Hi,

 GenericUnpairedBracketsRule detects quotes that do not get closed etc.
 So what it does isn't overly complicated, but its implementation is a
 bit convoluted. I think that's mostly because a quote can be opened in
 one sentence and closed in another one and that's not an error. But our
 rules only get single sentences.

 GenericUnpairedBracketsRule calls Rule.setAsDeleted(), a method that is
 only used from GenericUnpairedBracketsRule. Rule again has a field
 ListRuleMatch removedMatches, which adds state to the Rule object,
 which should just be a static rule that doesn't change once created.
 There are more methods like Rule.isInRemoved() which only seem to be
 used for GenericUnpairedBracketsRule.

 What about adding a new type of rule that doesn't get a single sentence,
 but the complete text (as a List of AnalyzedSentence objects)? This way
 the rule can iterate over the complete text and doesn't need to keep its
 state between sentences. It won't need to implement reset() either.

 I haven't tried yet whether this works as expected, I wanted to ask for
 your opinion first.
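A sketch of what such a whole-text rule buys (names invented, plain strings instead of AnalyzedSentence objects): counting over all sentences means a bracket closed in a later sentence is correctly not an error, and no state needs to be kept or reset between sentences:

```java
import java.util.List;

// Sketch: a rule that sees the whole text can count parentheses across
// sentence boundaries, so a bracket opened in one sentence and closed in
// another is not flagged, without any per-sentence state.
public class TextLevelBracketCheck {

    // Returns how many '(' are never closed anywhere in the text
    static int unclosedParens(List<String> sentences) {
        int depth = 0;
        for (String sentence : sentences) {
            for (char c : sentence.toCharArray()) {
                if (c == '(') {
                    depth++;
                } else if (c == ')' && depth > 0) {
                    depth--;
                }
            }
        }
        return depth;
    }
}
```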

 Regards
Daniel




Re: Dutch WikiCheck

2014-09-15 Thread R.Baars
My problem is that the checks generate enormous numbers of errors 
wherever wiki markup is encountered, especially things like name= attributes.

It is not for me, but for any Wikipedia user checking pages.

Maybe a built-in Parsoid-like routine?

What exactly do we check? Is it enough when all wiki markup is hidden 
inside a kind of tag?
Ruud

On 15-09-14 at 13:16, Daniel Naber wrote:
 On 2014-09-15 10:50, R.J. Baars wrote:

 How can I improve LT specifically for Wikipedia?
 I would like to remove all false positives, caused by the Wiki markup.
 Here's what I think is the proper solution (for the use case of checking
 the recent changes feed):

 -send the old version of the page to Parsoid (Parsoid is a system that
 would do the parsing for us, turning Wikipedia markup into something we
 can actually parse - see https://www.mediawiki.org/wiki/Parsoid)
 -send the new version of the page to Parsoid
 -make an XML diff to see where the changes are
 -run the LT on the text of the paragraphs that have been changed

 The problem with this is that a) it needs considerable development
 effort and b) it makes the checking slower and less robust due to the
 additional http requests that it requires.

 Working on the rules (grammar.xml/disambiguation.xml) to prevent false
 alarms caused by Wikipedia markup is a hack, I wouldn't recommend that.

 I can adjust rules, but how do I get them there to see the results?
 You could use this (available from
 https://languagetool.org/download/snapshots/):
 java -jar languagetool-wikipedia.jar wiki-check
 http://de.wikipedia.org/wiki/Bielefeld

 Regards
Daniel




include xml

2014-08-23 Thread R.Baars
I am quite sure that I read somewhere that grammar.xml can include 
another XML file, but I cannot find the instructions again.
Can someone help me on this?

Ruud



Re: Dump

2014-05-27 Thread R.Baars
I did so. I will have to wait some time until the process moves on to 
another input file, but I will keep you informed.

Ruud

On 27-05-14 at 11:06, Marcin Miłkowski wrote:
 Hi,

 maybe it was because of a simple mistake in the isNumberOrDot() method.
 I fixed it, so today's build should run fine. Could you download the
 nightly and see whether you get crashes on your data?

 Best,
 Marcin
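Judging from the stack trace quoted below, the crash is String.charAt(0) on an empty string; the kind of guard such a fix needs might look like this (a sketch only, not the actual LT code):

```java
// Sketch of a guarded first-character check: calling charAt(0) on an
// empty token throws StringIndexOutOfBoundsException, so the length
// must be tested first.
public class NumberOrDotCheck {
    static boolean isNumberOrDot(String token) {
        if (token == null || token.isEmpty()) {
            return false;                      // the missing guard
        }
        char c = token.charAt(0);
        return c == '.' || Character.isDigit(c);
    }
}
```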

 On 2014-05-27 at 09:20, R.J. Baars wrote:
 Hi.

 I am currently using languagetool-commandline checking billions of
 paragraphs.

 It works fine, except for some dumps, like the ones below.

 I need the tool to continue, for I need the data. Once it has been
 processed, I might try to find the items it crashes on.

 It looks like it is all string handling. Could it be crashing on UTF-8
 encoding errors?

 Ruud

java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
   at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
   at
 org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
   at org.languagetool.commandline.Main.handleLine(Main.java:386)
   at
 org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
   at org.languagetool.commandline.Main.runOnFile(Main.java:166)
   at org.languagetool.commandline.Main.main(Main.java:519)
 Caused by: java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   at java.util.concurrent.FutureTask.get(FutureTask.java:188)
   at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:98)
   ... 8 more
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
 range: 0
   at java.lang.String.charAt(String.java:658)
   at
 org.languagetool.rules.CommaWhitespaceRule.isNumberOrDot(CommaWhitespaceRule.java:130)
   at
 org.languagetool.rules.CommaWhitespaceRule.match(CommaWhitespaceRule.java:92)
   at
 org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:686)
   at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:995)
   at
 org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:962)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Exception in thread main java.lang.RuntimeException:
 java.util.concurrent.ExecutionException:
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
   at
 org.languagetool.MultiThreadedJLanguageTool.performCheck(MultiThreadedJLanguageTool.java:101)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:576)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:534)
   at org.languagetool.JLanguageTool.check(JLanguageTool.java:530)
   at
 org.languagetool.commandline.CommandLineTools.checkText(CommandLineTools.java:96)
   at org.languagetool.commandline.Main.handleLine(Main.java:386)
   at
 org.languagetool.commandline.Main.runOnFileLineByLine(Main.java:286)
   at org.languagetool.commandline.Main.runOnFile(Main.java:166)
   at org.languagetool.commandline.Main.main(Main.java:519)

Re: homophone detection

2014-05-07 Thread R.Baars
Good work has been done using this and more sophisticated tools at the 
universities of Nijmegen and Tilburg, by A. van den Bosch et al.

Their tools are also fully open source.

These tools became public as 'valkuil.net' and 'fowl.net'. They require 
quite a heavy server.

In case you are interested, Prof. v d Bosch is one of my Linked-In-contacts:

https://www.linkedin.com/profile/view?id=5639559authType=OUT_OF_NETWORKauthToken=4n9Zlocale=en_USsrchid=525963551399472523579srchindex=1srchtotal=3524trk=vsrp_people_res_nametrkInfo=VSRPsearchId%3A525963551399472523579%2CVSRPtargetId%3A5639559%2CVSRPcmpt%3Aprimary

Ruud


On 07-05-14 at 16:16, Daniel Naber wrote:
 Hi,

 as you may know, After the Deadline is an Open Source text checker,
 quite similar to LT. It's not maintained anymore, so why not use some of
 its ideas in LT? A paper describing AtD is available at [1], it's
 well-written and provides a good overview of AtD.

 One interesting idea is to detect wrong words based on statistics. AtD
 has a (manually created) set of words that can be easily confused. If
 such a word is found in a text, the probability of that word in its
 context is calculated and compared to the probability of the similar
 words in the same context. If the word from the text is less probable,
 an error is assumed, and a more probable word is suggested.

 If this approach works, it's easier than writing rules: just add a set
 of easily confused words like adapt, adopt to a file, and the rest
 will happen automatically. What you need though is a huge corpus to
 calculate the probabilities. The Google n-gram corpus[2] might be used
 for that.
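The core comparison can be sketched in a few lines (toy counts stand in for a real n-gram corpus; a practical version would need smoothing and a confidence threshold):

```java
import java.util.Map;

// Toy sketch of the AtD-style check: given corpus counts for
// "context + word", prefer the confusion-set alternative whose count in
// this context is higher than that of the word actually used.
public class ConfusionCheck {

    // ngramCounts maps phrases like "decided to adopt" to corpus counts
    static String betterChoice(String context, String used, String alternative,
                               Map<String, Long> ngramCounts) {
        long usedCount = ngramCounts.getOrDefault(context + " " + used, 0L);
        long altCount  = ngramCounts.getOrDefault(context + " " + alternative, 0L);
        return altCount > usedCount ? alternative : used;  // keep the likelier word
    }
}
```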

 AtD has been evaluated against a dyslexia corpus[3] with a recall of
 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
 only 19% recall, and that only considers if an error was detected, not
 if the correction was correct. So there's clearly something to gain for
 LT here.

 I have checked in some prototypical work for a statistical homophone
 rule in LT into the new 'confusion-rule' branch.

 Regards
Daniel

 [1] http://aclweb.org/anthology-new/W/W10/W10-0404.pdf
 [2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
 [3] http://www.dcs.bbk.ac.uk/~jenny/resources.html


 --
 Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
 • 3 signs your SCM is hindering your productivity
 • Requirements for releasing software faster
 • Expert tips and advice for migrating your SCM now
 http://p.sf.net/sfu/perforce
 ___
 Languagetool-devel mailing list
 Languagetool-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/languagetool-devel




Re: Wiki

2014-01-10 Thread R.Baars
Yes, that is what I meant, thanks.

I am able to log in, but not to edit.

After clicking on 'edit', I get a permission error: "Sorry, you cannot
edit this page. Only members ..."


Ruud

On 10-01-14 17:19, Daniel Naber wrote:
 On 2014-01-09 16:26, R.J. Baars wrote:

 Could someone please remove this text from the wiki?

 Interactive testing of rules using a corpus
 I have removed that paragraph (I guess that's what you meant).

 My account does not allow edit.
 It should. Can you log in? What's the error message you get?

 Regards
Daniel



 --
 CenturyLink Cloud: The Leader in Enterprise Cloud Services.
 Learn Why More Businesses Are Choosing CenturyLink Cloud For
 Critical Workloads, Development Environments & Everything In Between.
 Get a Quote or Start a Free Trial Today.
 http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
 ___
 Languagetool-devel mailing list
 Languagetool-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/languagetool-devel




Re: showing example sentences in GUI

2013-12-02 Thread R.Baars
Aha, that makes sense ...

Ruud

On 02-12-13 13:19, Daniel Naber wrote:
 On 2013-12-02 13:10, R.J. Baars wrote:

 Why show sentences with errors to people that need help getting it
 right?
 It is not an objection, more a question of: is there a reason from the
 user perspective?
 Usually the sentences come in pairs: an incorrect sentence, and the same
 sentence with the error corrected. The reason I'm asking is that I think
 some of the error descriptions are not trivial to understand, so an
 example might help.

 Regards
Daniel



--
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java, .NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: planning LT homepage relaunch

2013-10-28 Thread R.Baars
Idea: map the languages on a globe (might be hard for languages that
share countries). Or just group them by continent?

On 28-10-13 19:40, Marcin Miłkowski wrote:
 On 2013-10-28 16:57, Daniel Naber wrote:
 On 2013-10-23 20:08, Daniel Naber wrote:

 I've planned for long to relaunch the LT homepage with a better design.
 Now is the chance to finally do that. These are my *personal* ideas
 about the future of the LanguageTool homepage. I'm looking for
 feedback.
 So here's a more detailed concept. This can be used for further
 discussion and once we think that nothing important is missing, this is
 also what I will send to the designer. I'll leave out the idea of using
 a CMS for now because of two reasons: I don't know Joomla or another
 CMS, so it would be another thing to care about, additionally to the
 design and structure, and it's not trivial to use it technically: we
 cannot just switch the DNS for languagetool.org to point to some
 external CMS host because languagetool.org also runs our API.


 Elements we need on every page:

 -Logo and page title or a claim (on the homepage)
 -link to twitter, facebook, and imprint
 We could include the twitter feed to have announcements and news on the
 homepage, and make it easy for others to retweet.

 -field for subscription to announcement mailing list
 -site navigation


 For site navigation, I suggest these items with sub items:

 -Languages
 * Overview (what is today http://languagetool.org/languages/)
 * All 29 or so languages. This is a lot... some structure would be
 nice.
 That may be problematic because people who speak a language might be
 discouraged if they find, for example, linguistic groups...

 -Download
 * LO/OO
 * LanguageToolFx
 * Plugins - just links to external pages (now on
 http://languagetool.org/links/)

 -Support
 * FAQ - what's now the list of common problems,
 http://languagetool.org/issues/
 * Forum - external, not sure if it can be integrated nicely
 * Contact

 -Development
 * several links to the Wiki
 * Mailing List
 * Bug Reports
 * WikiCheck

 What's not covered is the 'screen shots' page and the 'usage' page. I
 think screen shots can be put on the download pages, and usage
 information is already mostly covered by the popup that opens on
 download.


 Homepage:

 The homepage contains an intro sentence and a text form for trying LT.
 It is pre-filled with an example text in the user's native language
 (assuming his browser language is just that). If we don't support that,
 we fall back to English. The text form doesn't have a drop-down anymore
 as today for selecting the language, as the language can be switched in
 the navigation. If the language currently selected has a detail page
 (like http://languagetool.org/fr/), there's a big link below the text
 box, "Learn more about LanguageTool's support for <language>", that leads
 to that page (any better idea to integrate those language pages?).
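The fallback logic described above (use the browser's language if we support it, otherwise English) could look roughly like this; the supported-language set and the header parsing are simplified assumptions, and q-values are ignored:

```python
SUPPORTED = {"en", "de", "fr", "nl", "pl"}  # subset, for illustration

def pick_homepage_language(accept_language_header, default="en"):
    """Pick the first supported language from an Accept-Language
    header like 'de-DE,de;q=0.9,en;q=0.8' (q-values ignored here)."""
    for part in accept_language_header.split(","):
        # 'de-DE;q=0.9' -> 'de': drop the quality value and the region.
        lang = part.split(";")[0].strip().split("-")[0].lower()
        if lang in SUPPORTED:
            return lang
    return default

print(pick_homepage_language("de-DE,de;q=0.9,en;q=0.8"))  # → de
print(pick_homepage_language("pt-BR"))                    # → en (fallback)
```

A production version would honor the q-values and region subtags, but the same idea applies: pre-fill the form in the first language both sides support.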


 Visual Design:

 Modern, but not too stylish. I think http://git-scm.com is a nice
 example of a modern design.


 What do you think? Anything that is missing?
 I like these ideas.

 Regards,
 marcin


 Regards
 Daniel


 --
 October Webinars: Code for Performance
 Free Intel webinars can help you accelerate application performance.
 Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
 the latest Intel processors and coprocessors. See abstracts and register 
 http://pubads.g.doubleclick.net/gampad/clk?id=60135991iu=/4140/ostg.clktrk
 ___
 Languagetool-devel mailing list
 Languagetool-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/languagetool-devel




Re: en bloc

2013-10-27 Thread R.Baars
That is French. Why not remove the noun tags for this word group in the
disambiguator?
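A disambiguation rule along those lines might look roughly like this; the element and attribute names are written from memory and should be checked against the disambiguator documentation before use, so treat it as a sketch:

```xml
<!-- Sketch: treat "en bloc" as one adverbial unit by removing the
     noun readings the tagger assigns to its parts. Element and
     attribute names are assumptions, not verified against the
     current disambiguation.xml schema. -->
<rule id="EN_BLOC" name="'en bloc' is not two nouns">
  <pattern>
    <token>en</token>
    <token>bloc</token>
  </pattern>
  <disambig action="remove"><wd pos="NN"/></disambig>
</rule>
```

Fixing it there would silence THREE_NN and any other rule that currently sees two nouns.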

On 27-10-13 07:08, Kumara Bhikkhu wrote:
"They walked out of the _hall en bloc_." is flagged by rule
id="THREE_NN" name="Readability: Three nouns in a row".


Not sure what's the best way to fix this. Using an exception would solve
this one, but problems would remain for other rules if LT still regards
"en bloc" as two nouns.


kb






Re: planning LT homepage relaunch

2013-10-24 Thread R.Baars
Good free templates as well as glossaries are available using a CMS like
Joomla. It also supports multiple languages and switching between them. And it
supports contributions by multiple people.

I could talk to my Joomla provider about offering you this service
from the server I am already paying for, if you would appreciate it.

Ruud

On 24-10-13 09:32, Mike Unwalla wrote:
 A clearer website would be good. Sometimes, I struggle to remember where to
 find information. Information is fragmented in many different locations.

 You do not necessarily have to pay a web designer. Many designs are free.
 Refer to http://www.csszengarden.com. If you use a web designer, make sure
 that the web designer conforms to best practice (W3C standards for CSS,
 accessibility, internationalization). Many web designers know only about
 making a website look pretty.

 I like the idea of giving all languages equal status. English is just
 another option. I did not know that there are LT web pages in languages
 other than English.

 A search page would be useful. (For the TechScribe website, I use the free
 search tools from www.master.com.)

 When I first started to use LT, I struggled with much of the technical
 terminology. Both the LT terminology and the third-party terminology were
 not clear to me. A glossary would be useful. (I volunteer to develop a
 glossary.)

 Regards,

 Mike Unwalla
 Contact: www.techscribe.co.uk/techw/contact.htm


 -Original Message-
 From: Daniel Naber [mailto:list2...@danielnaber.de]
 Sent: 23 October 2013 19:09
 To: LanguageTool Developer List
 Subject: planning LT homepage relaunch

 Hi,

 I've planned for long to relaunch the LT homepage with a better design.
 Now is the chance to finally do that. These are my *personal* ideas
 about the future of the LanguageTool homepage. I'm looking for feedback.

 Our current homepage (http://languagetool.org) suffers from the
 following problems:

 -looks overcrowded
 -no nice, modern design
 -all non-English languages are just second-class citizens, for example
 they cannot easily be reached through the navigation bar
 -English doesn't have a language-specific page, with room for examples

 Also, but less important in my opinion:

 -it's not always clear what content belongs to the homepage,
 community.languagetool.org, and the Wiki
 -homepage, community.languagetool.org, and the Wiki have different
 layout styles
 -it makes use of a do-it-yourself PHP template system

 Proposed solution:

 -looks overcrowded:
-plan a navigation bar with only a few main items: Download, Languages,
 Support, Developer
-move everything from the homepage to sub pages except the try it
 online form and the navigation

 -no nice design:
-pay a professional designer to make a new design
-make a modern design but not so modern and stylish that it looks
 outdated in a year again

 -all non-English languages are just second-class citizens:
-translate the cleaned-up homepage to all languages we support and make
 sure the user is automatically redirected to the homepage that matches
 their browser's default language (which is probably their native
 language)
-use Transifex for translations of short texts
-make all languages reachable through the navigation bar
-keep the current language sub pages like http://languagetool.org/de/ -
 these are not direct translations from English but often contain more
 information

 -English doesn't have a language-specific page, with room for examples:
-make an English sub page just like for other languages

 As you can see, this only addresses the main issues, not what I list
 above as 'less important' issues. Please let me know what you think.

 Regards
Daniel


