[Languagetool] presentation

2012-09-02 Thread Mauro Condarelli
Hi,
I'm interested in using LT as an advanced spell-checker for a document
editor I'm building (eclipse RCP).
I had a few mail exchanges with Daniel Naber in the forum and he
redirected me here.

I'm a fairly seasoned programmer and I have a good command of Italian
and, to a lesser degree, English.
As a programmer I have a good grounding in C, C++ and Java, ranging from
Linux kernel hacking to GUI building (mostly Eclipse RCP and Qt).

I am Italian and my focus is primarily on the Italian language.

I am building a plugin providing ISpellingEngine and related classes,
compatible with all eclipse installations.

To this end I've found LT to be very promising, if somewhat immature (as
may be expected).
Problems I've found are:

 1. Italian rules raise way too many false positives, especially the
tense concordance rule (GR_10_001). I am willing to help refine the
rules.
 2. The spell checker does not offer any way to implement "ignore word"
and "add to user dictionary", which is essential for interactive use.
I know Daniel is working on a kind of ignore list using a file deep in
the file hierarchy; this doesn't help interactive usage and IMHO it
is not a solution for any serious use case. I plan to support three
independent dictionaries per language in my application:
standard, user, document.
 3. LT doesn't really understand Unicode (prevents usage on
htLaTeX-generated docs):
 1. it does not understand ligatures.
 2. it does not understand special apostrophe.
 3. it does not understand other special chars.
 4. LT's understanding of XML is very limited, as it does not understand
&xxx; constructs (which prevents usage on my document source).
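Problem 4 essentially amounts to decoding a handful of predefined XML entities before checking. A minimal sketch of the idea (illustrative only; a real implementation would also need numeric references such as &#8217; and would have to keep offsets in sync so error positions map back to the source):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Decode the five predefined XML entities before passing text to the checker.
class EntityDecoder {
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&apos;", "'");
        ENTITIES.put("&amp;", "&");  // decoded last, so "&amp;lt;" yields "&lt;", not "<"
    }

    static String decode(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }
}
```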

I didn't (yet) dig deep into LT's code, but I have some ideas I wish to share.
To overcome the above difficulties I propose the following
actions (but I'm open to other suggestions, of course):

 1. I can help refine the rules for Italian, as a tester at the
beginning, and more actively if and when I learn how to write
efficient rules.
 2. Give a configurable chance to use a different engine for spell
checking. The current means are really not useful for interactive
use, and changing/upgrading dictionaries is not very straightforward.
I propose specifically to:
 1. define a standard interface between LT and the underlying
spellchecker.
 2. provide at least two spellcheckers (fsa and hunspell).
 3. make it possible to choose at runtime (via Preference Page).
 4. decouple checking from suggestion generation; i.e. split the
current "check(document)" function into "review(document)" and
"suggest(word)". This would speed up hunspell verification enough to
make it usable for interactive use, and could be trivially
implemented for fsa by doing everything in one step (as it
currently does) and returning the data in two separate steps.
 5. provide a thin interface layer to control the underlying
spellchecker, allowing things such as (multiple) dictionary selection
and dictionary maintenance (add word). If the underlying engine is
unable to perform the operation, the wrapper can simply return an
error, letting the caller decide what to do.
 3. Use transliteration to remove all characters not actually present in
the dictionary (e.g. see:
http://unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml)
using either ICU or GNU iconv. This would allow "normalizing" the
input (after input processing of escape sequences, see (4)) to what
is actually supported by the dictionary.
 4. Make sure you correctly detect the input structure and behave
accordingly. This is not really interesting for interactive use,
since it must be part of the calling application, but it could be very
important for standalone usage. This usually boils down to two things:
 1. detect parts that are not to be checked (e.g.: XML tags).
 2. replace escape sequences in sections that should be checked
(e.g.: &quot;).

This is a kind of filter depending on the input structure; it could be
made pluggable (to be future-proof) and could be either
auto-detected (mime type?), if possible, or specified via a
command-line option/API.
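The split proposed in point 2.4 could be captured by a small backend interface; everything below (names, the toy in-memory implementation) is a hypothetical sketch, not LT's actual API:

```java
import java.util.*;

// Hypothetical facade over an underlying spell checker (fsa, hunspell, ...):
// "review" only finds misspelled words (the fast path run on every keystroke),
// "suggest" is called lazily, e.g. when the user opens the suggestion pop-up.
interface SpellerBackend {
    List<String> review(String text);      // misspelled words only
    List<String> suggest(String word);     // corrections, possibly slow
    boolean addToDictionary(String word);  // false if the backend can't do it
}

// Minimal in-memory backend demonstrating the contract.
class SetBackedSpeller implements SpellerBackend {
    private final Set<String> dictionary = new HashSet<>();

    SetBackedSpeller(Collection<String> words) { dictionary.addAll(words); }

    @Override public List<String> review(String text) {
        List<String> errors = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty() && !dictionary.contains(w.toLowerCase())) errors.add(w);
        }
        return errors;
    }

    @Override public List<String> suggest(String word) {
        // toy suggestions: dictionary words differing only in the last letter
        List<String> out = new ArrayList<>();
        for (String d : dictionary) {
            if (d.length() == word.length()
                && d.regionMatches(0, word.toLowerCase(), 0, word.length() - 1)) out.add(d);
        }
        return out;
    }

    @Override public boolean addToDictionary(String word) {
        return dictionary.add(word.toLowerCase());
    }
}
```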
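For point 3, part of the job can already be done with the JDK alone: NFKC compatibility normalization folds ligatures such as "ﬁ", while the typographic apostrophe (U+2019) survives normalization and needs an explicit mapping. A sketch of the idea, not of anything LT currently does:

```java
import java.text.Normalizer;

// Map input to the character repertoire a plain dictionary understands.
class DictNormalizer {
    static String normalize(String s) {
        // NFKC folds compatibility characters: the "fi" ligature U+FB01 -> "fi".
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
        // The right single quote U+2019 is untouched by NFKC; map it by hand.
        return nfkc.replace('\u2019', '\'');
    }
}
```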

I know Daniel is not fond of too many user configuration options, but
IMHO the needs LT may be called on to fulfill are so different that a
one-size-fits-all strategy is very likely to disappoint almost everyone.
I am ready to stand corrected, but the above reflects my current needs
and understanding of LT.
I am also available to help, in my ample spare time ;)

I will do my best to answer any comments in the next few hours, but I
will be out of town from tomorrow evening till Sept 15th. After that I
will be (more or less) available again.

Thanks for the good work and
Best Regards
Mauro Condarelli
---

Re: [Languagetool] presentation

2012-09-02 Thread Mauro Condarelli
On Sunday, 2 September 2012 at 22:50:20, Daniel Naber wrote:

> It's for a different use case: our rules might make suggestions that are so
> specific that the spell checker doesn't know them, and it would be
> unfortunate if we correct to something that the spell checker then
> complains about. It is indeed something else than "ignore word" from the
> user's point of view (which isn't implemented yet - help is welcome).

Sorry, I fail to understand this.
IMHO the rules should operate at a higher level than "simple" spell-checking.
Applying a rule should never be able to break the speller.

>>  3. LT doesn't really understand Unicode (prevents usage on
>> htLaTeX-generated docs):
>>  1. it does not understand ligatures.
>>  2. it does not understand special apostrophe.
>>  3. it does not understand other special chars.
>
> There's http://en.wikipedia.org/wiki/Unicode_equivalence and we might want
> to use that.
agreed.

>>  4. LT understanding of XML is very limited as it does not understand
>> &xxx; constructs (prevents usage on my document source).
> This should be done outside of LT, we should basically only work on plain
> text.
Not really, not if you want to provide a full-fledged application.
You would drastically restrict the use cases.
Nowadays no one uses "plain text" anymore (unfortunately).
What I am trying to say is:
either you provide full Unicode support (including transliteration
from full Unicode to whatever encoding the dictionaries/rules support)
or you need to support some (possibly pluggable) kind of mime-decoding.

>>  1. I can help refining rules for Italian, as tester at beginning, more
>> active if and when I will learn how to write efficient rules.
>
> That's great!
>
>>  1. define a standard interface between LT and the underlying
>> spellchecker.
>
> Everything that detects an error in LT is a subclass of Rule, and for spell
> checking we use SpellingCheckRule, which already has two subclasses (for
> Hunspell and for Morfologik).
I will dig into that ASAP.

>>  3. make it possible to choose at runtime (via Preference Page).
> As I mentioned, I'd rather prefer no configuration, as this is something too
> complex for the user to decide.

>>  3. Use Transliteration to remove all characters not really present in
>> dictionary (e.g. see:
> This seems to be a lossy step, so Unicode normalization (see above) might
> be more appropriate.
It really depends on whether and how well the dictionaries/rules support
full Unicode (including ligatures, different kinds of apostrophes,
quotes, ...).

>> I know Daniel is not fond of too many user configurations, but IMHO the
>> needs LT can be called to fulfill are so different one size-fits-all
>> strategy is very likely to disappoint almost anyone.
> People embedding LT into their own applications already have full freedom,
> e.g. they can create their own rules and deactivate ours. I think this
> helps a lot.
Here my viewpoint is rather different:
I may want to incorporate dynamic dictionaries, and thus I would prefer
hunspell over Morfologik, where I would have to recompile the
dictionary each time I modify it; but I do not really want to be forced
to learn the whole rule-writing procedure (the keyword here is
"forced").
For interactive use, "ignore once", "ignore all" and "add to
[user|doc-specific] dictionary" are very different concepts.
LT has great potential, but if the general interface is too
restricted (with respect to a plain spellchecker) it might be very
difficult to integrate it into custom programs.

> LT is basically (at least) two things: a Java library and an application.
> If we manage to put those two in their own maven modules, this could help
> us to get a clearer picture of what needs to be done where.
That may be, but I think this really is a question of interfaces,
while maven is "only" an automated build process.

I might have a very biased view, but I think that if you focus too much
on the application, the result will be a library so specific that it
will be difficult to use anywhere else.
IMHO you should focus on a very general library and then produce a few
applications which "only" show what can be done with the lib.
The focus should be on decoupling all linguistic knowledge (belonging to
the lib/rule writers) from the use cases (interactive/batch,
auto-correct/flag-only, single or multi-language, single or
multi-dictionary, etc.) that belong to the application writer.

I would like to be able to incorporate the library into my app without
any need to understand how it does its magic.
The user should not be forced to know anything about rules. That's the
domain of the linguistics gurus.
I might be able to help somewhat there, but only for Italian!
My application could be internationalized and end up in the hands of a
Polish (or German, English, French, ...) user, and I cannot provide
support in all those languages (obviously!).
I must rely on a solid API and trust in whoever…

[Languagetool] Status of rules for Italian

2012-09-22 Thread Mauro Condarelli
Hi,
I spent some time integrating LT into an Eclipse RCP application.
I also integrated plain Hunspell as a "proven good" alternative.

I am Italian and my focus, as said, is primarily the Italian language.

To do some testing I started with "well known" novels:
"I Promessi Sposi" results in 22861 errors in 1056 sentences, mostly due
to verb concordance and repetitions.
I tried with a modified version of grammar.xml (thanks to P. Bianchini!)
and the error count dropped a little (to 22360).

Another large source of errors is (apparently) a bug in the
tokenizer: in Italian the apostrophe (') should be a word character, so
expressions like "dall'altra" or "all'occhio" should be treated as a
single entity. I am unsure whether this is an LT problem or a deeper
Hunspell one.
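The difference can be illustrated with two toy regex tokenizers (toys only, not LT's or Hunspell's actual tokenization):

```java
import java.util.Arrays;
import java.util.List;

class ToyTokenizers {
    // Word characters = letters only: the apostrophe splits "dall'altra".
    static List<String> lettersOnly(String s) {
        return Arrays.asList(s.split("[^\\p{L}]+"));
    }

    // Word characters = letters plus apostrophe: "dall'altra" stays whole,
    // which is what Italian elision needs.
    static List<String> lettersAndApostrophe(String s) {
        return Arrays.asList(s.split("[^\\p{L}']+"));
    }
}
```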

Notice I got these results after disabling some of the more
"problematic" rules, thus halving the error count.

I am willing to help fix the rules, but I think I can be more useful
trying to implement some self-learning in LT.
I would like to discuss my ideas with someone who is in charge of LT,
because I don't really know it and I do not know what the plans for
the future are.

My (currently very vague) proposal is:

 1. implement self-learning files to hold the data; there should be
three more-or-less equivalent files:
 1. global, intended to improve LT itself.
 2. personal, holding data "private" to the user.
 3. local, connected to the specific document being proofread.
 2. implement some kind of interface to flag specific rule matches as
"non-errors". This is tricky because it should be resilient
to source document changes.
 3. subsequent runs should recognize the self-learning files and avoid
repeated errors.
 4. implement some way to use the self-learning data to improve
spell-checker and rules themselves.
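Point 3 of the proposal (subsequent runs skipping matches already flagged as non-errors) could be sketched as a post-processing filter keyed on rule id plus matched text; the Match type below is a stand-in, not LT's real RuleMatch:

```java
import java.util.*;

class FalsePositiveFilter {
    // A match as (ruleId, coveredText); real matches would carry offsets too.
    record Match(String ruleId, String covered) {}

    private final Set<String> accepted = new HashSet<>();

    private static String key(Match m) { return m.ruleId() + "\u0000" + m.covered(); }

    // Called when the user flags a match as a non-error; the key deliberately
    // ignores offsets so it survives edits elsewhere in the document.
    void acceptAsCorrect(Match m) { accepted.add(key(m)); }

    // Drop matches previously accepted as correct.
    List<Match> filter(List<Match> matches) {
        List<Match> out = new ArrayList<>();
        for (Match m : matches) {
            if (!accepted.contains(key(m))) out.add(m);
        }
        return out;
    }
}
```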

I would like a comment on whether this can be accepted by the LT community.

Regards
Mauro

Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: [Languagetool] Status of rules for Italian

2012-09-23 Thread Mauro Condarelli
Hi Daniel,
comments below.

On 23/09/2012 00:15, Daniel Naber wrote:
> On 22.09.2012, 19:32:29 Mauro Condarelli wrote:
>
> Hi Mauro,
>
>> Another large source of errors is (apparently) some error in the
>> tokenizer:  in Italian the apostrophe (') should be a word-character, so
>> expressions like: "dall'altra" or "all'occhio" should be treated as a
>> single entity. I am unsure if this is a LT problem or a deeper Hunspell
>> one.
> we could change the tokenizer for Italian and then re-build the dictionary, 
> as described here: http://languagetool.wikidot.com/hunspell-support - 
> unless Marcin sees a problem with that.
The above examples point to errors in the Hunspell dictionary itself...
which strengthens my "harvesting" proposal (see below).

>> because I don't really know it and I do not know what are the plans for
>> the future.
> Great, this is the right place to discuss those ideas. We have a three-
> month release cycle (1.9 to be released in a week), so it would be useful 
> to fit it in there.
OK. What I have in mind should fit into the next release, given the
presumed development effort and the time I have available (this is a
kind of hobby for me, not part of my regular work).

>>  1. implement self-learning files to hold the data; there should be
>> three more-or-less equivalent files:
>>  1. global, intended to improve LT itself.
>>  2. personal, holding data "private" to the user.
>>  3. local, connected to the specific document being proofread.
> I'm not sure if I understand what those files should contain: are these 
> exceptions to the rules to avoid false alarms? Or just lists of rules to be 
> turned off?
The idea is to have several sections:

 1. Defaults: local overrides user and user overrides global. These
should include (at least):
 1. Language
 2. Mother language
 3. Active rules
 2. False positives: should include the whole phrase containing the
non-error.
 1. Spelling errors: these should go to enhance the dictionaries.
This is particularly important since building a complete
dictionary is a very long and tedious job, while adding a word
can be done routinely.
 2. Grammar rule exceptions.
 3. False negatives: should include the whole phrase containing the
error, along with the error location and (possibly) a short description.
 1. Essentially broken phrases not caught by LT.

LT could use (2) and (3) to refine its checking (simply as a
post-processing step).
If this is done consistently we could benefit from some kind of
net-harvesting, by asking users to send over their files for analysis
by the resident language maintainer and possible rule enhancement (I
don't know whether the process could be automated, but I doubt it).
Tagger dictionary enhancement is, of course, easier, and that alone
would make the process very useful in a relatively short period.

>>  2. implement some kind of interface to flag as "non-errors" some
>> specific rule-matches. This is tricky because it should be resilient
>> to source document changes.
I am currently, as said, integrating LT into my Eclipse RCP application.
My plan is to enhance the "correct spelling error" pop-up to handle "add
to dictionary" (register in the user file) and "ignore" (add to the
local file).

> That would be useful indeed.
>
>> I would like a comment to define if this can be accepted by LT community.
> It sounds useful (but see my question above). I'm not sure yet where the 
> parts belong - we're planning to split LT into Maven modules. The core 
> should be kept simple, and for LO/OO integration we must keep in mind that 
> we implement their simple grammar checker interface, so adding menu items 
> like "ignore here" or "ignore always" might not be easy.
I am not familiar with that interface, but my idea is that we could
provide a richer API and then a thin wrapper layer around it mimicking
whatever LO/OO needs. This retains all the potential and, if LO/OO
can't use the interface to its fullest, it wouldn't cripple other
interfaces.

> We can keep discussing this here, but if you want to make a real plan, feel 
> free to add it to the Wiki at http://languagetool.wikidot.com/
Agreed.
I will study the wiki.
I need to understand better how the tagger dictionary works, in
particular in the presence of words that may have different taggings
depending on context (e.g. "pesca", in Italian, may be a noun (peach),
a verb (he fishes) or an adjective ("barca da pesca" -> fishing boat)).
At a preliminary analysis, many of the false positives could be
mistaggings of this kind.

Re: [Languagetool] rule "successive sentences begin with the same word"

2012-09-25 Thread Mauro Condarelli
Hi Paolo,
unfortunately that rule is completely wrong for Italian.
We should find a way to allow for exceptions like a repeated adjective
implying the superlative ("presto presto" == "prestissimo"), which is
used quite extensively by A. Manzoni ;)

Otherwise I concur that using well-established (and more recent) novels
by authors known for a "good Italian style" should be a much better test
than using Wikipedia.

I took my copy of "Promessi Sposi" from LiberLiber and there are a lot
of other good sources there.

Regards
Mauro

On 25/09/2012 12:02, Paolo Bianchini wrote:
> Hi Daniel,
>
> I'm not an expert in English writing style, but I have a bit of difficulty in 
> looking at wikipedia as a source for good/correct writing. I would keep the 
> rule as it is.
>
> On the other hand, I have appreciated Mauro's test of feeding a well known 
> Italian novel into LT. I'm sure that Manzoni didn't use the same word to 
> start two consecutive sentences in the "Promessi Sposi" at all.
>
> Why not try to bring this approach a little bit further and define a set of 
> well recognized novels for each language to test LT?
>
> Ciao
>
> Paolo
>
>
>
> On Sep 25, 2012, at 11:50 AM, Daniel Naber wrote:
>
>> The "same word" rule causes a lot of alarms with our Wikipedia check[1]. 
>> Some are caused by the fact that the text extraction from Wikipedia is 
>> buggy. Still, I'm not sure how useful that rule is for long sentences. 
>> Should we disable it by default, or change it so that it only gets 
>> triggered by short sentences? I think for long sentences it doesn't affect 
>> style if they start with the same word. Opinions?
>>
>> [1] 
>> http://community.languagetool.org/corpusMatch/list?lang=en&filter=ENGLISH_WORD_REPEAT_BEGINNING_RULE
>>
>> -- 
>> http://www.danielnaber.de
>>




Re: [Languagetool] rule "successive sentences begin with the same word"

2012-09-25 Thread Mauro Condarelli
Sorry, I misread the original post.
My comment applies to rule ST_03_002 "Stile -> Leggibilità -> 
ripetizioni".
Mauro

On Tuesday, 25 September 2012 at 13:30:17, Mauro Condarelli wrote:
> Hi Paolo,
> unfortunately that rule is completely wrong for Italian.
> We should find a way to allow for exceptions like repeated adjective to
> imply superlative ("presto presto" == "prestissimo") that is used quite
> extensively by A.Manzoni ;)
>
> Otherwise I concur using well established (and more recent) novels by
> authors known to use a "good Italian style" should be much better test
> than using wikipedia.
>
> I took my copy of "Promessi Sposi" from LiberLiber and there are a lot
> of other good sources there.
>
> Regards
> Mauro




[Languagetool] info request

2012-09-29 Thread Mauro Condarelli
Hi,
I'm coding my test app with LT and I'm almost ready to become productive.

I need deeper insight in two areas (for now):

1) how the tagging dictionary is supposed to disambiguate multiple
meanings of some words.
The reason is that I need to understand if and how it is possible to
make it better: a sizable amount of false positives (in Italian at
least) comes from misinterpretation of POS tags.
"cammino" in Italian is both a verb (I walk) and a noun (path), so the
phrase "alla fine del cammino mi riposai" (at the end of the path I
rested) is flagged as an error because "verbs don't agree", even though
there is just one (true) verb.
It would be trivial to disambiguate this particular example, because an
article cannot immediately precede a verb, so "cammino" cannot be a
verb here. I don't know how to teach this to LT.

2) I need better insight into the dictionary structure (I think it is
based on the morfologik package, but I don't know if it has been
adapted for LT usage). I would like to understand whether it's possible
to add entries "on the fly".
I would also like to know if it's possible to attach further
information to the words (e.g.: pointers to synonyms and antonyms).
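The heuristic from point 1 (an article cannot immediately precede a verb) can be sketched on plain data structures; the types here are illustrative only, since in LT this logic would belong in a disambiguator:

```java
import java.util.*;

class ArticleVerbFilter {
    // A token with its candidate POS readings,
    // e.g. "cammino" -> [NOUN-M:s, VER:ind+pres+1+s].
    record Token(String text, List<String> readings) {}

    // Drop VER readings of a token directly preceded by an article
    // (ART/ARTPRE), but only if some non-verb reading remains.
    static List<Token> disambiguate(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean afterArticle = i > 0
                && tokens.get(i - 1).readings().stream().anyMatch(r -> r.startsWith("ART"));
            if (afterArticle) {
                List<String> kept = new ArrayList<>();
                for (String r : t.readings()) {
                    if (!r.startsWith("VER")) kept.add(r);
                }
                if (!kept.isEmpty()) t = new Token(t.text(), kept);
            }
            out.add(t);
        }
        return out;
    }
}
```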

Who can point me in the right direction?

TiA
Mauro



Re: [Languagetool] info request

2012-09-30 Thread Mauro Condarelli
Hi, thanks.

Comments below.

On 30/09/2012 11:52, Marcin Miłkowski wrote:
> On 2012-09-30 05:05, Mauro Condarelli wrote:
>> Hi,
>> I'm coding my test app with LT and I'm almost ready to become productive.
>>
>> I need a deeper insight in two areas (for now):
>>
>> 1) how the tagging dictionary is supposed to disambiguate multiple
>> meaning of some words;
>> reason is I need to understand if and how it is possible to make it better:
>> a sizable amount of false positives (in Italian at least) comes from
>> misinterpretation of POS tags
>> "cammino" in Italian is both a verb (I walk) and a noun (path), so the
>> phrase:
>> "alla fine del cammino mi riposai" (at the end of the path I rested) is
>> flagged in error
>> because "verbs don't agree" even though there is just one (true) verb.
>> It would be trivial to disambiguate this particular example because an
>> article cannot immediately
>> precede a verb thus "cammino" cannot be a verb. I don't know how to
>> teach this to LT.
> The tagging dictionary only introduces ambiguities. Only the 
> disambiguator disambiguates ;)
>
> There is no disambiguation.xml file for Italian (yet) but adding one 
> (with the corresponding code) is quite easy. Look for example at 
> DanishRuleDisambiguator.java and how it's used in the code (basically, 
> you need to add this class and modify Italian.java to implement 
> getDisambiguator() method).
Ok, understood.

I suggest adding specific instructions to the disambiguator Wiki page
about the steps needed to add a rule-based disambiguator to LT.

Something along the lines:

--
To create a new disambiguator:

*** in src/main/java:

===Create a new package org.languagetool.tagging.disambiguation.rules.xx
===Create in the new package a new rule disambiguator containing:

package org.languagetool.tagging.disambiguation.rules.xx;

import org.languagetool.Language;
import org.languagetool.tagging.disambiguation.rules.AbstractRuleDisambiguator;

public class YyRuleDisambiguator extends AbstractRuleDisambiguator {
  @Override
  protected Language getLanguage() {
    return Language.ZZ;
  }
}

where:
xx is the two-letter language code
Yy is the language name
ZZ is the static Language variable

===Open org.languagetool.language.Yy.java and override getDisambiguator():

@Override
public final Disambiguator getDisambiguator() {
  if (disambiguator == null) {
    disambiguator = new YyRuleDisambiguator();
  }
  return disambiguator;
}

*** in src/main/resources:

===Create file org/languagetool/resource/xx/disambiguation.xml
===Populate it
--

I tried with some very simple rules, but I'm far from sure I grok the
xml semantics:

--
[The sample disambiguation.xml was stripped by the mailing-list archive;
only its schema reference,
xsi:noNamespaceSchemaLocation="../disambiguation.xsd", survives.]
--

I was trying to say that if I have a VERb preceded by an ART or an
ARTPRE, then the VER interpretation of the token should be removed
(usually such a token has multiple interpretations as NOUN and VER).

Is this the right syntax?
Is changing the .xml file and relaunching the app enough, or should I
rebuild the app?

(Sorry for the possibly stupid questions, but I'm still very confused.)
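For what it's worth, a rule expressing that intent might look roughly like the sketch below. This is only my best reading of LT's disambiguation.xml format; the element names and the disambig action should be checked against disambiguation.xsd before relying on them:

```xml
<!-- Sketch only: for a token following an article (ART/ARTPRE), keep the
     NOUN reading and drop the competing VER reading. -->
<rule name="Article + noun/verb" id="IT_ART_NOUN">
  <pattern>
    <token postag="ART.*" postag_regexp="yes"/>
    <marker>
      <token postag="VER.*" postag_regexp="yes"/>
    </marker>
  </pattern>
  <disambig action="filter" postag="NOUN.*"/>
</rule>
```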


>> 2) I need a better insight in the dictionary structure (I think it is
>> based on the morfologik package,
>> but I don't know if it has been adapted for LT usage). I would like to
>> understand if it's possible to add
>> entries "on the fly".
> Not to the morfologik dictionary but you can use the ManualTagger. You'd 
> need to add a method to add a word on the fly.
Where can I find info about such a thing?

>> I would also like to know if it's possible to add further information to
>> the words (e.g.: pointers to synonyms and antinonyms).
> This is not part-of-speech information, so it does not belong to the 
> tagger. But you can add a separate dictionary, and you could even have 
> Wordnet encoded as a finite-state machine for a very quick use 
> (basically, you'd need to prepare a perfect hash fsa file for words in 
> Italian, which is easy; and plan how to encode the Wordnet relationships 
> in a graph whose nodes are hash numbers). But this requires some more 
> coding.
Agreed.
Is there any information beyond what's in the developing-a-tagger-dictionary
wiki page?
That is a recipe to build the compressed dicts, but it's not obvious how
to reuse fsa to build something different.

Re: [Languagetool] info request

2012-10-02 Thread Mauro Condarelli
On Sunday, 30 September 2012 at 20:30:17, Marcin Miłkowski wrote:
>>>> I would also like to know if it's possible to add further information to
>>>> the words (e.g.: pointers to synonyms and antonyms).
>>> This is not part-of-speech information, so it does not belong to the
>>> tagger. But you can add a separate dictionary, and you could even have
>>> Wordnet encoded as a finite-state machine for a very quick use
>>> (basically, you'd need to prepare a perfect hash fsa file for words in
>>> Italian, which is easy; and plan how to encode the Wordnet relationships
>>> in a graph whose nodes are hash numbers). But this requires some more
>>> coding.
>> Agreed.
>> Is there any information beyond what' in developing-a-tagger-dictionary
>> wiki page?
>> That is a recipe to build the compressed dicts, but it's not obvious how
>> to reuse fsa to build something different.
>> I will need to study it better.
>
> Ah, nobody used the dicts for such complex purposes (yet). That's why
> there's no detailed info about it. I'd need to think more to give more
> detailed specs. But overall, fsa dicts can be used for lots of purposes
> with very high performance.
I was unable to find generic documentation on the morfologik-fsa package;
I only found the javadoc API description (which does not give an overall
picture) and a bunch of pages in Polish (which I can't read; I tried
with Google Translate, but got nowhere).
Perhaps someone can point me in the right direction...

TiA
Mauro



Italian Language enhancements

2012-12-27 Thread Mauro Condarelli

Hi,
I'm trying to use LT for Italian.
There are a lot of false positives in my language, so I started looking
around to enhance the rules.


I found out many false positives come from incorrect tagging (to be more
precise: lack of disambiguation), so I tried to implement some very
simple disambiguation.
Unfortunately it doesn't seem to work. At the end of the message you'll
find my changes.


My first test sentence is:
"Prima di lasciarsi il tempo di pensare troppo raccolse zaino e bastone 
da viaggio e, con un lungo passo determinato, attraversò la soglia."


Test results are:
" Starting check in Italian...

*1. Line 1, column 47*
*Message:* Controllare il tempo dei verbi utilizzati nella frase.
[Check the tense of the verbs used in the sentence.]
*Context:* ...di lasciarsi il tempo di pensare troppo
*raccolse zaino e bastone da viaggio* e, con un lungo passo determinato,
attr...

Potential problems found: 1 (time: 25ms)"

Which is absolutely wrong, because the highlighted part contains just one
verb ("raccolse").


Tagging gives:
"  Prima[primo/ADJ:pos+f+s, prima/ADV] di[di/PRE]
lasciarsi[lasciare/VER:inf+pres+si] il[il/ART-M:s] tempo[tempo/NOUN-M:s]
di[di/PRE] pensare[pensare/VER:inf+pres]
troppo[troppo/ADV, troppo/ADJ:pos+m+s, troppo/DET-INDEF:m+s]
raccolse[raccogliere/VER:ind+past+3+s] zaino[zaino/NOUN-M:s] e[e/CON]
bastone[bastone/NOUN-M:s] da[da/PRE]
viaggio[viaggio/NOUN-M:s, viaggiare/VER:ind+pres+1+s] e[e/CON] ,[,/PON]
con[con/PRE] un[un/ART-M:s] lungo[lungo/ADJ:pos+m+s, lungo/PRE]
passo[passo/NOUN-M:s, passo/ADJ:pos+m+s, passare/VER:ind+pres+1+s]
determinato[determinato/ADJ:pos+m+s, determinare/VER:part+past+s+m]
,[,/PON] attraversò[attraversare/VER:ind+past+3+s]
la[la/PRO-PERS-CLI-3-F-S, la/ART-F:s]
soglia[soglia/NOUN-F:s, solere/VER:cond+pres+2+s, solere/VER:cond+pres+1+s,
solere/VER:cond+pres+3+s] .[./SENT]"


There's an ambiguity in the word "viaggio" which, taken alone, can be
either a noun ("trip", the correct meaning in this case) or a verb ("I
travel"), as correctly stated by the tagging.
I assume this is the reason for the false positive; can someone confirm,
please?


I thus tried to avoid this particular error by adding the
disambiguation rules below.
What I wanted to say is: "a PREposition or an ARTicle cannot immediately
precede a VERb".


Obviously I goofed somewhere, because it didn't work (the above results
are *with* the changes).


Can someone help me, please?
TiA
Mauro


Index: src/main/java/org/languagetool/language/Italian.java
===
--- src/main/java/org/languagetool/language/Italian.java (revision 8680)
+++ src/main/java/org/languagetool/language/Italian.java(working copy)
@@ -32,11 +32,14 @@
 import org.languagetool.rules.WordRepeatRule;
 import org.languagetool.rules.it.MorfologikItalianSpellerRule;
 import org.languagetool.tagging.Tagger;
+import org.languagetool.tagging.disambiguation.Disambiguator;
+import org.languagetool.tagging.disambiguation.rules.it.ItalianRuleDisambiguator;

 import org.languagetool.tagging.it.ItalianTagger;

 public class Italian extends Language {

   private Tagger tagger;
+  private Disambiguator disambiguator;

   @Override
   public Locale getLocale() {
@@ -77,6 +80,14 @@
   }

   @Override
+  public final Disambiguator getDisambiguator() {
+    if (disambiguator == null) {
+      disambiguator = new ItalianRuleDisambiguator();
+    }
+    return disambiguator;
+  }
+
+  @Override
   public Contributor[] getMaintainers() {
 final Contributor contributor = new Contributor("Paolo Bianchini");
 return new Contributor[] { contributor };
Index: 
src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java

===
--- 
src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java 
(revision 0)
+++ 
src/main/java/org/languagetool/tagging/disambiguation/rules/it/ItalianRuleDisambiguator.java 
(revision 0)

@@ -0,0 +1,32 @@
+/* LanguageTool, a natural language style checker
+ * Copyright (C) 2007 Daniel Naber (http://www.danielnaber.de)
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
+ * USA
+ */
+
+package org.languagetool.tagging.disambiguation.rules.it;
+
+import org.languageto

Re: Italian Language enhancements

2012-12-27 Thread Mauro Condarelli

On 27/12/2012 23:18, Jaume Ortolà i Font wrote:

2012/12/27 Mauro Condarelli <mc5...@mclink.it>:

I thus tried to avoid this particular error by adding the
disambiguating rules below.
What I wanted to say is: "a PREposition or an ARTicle cannot
immediately precede a VERb".


Hi Mauro,

Your disambiguation rule could be something like this (three in one):

[rule XML stripped by the list archive]

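The XML of Jaume's rule did not survive the archive. As an illustration only (a guess at the shape, not Jaume's actual rule; the rule id and name are invented, the postags follow the Italian tagset visible in the logs above), a disambiguation.xml rule with the described effect, dropping the verb reading of an ambiguous token that directly follows a preposition or article, might look roughly like this:

```xml
<!-- Sketch only, not the original rule from this thread. -->
<rule id="PRE_ART_NOT_VER" name="no verb reading right after PRE/ART">
  <pattern>
    <token postag_regexp="yes" postag="PRE|ART.*"/>
    <marker>
      <!-- the token must have a verb reading AND some other reading -->
      <and>
        <token postag_regexp="yes" postag="VER.*"/>
        <token postag_regexp="yes" postag="NOUN.*|ADJ.*"/>
      </and>
    </marker>
  </pattern>
  <disambig action="remove"><wd pos="VER.*"/></disambig>
</rule>
```

The `<and>` element makes both token conditions apply to the same position, which matches the "and-rule" wording used below; "three in one" presumably refers to covering several first-token cases through the `PRE|ART.*` regexp.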
Thanks! This works very well on my example list.

> However, the postag="VER.*" probably should be more restrictive. If
> infinitives or participles are allowed after a preposition, then you
> need to change the postag regexp, or add exceptions as postag regexps.

Actually, infinitives are fine because they cannot also be NOUNs, so your 
and-rule filters them out.

I will have to check participles more carefully.

> In Catalan there are similar rules. I invested a lot of effort into
> Catalan disambiguation, and I imagine that some of the strategies I
> have used can be useful for other Latin languages.
I had a look at your Catalan disambiguation and the sheer size 
overwhelmed me ;)

I am pretty sure I can dig up some useful hints there.
The real problem is that I'm not a linguist; I merely wanted to use 
LanguageTool and got stuck with too many false positives, so I'm trying 
to help enhance Italian.

My understanding of what I'm doing is far from complete.



Regards,
Jaume Ortolà

Regards
Mauro
--
Master HTML5, CSS3, ASP.NET, MVC, AJAX, Knockout.js, Web API and
much more. Get web development skills now with LearnDevNow -
350+ hours of step-by-step video tutorials by Microsoft MVPs and experts.
SALE $99.99 this month only -- learn more at:
http://p.sf.net/sfu/learnmore_122812___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: Italian Language enhancements

2012-12-27 Thread Mauro Condarelli

On 27/12/2012 22:53, Paolo Bianchini wrote:

Hi Mauro,

you are correct. The reason for the false positive resides in the 
ambiguity of the tagging generated for the word "viaggio".

Ok.

I cannot help you on the specific problem that you are facing with the 
disambiguation rule but I've some ideas on how to improve the verb 
tense rule. I'll try to work on it tomorrow.
Disambiguation is now working, so if you think it might be useful we can 
try to enhance it.

Any other improvement on that rule would be most welcome. Thanks.

The question is: is it better to have false positives or to miss some 
errors? Also, complicating the rules to avoid false positives raises 
elaboration time. We need to find a compromise.
My personal opinion is that LanguageTool-Italian is almost unusable in 
its current state.
Before implementing the disambiguator I had errors in almost *ALL* 
moderately complex phrases (more than 10 words). Needless to say, they 
were all false positives.

With a very stupid disambiguation I got rid of about half of the errors.
IMHO this is not nearly enough.
The target should be no more than one false positive per printed page.
Otherwise people will simply turn the thing off.

Also, the computing-time penalty is (again IMHO!) a false problem.
The time spent hand-checking a single false positive is much more than 
the time a modern computer needs to do its job.
OTOH I completely agree that real mistakes should be detected as 
thoroughly as possible... for the same reason: hand-checking the whole 
document to find one un-flagged error costs much more than checking a 
few false positives.


As stated in my other reply (to Jaume Ortolà) I'm no linguist at all, 
even though I have a fair knowledge of Italian grammar and syntax.
I can thus help testing and programming (I'm currently coding some 
better spelling exception handling), but I really need help on the rule 
side because I have only a faint idea of what's going on there.



Ciao

Paolo

Best Regards
Mauro


Re: Italian Language enhancements

2012-12-28 Thread Mauro Condarelli

On 28/12/2012 08:41, Paolo Bianchini wrote:
Actually what I wanted to try is to add a check in the grammar rule 
that also takes into account the person in which the verb is used, so 
that we check the tense of verbs only if they are in the same person.


Therefore, instead of adding the exception as Dominique was suggesting 
(that btw is something that we could try), the rule would match


raccolse[raccogliere/VER:ind+past+3+s]

with

viaggio[viaggio/NOUN-M:s,viaggiare/VER:ind+pres+1+s]

because they are not in the same person. I assume that if someone were 
to make a verb tense mistake in writing a sentence they would, at 
least, use the same person.

I disagree.
You would be restricting the usefulness of the rule to cover up a 
different deficiency.
Here the right thing is to understand that "viaggio" should NOT be 
interpreted as a VERb at all.


Any suggestion on how this could be achieved? I guess that it would be 
much easier in Java than with regexps
Unfortunately I am not a linguist, and I have trouble understanding 
which rules are the "right" ones to use for disambiguation.


What I can say is that rule GR_10_001 is by far the foremost source of 
false positives in literature. I checked using random chapters of 
well-established novels. OTOH disabling it isn't an option, because 
wrong tense concordance is a major source of error (especially when the 
subjunctive is involved!).


What I don't know is how to build a comparable corpus of *wrong* Italian 
sentences. I will try to devise something. Building wrong phrases myself 
should be a last resort, because those wouldn't be representative of 
real errors "in the wild".



Thanks

Paolo


My disambiguation rule needs updating, if someone can suggest how.

Mario gli chiese l'ora.

121 rules activated for language Italian

 Mario[Mario/NPR]  gli[gli/PRO-PERS-CLI-3-M-S,il/ART-M:p]  chiese[chiesa/NOUN-F:p]  
l[l]'['/PON]ora[orare/VER:impr+pres+2+s,orare/VER:ind+pres+3+s,ora/ADV,ora/NOUN-F:s].[./SENT,]

Disambiguator log:

art-ver: chiese[chiesa/NOUN-F:p,chiedere/VER:ind+past+3+s] -> 
chiese[chiesa/NOUN-F:p]

1.) Line 1, column 7, Rule ID: GR_02_001[2]

Message: L'articolo non concorda: 'le chiese'.
(The article does not agree: 'le chiese'.)

Suggestion: le chiese

Mario gli chiese l'ora.

  ^^


Here the problem is that "gli" is not an ARTicle but a PROnoun, so the 
rule should not apply.
In the same sentence "l'" is not recognized as an ARTicle ("la"), so the 
rule is not applied to the following "ora", while it should have been.


Can someone suggest how to improve this disambiguation rule to cover the 
currently known cases?


TiA
Mauro


null char

2012-12-28 Thread Mauro Condarelli
Hi,
I have some markup in the text I submit to LT.
This needs to be removed before submission, of course.
The problem is that if I remove all markup then I also need to keep 
track of character positions, and that might not be easy (the markup is 
sequences of 2 or 3 chars interspersed in the phrase).
Is it possible to substitute markup with some "null char" to be 
completely ignored by LT (thus preserving the positions of "meaningful" 
chars)?
I tried with white-space, but I get an error about multiple spaces. As a 
last resort I could disable that rule, but its warnings might be useful.
Is there any other option?
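One way to keep offsets stable is to overwrite each markup character with a space, so the string keeps its exact length and the line/column positions LanguageTool reports map one-to-one back to the original text. This is a sketch only: the class name, the helper method and the `\*+` markup regex are all invented for illustration, and (as noted above) the multiple-whitespace rule would then have to be disabled or its matches filtered out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (MarkupBlanker and the "\\*+" markup pattern are
// made up): blank out markup in place so character offsets are preserved.
public class MarkupBlanker {

    static String blankMarkup(String text, String markupRegex) {
        StringBuilder sb = new StringBuilder(text);
        Matcher m = Pattern.compile(markupRegex).matcher(text);
        while (m.find()) {
            // Replace each markup char with a space: same length, same offsets.
            for (int i = m.start(); i < m.end(); i++) {
                sb.setCharAt(i, ' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String src = "raccolse *zaino* e bastone";
        String clean = blankMarkup(src, "\\*+");
        // Length is unchanged, so any error position reported on 'clean'
        // points at the same character in 'src'.
        System.out.println(clean.length() == src.length()); // prints "true"
    }
}
```

The mapping back is then trivial (identity), at the cost of feeding the checker extra spaces instead of truly invisible characters.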

TiA
Mauro



Re: Italian Language enhancements

2012-12-28 Thread Mauro Condarelli

On 29/12/2012 00:24, Jaume Ortolà i Font wrote:
2012/12/28 Paolo Bianchini:


I'm interested in learning if other languages have a rule to check
if there's concordance of tense among verbs with sentences. Any
input or suggestion?


I don't think that "concordance of tense" can produce useful rules. I 
have taken a look at the GR_10_001 rules group ("concordanza tempi 
delle coordinate"), and my impression is that it will generate 
unfailingly a lot of false alarms.


Yet I plan to write rules in Catalan to check some wrong combinations 
of tenses in dependent clauses.


Jaume


I am developing the feeling that we sorely miss a robust set of 
disambiguation rules for Italian.
I took a look at your Catalan rules and I must say I understand only a 
tiny fraction of what is there.


Almost all false positives come from misinterpreting some word.
In most cases they are inflected verbs clashing with nouns or other POS, 
but there is also a fair number of articles clashing with pronouns 
(e.g. "gli" is both an ARTicle:m-p and a PROnoun meaning "to him" or 
"to them").
In all such clashing cases it's easy for rules to match the "wrong" 
meaning and thus raise unwanted alarms.


Unfortunately, not being a linguist, I have only very vague ideas on how 
to tackle the problem.


Regards
Mauro


Re: Italian Language enhancements

2012-12-28 Thread Mauro Condarelli

On 29/12/2012 01:17, Jaume Ortolà i Font wrote:

2012/12/28 Mauro Condarelli <mc5...@mclink.it>:

My disambiguation rule needs updating, if someone can suggest how.

Mario gli chiese l'ora.

121 rules activated for language Italian

 Mario[Mario/NPR]  gli[gli/PRO-PERS-CLI-3-M-S,il/ART-M:p]  chiese[chiesa/NOUN-F:p]  
l[l]'['/PON]ora[orare/VER:impr+pres+2+s,orare/VER:ind+pres+3+s,ora/ADV,ora/NOUN-F:s].[./SENT,]

Disambiguator log:

art-ver: chiese[chiesa/NOUN-F:p,chiedere/VER:ind+past+3+s] -> 
chiese[chiesa/NOUN-F:p]

1.) Line 1, column 7, Rule ID: GR_02_001[2]

Message: L'articolo non concorda: 'le chiese'.

Suggestion: le chiese

Mario gli chiese l'ora.

   ^^


Here problem is "gli" is not an ARTicle, but a PROnoun, thus the
rule should not apply.
In the same sentence "l'" is not recognized as ARTicle ("la") and
thus the rule is not applied to the following "ora", while it
should have.


Hi Mauro,

Catalan has exactly the same kind of ambiguities. I have (more or 
less) solved them, but it is quite complicated. Here are some of the 
ideas I used:


- Number and gender concordance/non concordance is used to keep or 
discard interpretations.
- Proximity of two consecutive verbs or two consecutive nouns is used 
to discard interpretations.

- Tags for "nominal groups" and "verbal groups" are applied.
- Concordance of 4- and 3-token patterns is given more "weight" 
than 2-token patterns. Example: "La porta bianca" 
(article-noun-adjective in concordance) is more probably a nominal 
group than "la porta" (article-noun or pronoun-verb).

- Etcetera.

I need to put in order the Catalan disambiguation file, and then see 
what can be directly used in other languages.


In your example, if the disambiguation rule you are using is the one I 
wrote before, then you need to add an exception:


[rule XML stripped by the list archive; a surviving fragment shows an
exception token with postag="PRO.*"]

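The amended rule was stripped by the archive too. Judging from the surviving `postag="PRO.*"` fragment, the first token presumably carried an exception so that words like "gli", which can also be pronouns, no longer trigger the rule. A sketch of that shape (an assumption, not Jaume's original XML):

```xml
<!-- Sketch only: first token excluded when it may itself be a pronoun. -->
<rule id="PRE_ART_NOT_VER_2" name="no verb reading after PRE/ART, PRO excepted">
  <pattern>
    <token postag_regexp="yes" postag="PRE|ART.*">
      <exception postag_regexp="yes" postag="PRO.*"/>
    </token>
    <marker>
      <and>
        <token postag_regexp="yes" postag="VER.*"/>
        <token postag_regexp="yes" postag="NOUN.*|ADJ.*"/>
      </and>
    </marker>
  </pattern>
  <disambig action="remove"><wd pos="VER.*"/></disambig>
</rule>
```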
Now, taking into account that "gli" (ART-M:p) doesn't agree with 
"chiese" (NOUN-F:p) you could discard the article-noun interpretation, 
and keep the pronoun-verb interpretation. You should familiarize 
yourself with "unification" in order to write such rules:

http://languagetool.wikidot.com/using-unification

Regards,
Jaume Ortolà


Thanks Jaume.
I modified my rule as follows:

[rule XML stripped by the list archive]

... only slightly different from what you posted.
I *think* I understand what you suggested about "unification".
What I currently have no idea about is how to handle "probabilities".
I will try to come up with something tomorrow.
Judging from the length of your .xml I fear it will be a looong fight.

Thanks again.
Mauro


Re: missing translations for LT 2.0

2012-12-29 Thread Mauro Condarelli
On 29/12/2012 14:42, Daniel Naber wrote:
> Hi,
>
> the following languages don't have 100% translation coverage yet:
>
> Romanian, French, Greek, Chinese, Italian, Japanese, Khmer, Slovak
>
> Today and tomorrow is the last change to fix it. If you can do so, please
> go to
> https://www.transifex.com/projects/p/languagetool/resource/messagesbundleproperties/
> and add some translations (usually only very few translations are missing).
>
> Languages that didn't have updates for quite some time are Czech, Swedish,
> Icelandic, and Lithuanian. Updates for those are also welcome.
>
> Regards
>   Daniel
>
I registered with Transifex as "mc5686" and joined the Italian team.
Before I can help translating someone needs to enable me.

Regards
Mauro



Re: missing translations for LT 2.0

2012-12-29 Thread Mauro Condarelli

On 29/12/2012 17:11, Daniel Naber wrote:

On 29.12.2012, 17:02:24 Mauro Condarelli wrote:


I registered with Transifex as "mc5686" and joined the Italian team.
Before I can help translating someone needs to enable me.

I just did that.

Regards
  Daniel

Thanks.
I added the missing translations to "*LanguageTool Core and User 
Interface*".
I did not tackle the Firefox extension translations because IMHO 
LT-Italian is not ready for prime time.

I am working to improve disambiguation.
I fear it will have to wait for the next release.
The current status of the Java code is OK; the disambiguation.xml rules 
are far from complete.


Regards
Mauro


disambiguator blues

2012-12-29 Thread Mauro Condarelli
Hi,
I'm trying to understand how rules (specifically unification rules) work.

What I'm trying to do is to say something like:
if I have a token which may be an article, followed by a token which may 
be a noun or a verb...
   if the article and the noun do not have the same gender and number...
   then the second token is a verb (it would be better to say it is not 
a noun).

The rule I wrote is:

[rule XML stripped by the list archive]

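Mauro's actual rule did not survive the archive. Going by the unification syntax on the using-unification wiki page mentioned by Jaume, a rule of the intended shape (when article and noun cannot agree in gender and number, drop the noun reading of the ambiguous second token) might look roughly like the untested sketch below; the exact placement of `<marker>` inside `<unify>` and the `negate` attribute need checking against the schema:

```xml
<!-- Untested sketch, not the rule Mauro posted. -->
<rule id="ART_NOUN_DISAGREE" name="article + non-agreeing noun: keep verb reading">
  <pattern>
    <unify negate="yes">
      <feature id="gender"/>
      <feature id="number"/>
      <token postag_regexp="yes" postag="ART.*"/>
      <marker>
        <and>
          <token postag_regexp="yes" postag="NOUN.*"/>
          <token postag_regexp="yes" postag="VER.*"/>
        </and>
      </marker>
    </unify>
  </pattern>
  <disambig action="remove"><wd pos="NOUN.*"/></disambig>
</rule>
```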
This does not work (and neither did several other attempts to achieve 
the same result).
Test phrase is:

Mario gli chiese l'ora.

121 rules activated for language Italian

 Mario[Mario/NPR]  gli[gli/PRO-PERS-CLI-3-M-S,il/ART-M:p]  
chiese[chiesa/NOUN-F:p,chiedere/VER:ind+past+3+s]  
l[l]'['/PON]ora[orare/VER:impr+pres+2+s,orare/VER:ind+pres+3+s,ora/ADV,ora/NOUN-F:s].[./SENT,]

Disambiguator log:

2.) Line 2, column 7, Rule ID: GR_02_001[2]

Message: L'articolo non concorda: 'le chiese'.

Suggestion: le chiese

Mario gli chiese l'ora.

   ^^



The wiki pages did not help.
Can someone help me understand how this thing is supposed to work, please?

TiA
Mauro

P.S.: I defined the unification tags as follows:

[unification tag definitions stripped by the list archive]

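The equivalence definitions in the P.S. were stripped as well. In disambiguation.xml they would sit in `<unification>` blocks near the top of the file; a rough sketch for Italian follows, where the postag regexps are guesses against tags seen in the logs (NOUN-M:s, ADJ:pos+m+s, VER:ind+past+3+s) and must be verified against the full tagset:

```xml
<!-- Sketch only: number/gender are encoded differently per POS in the
     Italian tagset, so these regexps are first approximations. -->
<unification feature="number">
  <equivalence type="singular">
    <token postag_regexp="yes" postag=".*[:+]s"/>
  </equivalence>
  <equivalence type="plural">
    <token postag_regexp="yes" postag=".*[:+]p"/>
  </equivalence>
</unification>
<unification feature="gender">
  <equivalence type="masculine">
    <token postag_regexp="yes" postag=".*(-M:|\+m\+).*"/>
  </equivalence>
  <equivalence type="feminine">
    <token postag_regexp="yes" postag=".*(-F:|\+f\+).*"/>
  </equivalence>
</unification>
```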



Re: making XML rules more compact?

2012-12-30 Thread Mauro Condarelli
Just my 2c...

IMHO there's no gain (aside from the editor problem, but that seems more 
easily solved by switching editors!) in obfuscating the code by keeping 
the same XML-like structure and merely shrinking the tags.

If you want to really gain something, we could design a suitable DSL and 
implement it.
I recently did that for a customer using Eclipse Xtext.
I'm pretty sure we could come up with something in a very reasonable 
amount of time.
This would lock us to Eclipse, but you would get a 
syntax-highlighting/error-checking editor almost for free, and it would 
be possible to automatically generate code from the rules, thus doing a 
kind of compilation as opposed to the current interpretation of the rules.

Since the rules effectively are productions, something like BNF could be 
more straightforward, but there I would defer to people with more 
experience in rule-writing.
As a newbie I can tell you I have a lot of difficulty forcing myself to 
think in XML while the task at hand is production-oriented, but that 
might well be my limitation.
I am available to help, if you decide to try this route.
The first step would be to decide the grammar of the new DSL.

Regards
Mauro


On 30/12/2012 22:34, Daniel Naber wrote:
> On 30.12.2012, 22:15:22 Marcin Milkowski wrote:
>
>> What is the problem you are trying to solve?
> Indeed my editor cannot properly edit the files anymore. But the redundancy
> also makes rules harder to read and to write (I guess most people copy
> existing rules to create new ones though).
>
>> adding exceptions would be a nightmare in the new syntax scheme.
> I wouldn't do that, that syntax variant would only apply for simple rules
> that don't need exceptions.
>
> Regards
>   Daniel
>




Re: making XML rules more compact?

2012-12-31 Thread Mauro Condarelli
Hi All,

On 31/12/2012 12:48, Mike Unwalla wrote:
> Hello,
>
> Readability is more important than decreasing the size of a file. In my
> opinion, Step 1 and Step 3 decrease readability. ' is clearer than
> ''.
I completely agree with the above.

The point I was trying to make is that XML doesn't look well suited to 
describing a set of production rules for text transformation 
(disambiguator) or syntax checking (grammar).

In such a case it's common to devise a DSL (Domain-Specific Language) 
precisely describing the problem, thus greatly enhancing readability 
and maintainability.
The downside of this approach is the need to build a complete toolchain 
for the new language, including a suitable editor and a compiler.

I was pointing out that Eclipse includes all the tools needed to build 
the necessary framework with very little effort (actually little more 
than writing the BNF grammar for the DSL itself). This can be deployed 
into Eclipse itself (as a plugin) or wrapped in a stand-alone "RCP" 
application acting as a (very fat) editor (complete with syntax 
highlighting, on-the-fly error detection and auto-completion) for the 
language files that, as a "side effect", also produces some suitable 
representation of the semantics. This "suitable representation" could 
be in the form of compilable Java classes (for speed) or even the 
current XML syntax (for compatibility).

Regards
Mauro
> In a related reply, Dominique wrote:
>  It will only marginally reduce size. But shorter add less noise
>  so it's clearer in my opinion.  and  may look less readable
>  than  and  but since rule developers
>  use them all the time, they would be well familiar with them.
>
> I do not create rules each day. Typically, I work with LT each day for 2 or
> 3 weeks. Then, I work on other projects for weeks or months.
>
> Regards,
>
> Mike Unwalla
> Contact: www.techscribe.co.uk/techw/contact.htm
>
>
> -Original Message-
> From: Daniel Naber [mailto:list2...@danielnaber.de]
> Sent: 30 December 2012 20:56
> To: development discussion for LanguageTool
> Subject: making XML rules more compact?
>
> Hi,
>
> we have three languages with grammar files that are more than 1 MB large
> (German, French, Catalan). The German grammar.xml has more than 24,000
> lines. This size makes editing the files difficult. I have some ideas on how
>
> to improve the situation and I'm looking for other ideas and comments:
>
> Step 1 - the easy one
>
> We can make the syntax a bit more compact and readable by changing some
> elements:
>
> [four element renamings; the original and shortened tag names were
> stripped by the archive]
>
>
> Step 2 - less repetition (also easy to implement)
>
> The contents of , , and  should be inherited from a
>  element to its  elements. This way those elements do not
> need to be repeated if the are the same for all rules of a rulegroup.
>
>
> Step 3 - an XML-free pattern
>
> Add a compact way to describe simple patterns. This is best explained by
> example. What is now this:
>
> <rule>
>   <pattern>
>     <token regexp="yes">foo|bar</token>
>   </pattern>
>   <message>myerror</message>
> </rule>
>
> ...could be written like this:
>
> re:foo|bar _myerror_
>
> Thus you don't need "<token>" at all, as whitespace implies a token
> boundary. The prefix "re:" turns on regular expression matching (the same
> for "pos:" -> POS tag, "pos:re:" -> POS tag regex). The message element
> is replaced by underscores. This does not support exceptions and other
> advanced features, but it turns a 6-line rule into a 1-line rule. This
> new syntax is optional, i.e. the old one can still be used.
>
> What do you think about that? Other suggestions for making rule syntax more
> compact?
>
> Regards
>   Daniel
>
>




Re: making XML rules more compact?

2012-12-31 Thread Mauro Condarelli
Hi Marcin,
sorry, I seem unable to express myself clearly.
What I mean is the following:

1) Design a language fitted to our purposes, completely different from XML.

I am not really sure about what form this language should have, but, as 
an example, you could think something like:

od_check := 'od' "2.startsWith([aeiu])" => "L'uso della 'd' eufonica 
dovrebbe essere limitato ai casi di incontro della stessa vocale" -> 'o' 
"2.getToken()";

Here:
- things in single quotes stand for the token itself;
- things in double quotes are functions of the token;
- the name at the start is the name of the rule;
- := introduces the matching part;
- => introduces a message to be displayed to the user;
- -> introduces a possible replacement;
- ; closes the rule;
- a naked number in double quotes stands for the Nth token.

The rule name could be used elsewhere as a token.

2) Write an EBNF for this new language.

This would be done together with (1), of course.

3) At this point it is easy, using the facilities provided by Xtext, to 
get a specific editor for the new language and an AST for the rules 
written.

4) Write a "code generator" that walks the AST and generates the current 
xml file.
This would allow us to retain all current code and to continue producing 
.xml, entering from the "back door".

the above rule could be translated to:

<rule id="OD_CHECK" name="d eufonica">
  <pattern>
    <token>od</token>
    <token regexp="yes">[aieu].*</token>
  </pattern>
  <message>L'uso della 'd' eufonica dovrebbe essere limitato ai casi di
incontro della stessa vocale: <suggestion>o <match no="2"/></suggestion>.</message>
</rule>

5) Write a different code generator that, walking the same AST as in (4), 
produces a Java class implementing the rule behavior (skipping the xml 
encoding/decoding).

This could be something like:

public AnalyzedTokenReadings[] od_check(AnalyzedTokenReadings[] atr) {
    if (atr.length < 2)
        return null;
    AnalyzedTokenReadings t1 = atr[0];
    if (!t1.getToken().equals("od"))
        return null;

    AnalyzedTokenReadings t2 = atr[1];
    if (!t2.getToken().matches("[aieu].*"))
        return null;

    String suggestion = "o " + t2.getToken();
    raiseError("L'uso della 'd' eufonica dovrebbe essere limitato ai casi "
            + "di incontro della stessa vocale", new String[] { suggestion });

    return new AnalyzedTokenReadings[] { t1, t2 };
}

Obviously I chose a particularly simple example because I did not really 
want to design the DSL ;)
I also (still) have only a vague idea of what happens under the hood 
when rules are matched, and of the full capabilities of the rules 
themselves.

I understand this proposal is quite a radical paradigm shift, even if we 
can implement it incrementally, but I believe that, if done right, it 
could speed up rule development significantly.

Regards
Mauro

On 31/12/2012 14:44, Marcin Miłkowski wrote:
> Hi,
>
> W dniu 2012-12-31 14:01, Mauro Condarelli pisze:
>> Hi All,
>>
>> On 31/12/2012 12:48, Mike Unwalla wrote:
>>> Hello,
>>>
>>> Readability is more important than decreasing the size of a file. In my
>>> opinion, Step 1 and Step 3 decrease readability. ' is clearer than
>>> ''.
>> I completely agree with the above.
>>
>> The point I was trying to make is xml doesn't look suited to describe a
>> set of production rules for text transformation (disambiguator) or
>> syntax check (grammar).
>>
>> In such a case it's common to devise a DSL (Domain Specific Language)
>> precisely describing the problem and thus enhancing manifold readability
>> and maintainability.
>> The downside of this approach is the need to build a complete toolchain
>> for the new language, including a suitable editor and a compiler.
>>
>> I was pointing out eclipse includes all tools to easily do all necessary
>> framework with very little effort (actually little more than writing the
>> BNF grammar for the DSL itself).
> Well, I'm not sure if this will be so easy, as conversion of XML
> languages into BNF is not a completely trivial business. There are no
> standard converters between XML Schema and BNF, for example, and I'm not
> sure if XSD is context-free just like BNF. It might be higher in
> Chomsky's hierarchy because it allows for some context-sensitivity in
> element names and regular expressions on the right-hand side of
> productions... I'm not sure how much of this is actually used in our
> .xsd files.
>
>   > This can be deployed into eclipse
>> itself (as a plugin) or wrapped in a stand-alone "RCP" application
>> acting as a (very fat) editor (complete with syntax-highlighting,
>> on-the-fly error detection and auto-completion) for the language files
> We already have XML editors that do that, and more.
>
>> that, as a "side effect" produces also some suitable representation of
>>

Dinamic dictionary handling

2013-01-19 Thread Mauro Condarelli
Hi,
I started coding to add dictionary handling to LT.
Currently I have multi-dictionary capability, and I have (slightly) 
modified MorfologikSpellerRule to accept, without further action, words 
that have POS tags.
I would like to submit the present code and get feedback before doing 
further work.
How should I proceed?

Mauro



Re: Dynamic dictionary handling

2013-01-19 Thread Mauro Condarelli
On 19/01/2013 17:04, Daniel Naber wrote:
> On 19.01.2013, 16:28:42 Mauro Condarelli wrote:
>
> Hi Mauro,
>
>> I would like to submit present code and get feedback before doing
>> further work.
> please send your code as a patch file (created with "svn diff") to this list.
>
>
Here they come.
The patches cover two distinct areas:
1) implementation of an Italian disambiguator (with token rules);
2) implementation of MultiTagger and its use in Italian tagging.

Regards
Mauro

==

Index: src/main/java/org/languagetool/language/Italian.java
===
--- src/main/java/org/languagetool/language/Italian.java (revision 9109)
+++ src/main/java/org/languagetool/language/Italian.java (working copy)
@@ -31,11 +31,14 @@
import org.languagetool.rules.WordRepeatRule;
import org.languagetool.rules.it.MorfologikItalianSpellerRule;
import org.languagetool.tagging.Tagger;
+import org.languagetool.tagging.disambiguation.Disambiguator;
+import org.languagetool.tagging.disambiguation.rules.it.ItalianRuleDisambiguator;
import org.languagetool.tagging.it.ItalianTagger;

public class Italian extends Language {

private Tagger tagger;
+ private Disambiguator disambiguator;

@Override
public String getName() {
@@ -71,6 +74,14 @@
}

@Override
+ public final Disambiguator getDisambiguator() {
+ if (disambiguator == null) {
+ disambiguator = new ItalianRuleDisambiguator();
+ }
+ return disambiguator;
+ }
+
+ @Override
public Contributor[] getMaintainers() {
final Contributor contributor = new Contributor("Paolo Bianchini");
return new Contributor[] { contributor };
Index: src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java
===
--- src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (revision 9109)
+++ src/main/java/org/languagetool/rules/spelling/morfologik/MorfologikSpellerRule.java (working copy)
@@ -20,6 +20,7 @@
package org.languagetool.rules.spelling.morfologik;

import org.languagetool.AnalyzedSentence;
+import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.JLanguageTool;
import org.languagetool.Language;
@@ -78,11 +79,17 @@
return toRuleMatchArray(ruleMatches);
}
}
+ skip:
for (AnalyzedTokenReadings token : tokens) {
final String word = token.getToken();
if (ignoreWord(word) || token.isImmunized()) {
continue;
}
+ for (AnalyzedToken at : token.getReadings()) {
+ if (!at.hasNoTag())
+ continue skip; // if it HAS a POS tag then it is a known word.
+ }
+
if (tokenizingPattern() == null) {
ruleMatches.addAll(getRuleMatch(word, token.getStartPos()));
} else {
Index: src/main/java/org/languagetool/tagging/BaseTagger.java
===
--- src/main/java/org/languagetool/tagging/BaseTagger.java (revision 9109)
+++ src/main/java/org/languagetool/tagging/BaseTagger.java (working copy)
@@ -56,9 +56,6 @@
@Override
public List<AnalyzedTokenReadings> tag(final List<String> sentenceTokens)
throws IOException {
- List<AnalyzedToken> taggerTokens;
- List<AnalyzedToken> lowerTaggerTokens;
- List<AnalyzedToken> upperTaggerTokens;
final List<AnalyzedTokenReadings> tokenReadings = new ArrayList<AnalyzedTokenReadings>();
int pos = 0;
// caching IStemmer instance - lazy init
@@ -70,32 +67,37 @@
for (String word : sentenceTokens) {
final List<AnalyzedToken> l = new ArrayList<AnalyzedToken>();
final String lowerWord = word.toLowerCase(conversionLocale);
- taggerTokens = asAnalyzedTokenList(word, dictLookup.lookup(word));
- lowerTaggerTokens = asAnalyzedTokenList(word, dictLookup.lookup(lowerWord));
final boolean isLowercase = word.equals(lowerWord);

//normal case
- addTokens(taggerTokens, l);
+ {
+ List<AnalyzedToken> taggerTokens;
+ taggerTokens = asAnalyzedTokenList(word, dictLookup.lookup(word));
+ addTokens(taggerTokens, l);
+ }

if (!isLowercase) {
//lowercase
+ List<AnalyzedToken> lowerTaggerTokens;
+ lowerTaggerTokens = asAnalyzedTokenList(word, dictLookup.lookup(lowerWord));
addTokens(lowerTaggerTokens, l);
}

//uppercase
- if (lowerTaggerTokens.isEmpty() && taggerTokens.isEmpty()) {
- if (isLowercase) {
- upperTaggerTokens = asAnalyzedTokenList(word,
- dictLookup.lookup(StringTools.uppercaseFirstChar(word)));
- if (!upperTaggerTokens.isEmpty()) {
- addTokens(upperTaggerTokens, l);
- } else {
- l.add(new AnalyzedToken(word, null, null));
- }
- } else {
- l.add(new AnalyzedToken(word, null, null));
+ if (isLowercase && l.isEmpty()) {
+ List<AnalyzedToken> upperTaggerTokens;
+ upperTaggerTokens = asAnalyzedTokenList(word,
+ dictLookup.lookup(StringTools.uppercaseFirstChar(word)));
+ if (!upperTaggerTokens.isEmpty()) {
+ addTokens(upperTaggerTokens, l);
}
}
+
+ //still empty? last resort...
+ if (l.isEmpty()) {
+ l.add(new AnalyzedToken(word, null, null));
+ }
+
tokenReadings.add(new AnalyzedTokenReadings(l, pos));
pos += word.length();
}
Index: src/main/java/org/languagetool/tagging/MultiTagger.java
===

Re: Dynamic dictionary handling

2013-01-19 Thread Mauro Condarelli
On 20/01/2013 01:45, Dominique Pellé wrote:
> Mauro Condarelli wrote:
>
>> Currently I have multi-dictionary capability and I (slightly) modified
>> MorfologikSpellerRule to accept without further action words having POS
>> tags.
> Hi Mauro
>
> We need to be able to turn this on/off per language.
> Is this the case?
>
> What you describe will be useful in Breton at least, where the dictionary
> for POS tag has some good words which are not in Hunspell.
>
> In Esperanto, it will not work at all because the POS tagger is not
> dictionary based. Some of the words which have a POS tag can
> still be considered as a typo. It may seem strange but the Esperanto
> Hunspell has many missing words: it's hard to list all valid words
> in Esperanto because it's an agglutinative language. But because
> the language is regular, instead of using a dictionary, the Esperanto
> tagger can use an algorithm based on word endings: words ending
> in *o are nouns, *oj are plural nouns, *a are adjectives, *e are
> adverbs, etc.
>
> In French, I will also turn it off, because the POS tag dictionary
> and Hunspell are based on the same dictionary (http://www.dicollect.org),
> but they have different tokenization. Tokenization for Hunspell, for
> example, does not split on apostrophes, so "l'haricot" is recognized
> as a typo. But for grammar checking, it is split on the apostrophe.
> So ignoring typos for words that have a POS tag would miss genuine typos
> in French such as "L'haricot". There is nothing to gain from this
> change anyway for French, because the Hunspell dictionary is very
> good.
>
> Regards
> Dominique
>
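Dominique's ending-based approach for Esperanto could be sketched roughly as follows. This is a hypothetical illustration in plain Java, not LT's actual Esperanto tagger; the class name, tag strings, and the tiny ending table are all invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsperantoEndingTagger {

  // Longer endings first so "-oj" matches before "-o"; LinkedHashMap keeps the order.
  private static final Map<String, String> ENDINGS = new LinkedHashMap<>();
  static {
    ENDINGS.put("oj", "noun:plural");
    ENDINGS.put("o", "noun");
    ENDINGS.put("a", "adjective");
    ENDINGS.put("e", "adverb");
  }

  /** Returns a POS tag guessed from the word ending, or null if no ending matches. */
  public static String tag(String word) {
    final String w = word.toLowerCase();
    for (Map.Entry<String, String> entry : ENDINGS.entrySet()) {
      if (w.endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    return null;  // no known ending -> leave untagged
  }

  public static void main(String[] args) {
    System.out.println(tag("hundoj"));  // noun:plural
    System.out.println(tag("bela"));    // adjective
  }
}
```

Note that such a tagger assigns a POS tag to almost any string with a plausible ending, which is exactly why, as Dominique points out, "has a POS tag" cannot mean "is correctly spelled" in Esperanto.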
This needs to be discussed a bit before I proceed.
To clarify:
The current patch can't be disabled per-language.
It would not be a problem to modify it so that it can be.
The reason for the patch is that I'm building a multi-tier tagger based on 
both BaseTagger and ManualTagger.
The general idea is to have three possible tagging dictionaries:
1) the global language dictionary;
2) a user dictionary;
3) a dictionary specific to the file being processed.

This is mainly useful for proper names (including place names), neologisms 
and foreign words we might want to intersperse in the checked text, but it 
could also be a good way to improve the standard dictionary if we could 
ask users to send over their "improvements".
It would be possible (and I plan to implement it) to fully update the main 
tagger dictionary with "user suggestions".

I plan to add some API to dynamically manage these dictionaries.
The options should be:
a) ignore for the current session; nothing is saved on disk.
b) save as:
   i) a local word used in the document (or group of documents, or 
application using LT);
   ii) user-specific;
   iii) global: this is a real word not covered by the current tagger: add it!
This means I can have up to seven tagging dictionaries active at once.
On the other hand, the current Hunspell-based strategy is static; for 
this reason I need a way to know whether the word has already been found 
somewhere or whether a further check is in order.
I chose to use the presence of a POS tag, hence the patch, but I'm 
open to suggestions.
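The three-tier lookup described above could behave like this. A minimal sketch under stated assumptions: the class name and the plain maps are hypothetical stand-ins; the real implementation would chain BaseTagger/ManualTagger instances rather than in-memory maps:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MultiTierLookup {

  // Tiers in priority order: global dictionary, user dictionary, document dictionary.
  private final List<Map<String, String>> tiers;

  public MultiTierLookup(List<Map<String, String>> tiers) {
    this.tiers = tiers;
  }

  /** Returns the POS tag from the first tier that knows the word. */
  public Optional<String> lookup(String word) {
    for (Map<String, String> dict : tiers) {
      String pos = dict.get(word);
      if (pos != null) {
        return Optional.of(pos);
      }
    }
    return Optional.empty();  // unknown in all tiers -> candidate for the speller
  }

  public static void main(String[] args) {
    Map<String, String> global = Map.of("casa", "NOUN");
    Map<String, String> user = Map.of("Mauro", "PROPER_NOUN");
    MultiTierLookup lookup = new MultiTierLookup(List.of(global, user));
    System.out.println(lookup.lookup("Mauro").orElse("unknown"));  // PROPER_NOUN
    System.out.println(lookup.lookup("xyzzy").orElse("unknown"));  // unknown
  }
}
```

An empty result from all tiers is exactly the case where the speller would still need to run, which is what the "POS tag present = known word" shortcut in the patch is meant to capture.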

Two possible courses of action come to mind:
either add yet another flag to AnalyzedTokenReadings (e.g. spellOk),
or re-check all dictionaries in MorfologikSpellerRule (this seems like 
overkill, especially for languages where tokenizer and speller are 
based on the same dictionary).

Please advise.
Mauro



Re: Maven?

2013-01-22 Thread Mauro Condarelli

On 22/01/2013 14:18, Marco A.G.Pinto wrote:

Hello!

I am a bit confused and worried about the new changes going on.

Is Maven similar to Tortoise?

Can I still use Tortoise or must I switch to Maven? If so, is there a 
Windows version of it?


Thanks!

Kind regards,
Marco A.G.Pinto
Maven is a build system (think "ant") combined with module testing 
and a reusable module repository.

It is not a full-fledged SCM.

I'm not sure about Daniel's plans, but my guess is we will continue 
using SVN (and thus TortoiseSVN, if you like) as the source code manager, 
but the content of the project will be refactored somewhat to take 
advantage of Maven's modular nature.

AFAIK Maven will be used to compile/test/deploy LT.

Please have a look at: 
http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html

It should clarify Maven's purpose.
You will see that the current project structure is not very far from 
Maven's "recommended" one.
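For illustration, each Maven module is described by a small pom.xml. A hypothetical fragment (the coordinates below are invented, not LanguageTool's real ones) looks like this:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- illustrative coordinates, not LanguageTool's actual ones -->
  <groupId>org.example</groupId>
  <artifactId>my-module</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
```

Running "mvn install" in such a module compiles it, runs its tests, and copies the resulting jar into the local repository under ~/.m2/repository, which is the "reusable module repository" mentioned above.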


Regards
Mauro


Re: switching to Maven - done!

2013-01-27 Thread Mauro Condarelli
On 24/01/2013 21:27, Daniel Naber wrote:
> On 24.01.2013, 13:15:38 Jaume Ortolà i Font wrote:
>
>> I can run the GUI with a file now named
>> "languagetool-standalone-2.1-SNAPSHOT.jar", but I had to unzip it
>> previously. And I cannot find a command-line application like the
>> previous "LanguageTool.jar". Must I write one myself?
> You can use this for now (I just made an update, the class was still missing):
> java -cp languagetool-standalone-2.1-SNAPSHOT.jar 
> org.languagetool.commandline.Main
>
> We can either add script files or configure Maven to create
> another JAR for the command line version. Help is welcome, my list of
> "post Maven switch" TODOs is still quite long.
>
Sorry to disturb, people.
I had been using Eclipse previously.
Now I have followed the instructions for the Maven repackaging.
Everything went OK, but I can't start the command line:

mcon@vmrunner:/srv/Store/Language/languagetool/languagetool-standalone/target$ 
java -cp languagetool-standalone-2.1-SNAPSHOT.jar 
org.languagetool.commandline.Main
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/languagetool/Language
Caused by: java.lang.ClassNotFoundException: org.languagetool.Language
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.languagetool.commandline.Main. 
Program will exit.
mcon@vmrunner:/srv/Store/Language/languagetool/languagetool-standalone/target$

I assume something is missing from the classpath.
Obviously I checked, and the org.languagetool.commandline.Main class *is* in 
the .jar, but tons of other things are missing.
Can someone, pretty please, post a complete and working classpath? (If 
relevant: I'm using Ubuntu 12.04.)

TiA
Mauro



Re: switching to Maven - done!

2013-01-28 Thread Mauro Condarelli

On 28/01/2013 09:51, Jaume Ortolà i Font wrote:

2013/1/28 Mauro Condarelli <mc5...@mclink.it>:

Sorry to disturb, people.
I've been using Eclipse previously.
Now I followed instructions for the maven repack.
Everything went ok, but I can't start the commandline:


mcon@vmrunner:/srv/Store/Language/languagetool/languagetool-standalone/target$
java -cp languagetool-standalone-2.1-SNAPSHOT.jar
org.languagetool.commandline.Main
Exception in thread "main" java.lang.NoClassDefFoundError:
org/languagetool/Language


Hi Mauro,

I think you have to run the files not from the "target" folders but 
from your local maven repository.


By default, Maven local repository is:
Unix/Mac OS X -- ~/.m2
Windows -- C:\Documents and Settings\{your-username}\.m2

Jaume


Thanks Jaume,
something is still missing.
My .m2/repository directory contains a lot of libraries (.jar and .pom 
files), deeply nested, but nothing about LanguageTool itself.
I can try to patch together a classpath myself, but I strongly suspect 
Maven should be able to help.

The actual classpath will have to be a mix of .m2/repository AND target.
Can someone point me in the right direction, please?

TiA
Mauro