[Languagetool] Initial hunspell support
Hi all,

I just added preliminary hunspell support to LanguageTool. This is not yet ready for production, as essential parts are missing, but the infrastructure and some files are already there.

I didn't add all spelling dictionaries to the source. There's only one for Polish right now, as I needed at least one for testing. The dictionaries should go to the resources directory.

There are also some things we need to decide:

- Should we add country variant support right away? Right now the code in HunspellRule simply takes the first country variant declared in getCountryVariants(). The dictionary files need to be called short_lang_code_country_variant (for example, pl_PL for Polish); otherwise, they won't be found. We can make it more flexible; I only hacked the initial code to get it running and to see the performance impact.
- hunspell does not strip punctuation by itself, so I had to add some code for single-character punctuation marks. If there is a longer punctuation mark that should be ignored, let me know.
- I didn't split the build yet at all. The ant file is left without any changes, but we should probably split into two distribution files, as we don't want to bundle all languages with the LibreOffice extension. Also, the JNLP version won't work with hunspell right now just because of this.
- The dictionary loading methods in the hunspell library were not really URL- or stream-friendly, so I only hacked them to make them work. Sorry, it may all blow up. Probably they should be changed, but the hunspell library expects a filepath anyway, so we need to deal with absolute filepaths.
- I didn't add any new messages, so you will see "Possible typo" translations everywhere. Something different should be added to the messages file.
- To add hunspell support, you simply need to mention HunspellRule.class in your language class.
Regards,
Marcin

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats.
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
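The naming convention for dictionary files described in this message can be sketched in Java as follows. This is a minimal sketch, not the actual HunspellRule code: the class `DictionaryNames` and the method `baseName` are hypothetical, and it assumes the country variants are available as a `String[]` in declaration order.

```java
/** Hypothetical helper illustrating the naming convention; not LanguageTool code. */
final class DictionaryNames {

    /**
     * Builds the base name short_lang_code_country_variant, e.g. "pl_PL",
     * using the first declared country variant, as the message describes.
     */
    static String baseName(String shortLangCode, String[] countryVariants) {
        if (countryVariants == null || countryVariants.length == 0) {
            throw new IllegalArgumentException(
                "no country variant declared for " + shortLangCode);
        }
        return shortLangCode + "_" + countryVariants[0];
    }

    public static void main(String[] args) {
        // Prints the expected file name of the Polish dictionary.
        System.out.println(baseName("pl", new String[] {"PL"}) + ".dic");
    }
}
```

With several variants declared, only the first one is used, so a language declaring US and GB in that order would look for the en_US files only. That is the limitation the first open question is about.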
Re: [Languagetool] Initial hunspell support
Marcin,

I must have missed some info. What is the Hunspell integration meant to do exactly?

Ruud

On 18-05-12 15:10, Marcin Miłkowski wrote:
> Hi all, I just added preliminary hunspell support to LanguageTool. [...]
Re: [Languagetool] Initial hunspell support
On 2012-05-18 17:01, Ruud Baars wrote:
> Marcin, I must have missed some info. What is the Hunspell integration meant to do exactly?

Find spelling mistakes. We have a Java rule that uses Hunspell for this purpose now.

Marcin
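What such a spelling rule does can be illustrated roughly as below. This is only a sketch under stated assumptions: the class `SpellingSketch` is hypothetical, a plain `Set<String>` stands in for the hunspell dictionary lookup, and the input is assumed to be already tokenized with punctuation as separate tokens. It skips single-character punctuation marks, as described in the first message of the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a spelling check; a Set stands in for hunspell. */
final class SpellingSketch {

    /** True for one-character tokens that are neither letters nor digits. */
    static boolean isSingleCharPunctuation(String token) {
        return token.length() == 1 && !Character.isLetterOrDigit(token.charAt(0));
    }

    /** Returns the tokens that are not found in the stand-in dictionary. */
    static List<String> findMisspelled(List<String> tokens, Set<String> knownWords) {
        List<String> misspelled = new ArrayList<>();
        for (String token : tokens) {
            // Skip whitespace and single-character punctuation marks.
            if (token.trim().isEmpty() || isSingleCharPunctuation(token)) {
                continue;
            }
            if (!knownWords.contains(token)) {
                misspelled.add(token);
            }
        }
        return misspelled;
    }
}
```

In this sketch a longer punctuation token such as "..." is not skipped and would be reported as a misspelling, which is the kind of case that would need an explicit exception.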
Re: [Languagetool] Initial hunspell support
Okay, that is the kind of thing I have wanted for a long time. So finally, it is there.

Would you please check whether the -r option (from the command line) is supported? It was introduced to be able to warn about possible typos. It is not on by default, but it is very useful for Dutch, since the Dutch dictionary uses this option. (It was introduced for Dutch in a project sponsored by the Dutch Language Union.)

Ruud

On 18-05-12 17:27, Marcin Miłkowski wrote:
> Find spelling mistakes. We have a Java rule that uses Hunspell for this purpose now.
Re: [Languagetool] Initial hunspell support
On 2012-05-18 17:54, Ruud Baars wrote:
> Okay, that is the kind of thing I wanted to have long ago. So finally, it is there. Would you please check if the option of -r (from the command line) is supported? This was introduced to be able to warn for possible typos...

There is no command line for hunspell here; we use it as a library. I don't see a way to set any options in our current interface. They might be supported by the hunspell library, but there is just no code for that in Java.

> It is not on by default, but very useful for Dutch, since the Dutch dictionary uses this option. (It was introduced for Dutch in a Dutch Language Union sponsored project.)

OK, that means you'd need a special class for Dutch.

Marcin
Re: [Languagetool] [LanguageTool] SF.net SVN: languagetool:[6896] ...
Am I right that this method won't work out of the box, but requires changes to the Java code? I tried it, but the external files failed to load.

The only worry I have is that changes to the categories (like splitting a category etc.) will be more complicated.

Daniel Naber wrote:
> grammar.xml contains only the existing entity declarations and this:
>
>   <!ENTITY typos SYSTEM "classpath:/rules/de/typos.xml">
>   <!ENTITY grammar SYSTEM "classpath:/rules/de/grammar.xml">
>   <rules>
>     &typos;
>     &grammar;
>     (etc)
>   </rules>
>
> That means that the actual rules are included from typos.xml etc. The rules can still use the entities, without duplicating them. Does that sound okay to you?
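One possible shape of the Java change asked about here is an entity resolver that understands the `classpath:` prefix used in the system identifiers. The following is a minimal sketch, not the code that was committed: the class name `ClasspathEntityResolver` and the helper `toResourcePath` are hypothetical, and it assumes the rule files are reachable through `getResourceAsStream`. Only the standard SAX `EntityResolver` interface is used.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver for system IDs such as classpath:/rules/de/typos.xml. */
final class ClasspathEntityResolver implements EntityResolver {

    private static final String PREFIX = "classpath:";

    /** Returns the resource path for a "classpath:" system ID, or null otherwise. */
    static String toResourcePath(String systemId) {
        if (systemId == null || !systemId.startsWith(PREFIX)) {
            return null;
        }
        return systemId.substring(PREFIX.length());
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        String path = toResourcePath(systemId);
        if (path == null) {
            return null; // let the parser resolve other identifiers itself
        }
        InputStream in = ClasspathEntityResolver.class.getResourceAsStream(path);
        if (in == null) {
            throw new FileNotFoundException("not on classpath: " + path);
        }
        InputSource source = new InputSource(in);
        source.setSystemId(systemId); // keep the original ID for error messages
        return source;
    }
}
```

Such a resolver would be registered on the parser before loading grammar.xml, for example with `XMLReader.setEntityResolver` or `DocumentBuilder.setEntityResolver`. Without one, a standard parser has no handler for the `classpath:` scheme, which may be why the external files failed to load.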
Re: [Languagetool] Initial hunspell support
I hope what you are implementing is the full original Hunspell library, not a Java port. Dutch needs the latest version to work.

Ruud

On 18-05-12 19:17, Daniel Naber wrote:
> On Friday, 18 May 2012, Marcin Miłkowski wrote:
>
> Hi Marcin,
>
> > Now, I didn't add all spelling dictionaries to the source - there's only one for Polish right now as I needed at least one for testing. The dictionaries should go to resources directory.
>
> thanks for adding Hunspell support! What about moving the hunspell dict to another subdirectory "hunspell" in our sources, to make it more obvious that this is just copied from a different project?
>
> > - The dictionary loading methods in hunspell library were not really URL- or stream-friendly, so I only hacked them to make them work.
>
> You are talking about Hunspell.java, aren't you? Can you document the parts that you modified in the code (sorry if I missed it)? If it makes sense, we should also contribute the changes back so we can later use just the library, without copying any source files (even if it's just one or two files).
>
> Regards
> Daniel
Re: [Languagetool] Initial hunspell support
On 2012-05-18 20:05, Ruud Baars wrote:
> I hope what you are implementing is the full original Hunspell library, not a Java port.

It's not a port; it's the original hunspell code.

> Dutch needs the latest version to work.

Right. But Laszlo (I checked) does not offer, as far as I can see, any way to use the -r option through the library.

Marcin
Re: [Languagetool] Initial hunspell support
On 2012-05-18 19:17, Daniel Naber wrote:
> thanks for adding Hunspell support! What about moving the hunspell dict to another subdirectory "hunspell" in our sources, to make it more obvious that this is just copied from a different project?

Well, you mean adding hunspell under resources?

> You are talking about Hunspell.java, aren't you? Can you document the parts that you modified in the code (sorry if I missed it)? If it makes sense, we should also contribute the changes back so we can later use just the library, without copying any source files (even if it's just one or two files).

No, I did not modify this class. I used dirty hacks to use it without changes; I only cleaned up the imports and removed one space before a dot. Hardly an improvement. But we could add stemming and analyzing support, just in case someone has only a hunspell dictionary and nothing that could be a full tagger.

Regards,
Marcin
Re: [Languagetool] Initial hunspell support
On Friday, 18 May 2012, Marcin Miłkowski wrote:
> > What about moving the hunspell dict to another subdirectory "hunspell" in our sources, to make it more obvious that this is just copied from a different project?
>
> Well, you mean adding hunspell under resources?

I'd suggest using src/resource/pl/hunspell/pl_PL.dic instead of src/resource/pl/pl_PL.dic

Regards
Daniel

--
http://www.danielnaber.de
Re: [Languagetool] Initial hunspell support
On 2012-05-18 22:57, Daniel Naber wrote:
> I'd suggest using src/resource/pl/hunspell/pl_PL.dic instead of src/resource/pl/pl_PL.dic

Yes, that should make it easier for us to split builds. I will change it in a minute.

Regards,
Marcin
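Since the hunspell library expects a file path rather than a stream, one common workaround is to copy the bundled dictionary out of the resources into a temporary directory and hand over the resulting absolute path. The sketch below illustrates that idea only; it is not the hack that was committed. The class `DictionaryExtractor` and the method `extract` are hypothetical, and it assumes a hunspell dictionary consists of a .dic and an .aff file sharing one base name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that turns dictionary streams into files on disk. */
final class DictionaryExtractor {

    /**
     * Copies the .dic and .aff streams into a fresh temporary directory and
     * returns the absolute base path without extension (ending in e.g. pl_PL).
     */
    static Path extract(String baseName, InputStream dic, InputStream aff) throws IOException {
        Path dir = Files.createTempDirectory("hunspell");
        Files.copy(dic, dir.resolve(baseName + ".dic"), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(aff, dir.resolve(baseName + ".aff"), StandardCopyOption.REPLACE_EXISTING);
        return dir.resolve(baseName).toAbsolutePath();
    }
}
```

The streams could come from something like `getResourceAsStream("/pl/hunspell/pl_PL.dic")`, assuming the resource directory is mapped to the classpath root; that mapping is an assumption here. A real implementation would also want to clean up the temporary files on exit.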
Re: [Languagetool] Initial hunspell support
Sorry if this sounds stupid, but couldn't the LT extension for LO/AOO link to the dictionaries already installed in LO/AOO, without reinstalling them?

lp,
m.

2012/5/18 Marcin Miłkowski list-addr...@wp.pl:
> Yes, that should make it easier for us to split builds. I will change it in a minute.