On Sat, Aug 11, 2012 at 11:55 AM, Tino Didriksen
<tino.didrik...@gmail.com> wrote:

> On Fri, Aug 10, 2012 at 10:16 PM, Mikel Artetxe <artet...@gmail.com> wrote:
>
>> 1) Invoke it as an external program.
>>
> Probably the easiest to get working, but does add a silly text generation
> and parsing step.
>

I've implemented it at revision 40279
<http://apertium.svn.sourceforge.net/viewvc/apertium?view=revision&revision=40279>.
My code is quite clumsy and needs more work: it assumes that all the
parameters are paths to existing files and always tries to extract them, it
doesn't handle whitespace, and it copies files to a temporary directory
even when they are directly accessible. Still, it does the trick. It is now
possible to use Apertium Caffeine or the OmegaT plug-in with language pairs
that depend on CG, as long as you have CG installed on your machine. I
haven't created packages for those language pairs (and I think that we
shouldn't), so you will need to create the packages yourself and install
them manually.
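
For reference, option 1 can be sketched in Java roughly as follows. This is a minimal illustration and not the code committed at the revision above: the class name ExternalFilter and the method run are made up for the example, and the command to execute is left to the caller, so no particular CG executable or flags are assumed. Passing the command as a list of arguments (instead of one string that gets split) sidesteps the whitespace problem mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: pipes text through an external program.
public class ExternalFilter {

    /** Sends input to the command's stdin and returns what it writes to stdout. */
    public static String run(List<String> command, String input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the child's error messages show up on our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        final Process process = builder.start();
        final byte[] bytes = input.getBytes(StandardCharsets.UTF_8);

        // Feed stdin from a separate thread so a full pipe buffer cannot deadlock us.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(bytes);
            } catch (IOException ignored) {
                // The child closed its stdin early; the exit code check reports real failures.
            }
        });
        writer.start();

        // Collect everything the child prints until it closes stdout.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stdout.read(buffer)) != -1) {
                collected.write(buffer, 0, read);
            }
        }
        writer.join();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + command + " exited with code " + exitCode);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ExternalFilter <command> [arguments...]");
            return;
        }
        // Placeholder input; a real caller would pass the output of the previous stage.
        System.out.print(run(Arrays.asList(args), "sample input\n"));
    }
}
```

The text generation and parsing step that Tino mentions is visible here: everything crosses the process boundary as UTF-8 text.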



>
>
>>  2) Create a Java interface for CG using JNI. ... For instance, just
>> looking at the installation instructions I see that it depends on some
>> external libraries, so things start getting more complex...
>>
> Boost is header-only, so doesn't add any files to the distribution.
> libtcmalloc is optional.
> ICU is the heavy one. I've looked at removing ICU and making a UTF-8-only
> version of CG-3, since everyone uses just UTF-8 these days. The key problem
> with that is regular expressions: I pass regex off to ICU's very nice
> Unicode character class (e.g. \p{Katakana}) capable regex engine.
> From what I could find, the only C++ engines capable of UTF-8 and Unicode
> character classes are ICU and PCRE, so that would be trading one library
> for another less capable one.
>

I don't have much experience with JNI, but I would say that it would
probably be trickier than it might seem...

libcg3.so takes 900 KB on my machine. The JARs would need to include a
library for at least Linux, Windows and OS X, so they would be about 3 MB
bigger (and that's without taking ICU into account). At the same time, we
would need to make it part of lttoolbox-java (or the programs that are
based on it), which would require compiling CG for different platforms and
using the NDK for Android. All this is certainly doable, but I think that
it would considerably increase the complexity of our current approach,
making it harder to maintain. And then there is ICU...

All in all, I think that this solution goes against one of the main
advantages of Java: portability. It would require compiling CG for each
platform that we want to support, and embedding the right binaries for
all of them. And we would probably have trouble making it work in
restricted environments like Java Web Start...
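
To give an idea of the bookkeeping involved, here is a minimal sketch of how a JAR might pick the right embedded binary. Everything specific in it is an assumption made for illustration: the class name NativeLibraryLocator, the /native/<platform>-<arch>/ resource layout and the library base name "cg3". Only System.getProperty and System.mapLibraryName are standard Java calls.

```java
import java.util.Locale;

// Hypothetical helper: works out where a bundled native library would live in the JAR.
public class NativeLibraryLocator {

    /** Maps an os.name value to an assumed per-platform resource directory. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check for Mac first: "darwin" contains "win".
        if (os.startsWith("mac") || os.contains("darwin")) {
            return "macosx";
        }
        if (os.startsWith("windows")) {
            return "windows";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        throw new UnsupportedOperationException("No bundled binary for " + osName);
    }

    /** Builds the assumed resource path, e.g. /native/linux-amd64/<library file name>. */
    static String resourcePath(String osName, String arch) {
        // mapLibraryName adds the prefix and suffix of the platform the JVM runs on.
        return "/native/" + platformDir(osName) + "-" + arch + "/"
                + System.mapLibraryName("cg3");
    }

    public static void main(String[] args) {
        System.out.println(resourcePath(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

The resource would then still have to be extracted to a temporary file and passed to System.load, since a native library cannot be loaded directly from inside a JAR. That extraction step is where restricted environments are likely to get in the way.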



>
>
>> 3) Develop a Java port of it. Probably the best solution but, obviously,
>> the hardest one to implement...
>>
> Haven't really looked into that as I consider JNI a better solution.
> But, it's all hash maps and hash sets, so maybe not that hard to convert.
>
> Again, regex is a significant feature and apparently only Java 7 and newer
> gets that right.
>

As far as I know, java.util.regex has been available since early versions
of Java (you can look here
<http://docs.oracle.com/javase/tutorial/essential/regex/> for more
details about what it offers). But perhaps Java 7 introduces some
significant improvements in this area, I don't know...
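
For what it's worth, a small sketch of the Unicode side of java.util.regex, since that is the part relevant to CG. Unicode block properties such as \p{InKatakana} have been in java.util.regex from the start, while script properties such as \p{IsKatakana} and the Pattern.UNICODE_CHARACTER_CLASS flag were added in Java 7, which may be what Tino is referring to. The class and method names are made up for the example, and the strings are written with Unicode escapes so that the source file encoding does not matter.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Unicode-aware matching with java.util.regex.
public class UnicodeRegexDemo {

    // "katakana" written in Katakana: U+30AB U+30BF U+30AB U+30CA.
    static final String KATAKANA_WORD = "\u30AB\u30BF\u30AB\u30CA";

    /** Block-based test, available in early versions of java.util.regex. */
    static boolean isKatakanaBlock(String s) {
        return Pattern.matches("\\p{InKatakana}+", s);
    }

    /** Script-based test, which needs Java 7 or newer. */
    static boolean isKatakanaScript(String s) {
        return Pattern.matches("\\p{IsKatakana}+", s);
    }

    /** \w is ASCII-only by default; the Java 7 flag makes it follow Unicode. */
    static boolean isWord(String s, boolean unicodeAware) {
        int flags = unicodeAware ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKatakanaBlock(KATAKANA_WORD));
        System.out.println(isKatakanaScript(KATAKANA_WORD));
        System.out.println(isWord("caf\u00E9", false));
        System.out.println(isWord("caf\u00E9", true));
    }
}
```

Whether this is close enough to what ICU's regex engine accepts for real CG grammars is something a port would still have to check.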
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
