[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697952#action_12697952
 ] 

Otis Gospodnetic commented on LUCENE-1284:
------------------------------------------

Hi Felipe,

OK, I looked at this some more.  So the Java code you contributed is ASL and 
Apertium's tools (and data?) are GPL v2?

The thing that puzzles me is the language pairs themselves.  Why are they in 
pairs?  Is that simply for the translation part of Apertium, and something 
that's ignored when you use the pair for Lucene and morphological analysis?

If I'm interested in, say, a French morphological analyzer, why do I need any 
other language?  For French, I see:

* br-fr
* en-fr
* fr-ca
* fr-es

If I'm interested in French, which of the 4 above is the right one to use?  The 
one with the highest number of lemmata?

I had a look at the Indexer and Searcher to get an idea about the usage.  Those 
classes are really just for demonstration, right?  Still, do you mind replacing 
the deprecated Hits object in the Searcher class?

In the README you mention this:
{quote}
2. The Spanish morphological dictionary must be preprocessed in advance to 
remove multiword expressions:

$ java -classpath lucene-apertium-morph-2.4-dev.jar \
  org.apache.lucene.apertium.tools.RemoveMultiWordsFromDix \
  --dix apertium-es-ca.es.dix  > apertium-es-ca.es-nomw.dix
{quote}

Could you explain why the removal of multiword expressions is needed?
Is that Spanish-specific or something one needs to do regardless of the 
language?

Also:
{quote}
4. Each file to be indexed must be preprocessed using the Apertium tools:

$ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | 
apertium-tagger -g -f es-ca.prob > file.pos.txt
{quote}

So these are a few command-line tools that end up marking up the input text 
with POS? (I seem to be missing some libraries and can't compile Apertium 
locally to check what this marked-up file looks like.)
But my main question here is whether there are Java equivalents of these 
command-line tools, so that one can easily use them from Java?  Or is one 
forced to use Runtime.exec(...)?
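
To make the question concrete, here is a rough sketch of what shelling out to 
the step 4 pipeline from Java might look like.  It is only an illustration, 
not something from the patch: the class name ApertiumPipeline and its methods 
are made up, it assumes the three Apertium tools are on the PATH, and it 
relies on ProcessBuilder.startPipeline, which needs Java 9 or later (on an 
older JVM the stages would have to be wired together by hand).

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the contributed code.
public class ApertiumPipeline {

    // Builds the three stages of the README pipeline:
    //   apertium-destxt | lt-proc -a <automorfBin> | apertium-tagger -g -f <probFile>
    // The first stage reads from `in`, the last stage writes to `out`.
    static List<ProcessBuilder> stages(File in, File out,
                                       String automorfBin, String probFile) {
        ProcessBuilder destxt = new ProcessBuilder("apertium-destxt")
                .redirectInput(in);
        ProcessBuilder ltProc = new ProcessBuilder("lt-proc", "-a", automorfBin);
        ProcessBuilder tagger = new ProcessBuilder("apertium-tagger", "-g", "-f", probFile)
                .redirectOutput(out);
        List<ProcessBuilder> all = Arrays.asList(destxt, ltProc, tagger);
        for (ProcessBuilder pb : all) {
            // Let the tools' error messages show up on our own stderr.
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        }
        return all;
    }

    // Runs the pipeline and fails if any stage exits with a non-zero status.
    static void run(File in, File out, String automorfBin, String probFile)
            throws IOException, InterruptedException {
        List<Process> processes =
                ProcessBuilder.startPipeline(stages(in, out, automorfBin, probFile));
        for (Process p : processes) {
            int status = p.waitFor();
            if (status != 0) {
                throw new IOException("pipeline stage exited with status " + status);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // File names taken from the README example.
        run(new File("file.txt"), new File("file.pos.txt"),
            "es-ca-nomw.automorf.bin", "es-ca.prob");
    }
}
```

Something like this would work, but it ties indexing to natively installed 
tools and to one external process chain per file, which is why I am asking 
about pure-Java equivalents.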

Thanks.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1284
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1284
>             Project: Lucene - Java
>          Issue Type: New Feature
>         Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>            Reporter: Felipe Sánchez Martínez
>            Assignee: Otis Gospodnetic
>         Attachments: apertium-morph.0.9.0.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs to be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
