Re: Lucene ancient greek normalization

paolo anghileri Fri, 21 Nov 2014 13:14:28 -0800

Many thanks Alex,

For clarity, let me explain a bit more about what I would like to do:
I'd like to use MediaWiki as a base for this project.

The requirement is to be able to search with simple strings without grammatical details (accents and other diacritics) and still retrieve data that carries those details.

For that, I am evaluating a MediaWiki extension called CirrusSearch.

CirrusSearch depends on Elasticsearch, which in turn depends on Lucene.

CirrusSearch (and its dependencies) is used, for instance, by the Modern Greek Wiktionary, and it works correctly for Modern Greek grammatical details.


In that case, if you input αλφα it will also retrieve άλφα,

but in the case of Ancient Greek, οργανον will not retrieve Ὄργανον, since its grammatical details are specific to Ancient Greek and do not appear to be supported.

Since this kind of wiki search is ultimately based on Lucene, adding this feature to Lucene would potentially make it available to Wikimedia as well.

As Tim remarks in the following message, it seems that ICU is able to support this.

I have to investigate this a little more, and check whether CirrusSearch uses ICU.


About the third link you provided:

https://issues.apache.org/jira/browse/LUCENE-1343

It seems that the first link I indicated:

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java

does something similar, but specialized for Greek. That source also converts some diacritics, but it is missing many other characters.

Initially, my idea was to add extra normalization there.
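
To make the intended behaviour concrete, here is a minimal sketch of that kind of folding. It is written in Python with only the standard library, not as Lucene code, and it is neither the GreekLowerCaseFilter nor the ICU implementation. The function name fold_greek is hypothetical, and the sketch assumes that both the indexed text and the query string are folded the same way before they are compared.

```python
import unicodedata


def fold_greek(text: str) -> str:
    """Illustrative folding: strip diacritics, lowercase, unify final sigma."""
    # NFD splits each precomposed letter into a base letter plus combining
    # marks, e.g. U+1F4C becomes U+039F + U+0313 (psili) + U+0301 (oxia).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0, which drops
    # accents, breathings, the iota subscript and the diaeresis.
    base_only = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    )
    # Lowercase, then map final sigma to medial sigma so both forms match.
    return base_only.lower().replace("\u03c2", "\u03c3")


if __name__ == "__main__":
    for word in ("Ὄργανον", "άλφα", "οργανον"):
        print(word, "->", fold_greek(word))
```

With this folding, Ὄργανον and οργανον reduce to the same string, which is the matching behaviour I am looking for. A table of individual code points, as in the Greek filter, would have to list every accented Ancient Greek character explicitly, whereas decomposition handles them generically.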

I'll do some more searching next week, in both the Lucene and CirrusSearch docs, and I'll let you know.



Thanks to you and Tim for taking the time on this.

Regards

Paolo









On 21/11/2014 21:07, Alexandre Rafalovitch wrote:

Are you sure that's not something that's already addressed by the ICU
Filter? 
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html

If you follow the links to what's possible, the page talks about
Greek, though not ancient:
http://userguide.icu-project.org/transforms/general#TOC-Greek

There was also some discussion on:
https://issues.apache.org/jira/browse/LUCENE-1343

Regards,
    Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 21 November 2014 14:14, paolo anghileri
<paolo.anghil...@codegeneration.it> wrote:

For development purposes I need the ability in Lucene to normalize Ancient
Greek characters for all the cases of grammatical details such as accents,
diacritics and so on.

My need is to retrieve Ancient Greek words with accents and other
grammatical details from an input string without accents.

For example, the input οργανον (organon) should also retrieve Ὄργανον.


I am not a Lucene committer and I am new to this, so my question is about the
best practice to implement this in Lucene, and possibly submit a commit
proposal to the Lucene project management committee.

I did some searching and found this file in lucene-solr:


It contains normalization for some characters.
My thought would be to add extra normalization here, including all Unicode
Ancient Greek characters with all grammatical details.
I already have all the Unicode values for those characters, so it should not be
difficult for me to include them.

If my understanding is correct, this should add to lucene the features
described above.


As I am new to this, my needs are:

  To be sure that this is the correct place in Lucene for doing normalization
  To know how to post a commit proposal


Any help appreciated

Kind regards

Paolo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



