Many thanks Alex,
For clarity, let me explain a bit more about what I would like to do.
I'd like to use MediaWiki as the base for this project.
The requirement is to search with plain strings, without grammatical
details, and retrieve data that carries those details.
For that, I am evaluating a MediaWiki extension called CirrusSearch.
CirrusSearch depends on Elasticsearch, which in turn depends on
Lucene.
CirrusSearch (and its dependencies) is used, for instance, by the Modern
Greek Wiktionary, and works correctly for Modern Greek grammatical details.
In that case, if you input αλφα it will also retrieve άλφα,
but for Ancient Greek, οργανον will not retrieve Ὄργανον,
since its grammatical details are specific to Ancient Greek and do not
appear to be supported.
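To make the expected behaviour concrete, here is a minimal sketch in plain Java. It uses only the JDK's java.text.Normalizer, not Lucene, Elasticsearch or ICU, and the class and method names are made up for illustration. It is only meant to show the kind of folding I am after, not how any of those projects implement it:

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene or CirrusSearch.
class GreekFoldSketch {

    // Folds accented (monotonic or polytonic) Greek to bare lowercase letters.
    static String fold(String input) {
        // NFD splits each precomposed letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches combining marks: accents, breathings, iota subscript.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Locale.ROOT keeps lowercasing independent of the default locale.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both the indexed word and the query would be folded the same way.
        System.out.println(fold("Ὄργανον").equals(fold("οργανον")));
        System.out.println(fold("άλφα").equals(fold("αλφα")));
    }
}
```

If both the indexed text and the query go through the same folding, οργανον and Ὄργανον end up as the same string, which is the behaviour I would like to get from the search.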
Since this kind of wiki search is ultimately based on Lucene, adding
this feature to Lucene would potentially make it available to
Wikimedia as well.
As Tim remarks in the following message, it seems that ICU is able to
support this.
I have to investigate this a little more and check whether CirrusSearch
uses ICU.
About the third link you provided:
https://issues.apache.org/jira/browse/LUCENE-1343
It seems that the first one I indicated:
https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java
does something similar, but specialized for Greek. That source also
converts some diacritics, but is missing many other characters.
Initially, my idea was to add extra normalization there.
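As a rough sketch of what that extra normalization could look like, the per-character mapping for the Greek Extended block (U+1F00 to U+1FFF) could be generated rather than typed by hand. This is plain JDK code, not the Lucene filter itself; the class name, the restriction to that one block, and the choice to fold everything to the lowercase base letter are my assumptions:

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: derives a folding table
// for the Greek Extended block from the Unicode canonical decompositions.
class GreekExtendedTableSketch {

    // Returns code point -> lowercase base letter for U+1F00..U+1FFF.
    static Map<Integer, Integer> buildTable() {
        Map<Integer, Integer> table = new TreeMap<>();
        for (int cp = 0x1F00; cp <= 0x1FFF; cp++) {
            // Skips unassigned code points and standalone accent symbols.
            if (!Character.isLetter(cp)) {
                continue;
            }
            String nfd = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            // After NFD the first code point is the base letter;
            // the combining marks that follow it are dropped.
            int base = nfd.codePointAt(0);
            table.put(cp, Character.toLowerCase(base));
        }
        return table;
    }

    public static void main(String[] args) {
        // Print the table as "U+XXXX -> U+YYYY" lines for inspection.
        buildTable().forEach((from, to) ->
                System.out.printf("U+%04X -> U+%04X%n", from, to));
    }
}
```

A table like this could then be reviewed by hand and turned into whatever per-character cases the filter needs.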
I'll do some more searching next week, in both the Lucene and
CirrusSearch docs, and I'll let you know.
Thanks to you and Tim for taking the time on this.
Regards
Paolo
On 21/11/2014 21:07, Alexandre Rafalovitch wrote:
Are you sure that's not something that's already addressed by the ICU
Filter?
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
If you follow the links to what's possible, the page talks about
Greek, though not ancient:
http://userguide.icu-project.org/transforms/general#TOC-Greek
There was also some discussion on:
https://issues.apache.org/jira/browse/LUCENE-1343
Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
On 21 November 2014 14:14, paolo anghileri
<paolo.anghil...@codegeneration.it> wrote:
For development purposes I need the ability in Lucene to normalize Ancient
Greek characters for all the cases of grammatical details such as accents,
diacritics and so on.
My need is to retrieve Ancient Greek words with accents and other
grammatical details from an input string without accents.
For example, the input οργανον (organon) should also retrieve Ὄργανον.
I am not a Lucene committer and I am new to this, so my question is about the
best practice for implementing this in Lucene, and possibly submitting a commit
proposal to the Lucene project management committee.
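I have done some searching and found this file in lucene-solr:
https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/el/GreekLowerCaseFilter.java
It contains normalization for some characters.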
My thought would be to add extra normalization there, covering all Unicode
Ancient Greek characters with all grammatical details.
I already have all the Unicode values for those characters, so it should not be
difficult for me to include them.
If my understanding is correct, this would add the features described
above to Lucene.
As I am new to this, my questions are:
- Is this the correct place in Lucene for doing this normalization?
- How do I submit a commit proposal?
Any help appreciated.
Kind regards
Paolo
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org