Re: Extending Thai analyzer.

Nikolas Everett Fri, 07 Feb 2014 06:18:52 -0800

If you don't like the language analyzer, you have to rebuild it as a custom
analyzer and then add what you need to it.

{
  "analyzer": {
    "thai_with_ngram": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["standard", "lowercase", "thai", "thai_stop", "ngram"]
    }
  },
  "filter": {
    "thai": {
      "type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
    },
    "thai_stop": {
      "type": "stop",
      "stopwords_path": "org/apache/lucene/analysis/th/stopwords.txt"
    },
    "ngram": { your ngram configuration here }
  }
}

This builds it with your ngram configuration, I think.  I'm taking quite a
few educated guesses here, so I expect you will have to fiddle with it to
get it right.

How I did this:
1.  Open the class called ThaiAnalyzer in the Lucene version Elasticsearch
is using and find the method called createComponents.  For me this is
simple because I have Elasticsearch open in Eclipse.
2.  That method defines the tokenizer (standard) and some filters
(standard, lowercase, ThaiWordFilter, and stop).  You have to be able to
translate the class names to Elasticsearch's easier names to get this to
work properly.
3.  Now build it as a custom analyzer with your extra filter in there.
That is "thai_with_ngram" above.
4.  Next you'll need to define all the filters that don't exist by default
in Elasticsearch.  In this case that is thai, thai_stop, and your ngram
filter.  In order:
5.  The thai filter doesn't have an easy Elasticsearch mapping, so you have
to tell Elasticsearch the class name to load.  That class doesn't take any
configuration, so we're done.
6.  The thai_stop filter is just a regular stop word filter with Thai stop
words.  But Elasticsearch doesn't have an easy name to reference the Thai
stop words file.  That isn't too bad, as you can load the stopwords file
from the classpath.  It lives in Lucene at the path I added above.
7.  The ngram filter is yours to build but it is well documented.
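
For step 7, here is a minimal sketch of what the ngram entry might look
like, assuming Elasticsearch's built-in ngram token filter and its min_gram
and max_gram parameters.  The filter name thai_ngram and the gram sizes 2
and 3 are illustrative assumptions, not values from this thread, so pick
sizes that fit your data.  Older releases also accept the spelling nGram
for the type.  The sketch also shows where the analysis block sits when you
create the index: under settings, then analysis.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}
```

If you use a separate name like this, list "thai_ngram" in the analyzer's
filter array in place of "ngram".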

That took longer than I expected, but it was worth the exercise so I'll
remember how to do it again when I need it.  For reference, I do it for
English, which has more filters, but they all have easy names.
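
As a point of comparison for the English case, here is a sketch of the
english analyzer rebuilt as a custom analyzer, modeled on the equivalent
given in the Elasticsearch language analyzer documentation.  The names
rebuilt_english, english_stop, english_stemmer, and
english_possessive_stemmer are labels chosen for illustration, and which
stemmer languages are available depends on your Elasticsearch version, so
treat it as a starting point rather than an exact match for your release.

```json
{
  "analyzer": {
    "rebuilt_english": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "english_possessive_stemmer",
        "lowercase",
        "english_stop",
        "english_stemmer"
      ]
    }
  },
  "filter": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    },
    "english_stemmer": {
      "type": "stemmer",
      "language": "english"
    },
    "english_possessive_stemmer": {
      "type": "stemmer",
      "language": "possessive_english"
    }
  }
}
```

An extra filter such as ngram would be appended to the end of the filter
array, the same way as in the Thai example.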

Nik

On Fri, Feb 7, 2014 at 12:59 AM, Min Cha <minslo...@gmail.com> wrote:

> Hi folks.
>
> I would like to develop a search system for the Thai language.
> First of all, I found the Thai analyzer and it seemed good.
>
> However, it doesn't meet all of my requirements, so I decided to
> extend it.
> For example, I would like to add an nGram token filter to the Thai
> analyzer without making any changes to the analyzer itself.
>
> How can I do this?
> Please give me some advice.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/5041f397-8732-413f-8e50-46e25610c639%40googlegroups.com
> .
> For more options, visit https://groups.google.com/groups/opt_out.
>

