Analyzers and char_filters o_0 creepy outputs

georgi . mateev Thu, 22 May 2014 07:37:39 -0700

Hi! This is a sample setup, close to what I am working with

https://gist.github.com/anonymous/6e1457321a8ad78c6af8

As you can see, I am trying to remove the hyphens from all words, so that
words like "hand-made" are indexed as "handmade". The goal is for a search
for "handmade" to find all documents containing "hand-made" and vice
versa.
For some reason it doesn't work, though :(
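
For illustration, the intended analysis chain can be sketched in Python. This
is not the configuration from the gist: the names `strip_hyphens` and
`dehyphenated` are made up, the `standard` tokenizer and `lowercase` filter
are assumptions about the rest of the chain, and `toy_analyze` only imitates
what such an analyzer would do to a string.

```python
import re

# Hypothetical index settings (placeholder names, not taken from the gist):
# a `pattern_replace` char_filter deletes hyphens before the tokenizer runs.
SETTINGS = {
    "analysis": {
        "char_filter": {
            "strip_hyphens": {
                "type": "pattern_replace",
                "pattern": "-",
                "replacement": "",
            }
        },
        "analyzer": {
            "dehyphenated": {
                "type": "custom",
                "char_filter": ["strip_hyphens"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}


def toy_analyze(text):
    """Imitate the chain above: delete hyphens, split into words, lowercase.

    This is a rough stand-in for the real analyzer, not its implementation.
    """
    char_filter = SETTINGS["analysis"]["char_filter"]["strip_hyphens"]
    filtered = re.sub(char_filter["pattern"], char_filter["replacement"], text)
    return [token.lower() for token in re.findall(r"\w+", filtered)]
```

With a chain like this, `toy_analyze("hand-made")` and
`toy_analyze("handmade")` yield the same single token, which is the behaviour
I am after.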

I have also attached 3 sample queries. The expected result would be for all
of them to return the same result set.
1) Astonishingly, a search for "Chemie-ingenieur" finds 2 results, but a
search for "Chemieingenieur" finds none. This is pretty creepy to me, since
the char_filter is supposed to strip the hyphens prior to tokenizing in the
indexing process.
2) Another creepy fact is that if I specify the searchAnalyzer explicitly,
I find no results (see query 3) in this document set.
3) Moreover, the Analyze API shows that the search term "Chemie-ingenieur"
gets translated to "Chemieingenieur" using this analyzer.

4) And the creepiest fact is that when I run these queries against the
actual index data (800+ documents), I get 17 results for "Chemie-ingenieur"
and 22 for "Chemieingenieur", where NONE OF THEM OVERLAP. I.e. I have a
total of 39 documents that should be matching either of the queries. Some
of the documents that match "Chemie-ingenieur" actually don't contain the
word with the hyphen. So I would expect these documents to be contained in
both result sets, maybe with a different relevancy score. This is, however,
not the case.
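
The expectation behind these observations can be made concrete with a toy
inverted index in Python. It is only a sketch under assumptions: `analyze`,
`build_index` and `search` are hypothetical helpers, the two documents are
invented, and real Elasticsearch analysis and scoring are far richer. It shows
that when the same hyphen-stripping analysis runs at index time and at search
time, both spellings return the same documents, and that leaving the stripping
out is one way (not necessarily the cause here) to end up with disjoint result
sets.

```python
import re
from collections import defaultdict


def analyze(text, strip_hyphens=True):
    """Toy analyzer: optionally delete hyphens, then split and lowercase."""
    if strip_hyphens:
        text = text.replace("-", "")
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs, strip_hyphens=True):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, body in docs.items():
        for token in analyze(body, strip_hyphens):
            index[token].add(doc_id)
    return index


def search(index, query, strip_hyphens=True):
    """Return ids of documents that contain every token of the query.

    Requiring all tokens (AND) is a simplification chosen for this sketch.
    """
    tokens = analyze(query, strip_hyphens)
    if not tokens:
        return set()
    return set.intersection(*[index.get(token, set()) for token in tokens])


# Invented sample documents, one per spelling.
DOCS = {
    1: "Chemie-ingenieur gesucht",
    2: "Erfahrener Chemieingenieur",
}

# Hyphens stripped at index time and at search time: the expected setup.
consistent = build_index(DOCS, strip_hyphens=True)

# Hyphens never stripped: each spelling only matches itself.
unstripped = build_index(DOCS, strip_hyphens=False)
```

Searching `consistent` for either spelling returns both documents, whereas
searching `unstripped` with `strip_hyphens=False` returns a different single
document for each spelling.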

Please help me get over this; I have been struggling with it for a full
week already. I would be very grateful for an explanation too, apart from
a solution, since the output is very different from what I expect, which
means that I don't really understand the system.

P.S. Please focus on the actual problem and let's not discuss the mapping
in detail. The version I have pasted is quite different from what I
started with, due to the trial-and-error approach I have been
using for almost a week.

Thanks sincerely,
Georgi
