Re: Help needed understanding analyzer behavior

2014-08-01 Thread Sina Tamanna
When I develop custom analyzers, I use the Analyze API to test them and 
see which tokens will be indexed. Take a look at 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
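
As a rough sketch of what such a check could look like, the snippet below 
only builds the request for the Analyze API and does not send it. The index 
name "contacts" is hypothetical, and the JSON body form (with "analyzer" and 
"text" keys) is an assumption based on the current reference; older releases 
took these values as query-string parameters instead.

```python
import json


def build_analyze_request(index, analyzer, text):
    """Return (method, path, body) for an Analyze API call; nothing is sent."""
    path = "/{}/_analyze".format(index)
    # Assumed request-body form: name the analyzer and the text to analyze.
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", path, body


# "contacts" is a made-up index name; "phone" is the analyzer from this thread.
method, path, body = build_analyze_request("contacts", "phone", "(111)222")
print(method, path)
print(body)
```

Sending that request to a node should list each token together with its 
position, which is the part worth inspecting when a phrase query misbehaves.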
 

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7baa88a7-1691-45d6-bb96-a7bf39813cb9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Help needed understanding analyzer behavior

2014-07-30 Thread Neko Escondido
Hi Nikolas,
Thank you very much for your feedback.  I was hoping to be able to search 
against the phone number field in normalized, original, or number-parts 
format.
If I convert the input into normalized format, then a search using the 
original or number-parts format will not return the desired result... 
Or am I misunderstanding your suggestion?
Multi-field indexing is an option, but I would like to avoid it if possible 
(so that the client executing the query does not have to know all the 
possible field names a phone number field might be mapped to)...
Once again, thank you very much for your feedback.  Does what I described 
above sound possible using a char filter or plugin?


On Wednesday, July 30, 2014 8:28:35 PM UTC-7, Nikolas Everett wrote:
>
> It's probably easier to do a char filter to remove all non digits. On the 
> other hand if you want to normalize numbers that sometimes contain area and 
> country code to numbers you'll probably want to do that outside of 
> elasticsearch or with a plugin. That gets difficult when you need to handle 
> non NANPA numbers. 



Re: Help needed understanding analyzer behavior

2014-07-30 Thread Nikolas Everett
It's probably easier to use a char filter to remove all non-digits. On the
other hand, if you want to normalize numbers that only sometimes contain an
area code and country code into one consistent form, you'll probably want to
do that outside of Elasticsearch or with a plugin. That gets difficult when
you need to handle non-NANPA numbers.
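
A minimal sketch of the effect such a char filter would have, written as 
plain Python rather than as Elasticsearch configuration. The function name 
is made up for illustration; inside Elasticsearch the equivalent would 
presumably be a "pattern_replace" char filter with a non-digit pattern and an 
empty replacement.

```python
import re

NON_DIGITS = re.compile(r"\D+")


def strip_non_digits(phone):
    """Drop every character that is not a digit, as the char filter would."""
    return NON_DIGITS.sub("", phone)


# Sample inputs are the formats quoted in the original question.
for raw in ["(111)222", "111.222.", "111-222-", "111 222 "]:
    print(repr(raw), "->", strip_non_digits(raw))

# A leading country code survives the filter, so this value differs:
print(strip_non_digits("1-(111)-222-"))
```

If the same filter runs at both index and query time, the first four formats 
collapse to the same digits, while the last example shows why numbers with an 
optional country code need the extra normalization mentioned above.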



Help needed understanding analyzer behavior

2014-07-30 Thread Neko Escondido
Hello community,

I'm having a problem understanding how my analyzer should work.  The result 
is different from what I expect.  :(

I have created a custom analyzer to index phone numbers, as below:

"analysis" : {
    "analyzer" : {
        "phone" : {
            "type" : "custom",
            "tokenizer" : "phone_tokenizer",
            "filter" : [ "phone_filter", "unique" ]
        }
    },
    "tokenizer" : {
        "phone_tokenizer" : {
            "type" : "pattern",
            "pattern" : "\\s*[a-zA-Z]+\\s*"
        }
    },
    "filter" : {
        "phone_filter" : {
            "type" : "word_delimiter",
            "preserve_original" : 1,
            "generate_number_parts" : 1,
            "catenate_numbers" : 1
        }
    }
}



The intention is to match query input such as:
 111222, 111.222., 111-222-, or 111 222 , (111)222, 
1-(111)-222-, etc. 
with records containing phone numbers such as:
 111.222., 111-222-, or 111 222 , (111)222, 
1-(111)-222-, etc. 

So with the search input (111)222 and query type "matchPhraseQuery", I 
thought the query would return the records with phone numbers such as 
111.222., 111-222-, etc., because the input (111)222 would be 
analyzed into 111222, 111, and 222.
Given that I have specified "catenate_numbers" in the filter for my "phone" 
analyzer, I would expect numbers that meet the following condition to be 
matched:
Match numbers that are indexed as ( 111 AND 222 ) OR 111222.
But the result is no match.  

Is my understanding incorrect?  With the search input (111)222 using 
matchPhraseQuery, I thought it would match all numbers that have 111222 
as the concatenated value, but it seems to match only numbers whose 
number parts are 111 and 222... 
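
The expectation above can be sketched in plain Python. This is only a rough 
model of the "phone" analyzer, not the real word_delimiter implementation: 
the function name is made up, the pattern tokenizer is ignored because the 
sample values contain no letters, and token positions are not modeled at all.

```python
import re


def phone_tokens(text):
    """Approximate the terms the "phone" analyzer might emit for one value."""
    parts = re.findall(r"\d+", text)      # generate_number_parts
    tokens = [text] + parts               # preserve_original, then the parts
    if len(parts) > 1:
        tokens.append("".join(parts))     # catenate_numbers
    unique = []
    for tok in tokens:                    # the "unique" filter drops repeats
        if tok not in unique:
            unique.append(tok)
    return unique


query_terms = phone_tokens("(111)222")
indexed_terms = phone_tokens("111.222.")
print(query_terms)
print(indexed_terms)
print(sorted(set(query_terms) & set(indexed_terms)))
```

Under this model the two formats share 111, 222, and 111222, which is the 
overlap described above. A phrase query also compares token positions, which 
this sketch leaves out, so sharing terms does not by itself guarantee a 
phrase match.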

Your feedback/help/input is greatly appreciated!!
Best regards


