Re: how to get char_filter to work?

IronMike Fri, 08 Aug 2014 17:16:26 -0700

Thanks again. I wasn't expecting it to remove what's between the tags. I
believe I understand the behavior now; maybe it's a case where I was greedy
and expected Elasticsearch to do it all.

Here is the scenario I was looking for: assume I want an excerpt of text
(extracted from a document). An Elasticsearch query will give me the
excerpt with HTML tags, but the tags are out of context, so I would have
liked to be able to display the excerpt with no HTML tags. I know I can
probably strip the tags after the fact, but that's what I was trying to
avoid. In other words, in a perfect world, I would have liked two versions
of the document: the original HTML one and a stripped one. When I need
things like excerpts, I would query the stripped one, and when I need the
HTML, I would query the source. Hopefully I didn't make this more
confusing.
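
For what it's worth, here is a minimal sketch of the "two versions" idea,
assuming the stripping is done on the client before indexing (the step I
was hoping to avoid). The field names message_html and message_text are
made up for illustration, and the stripping uses Python's standard
html.parser, not anything built into Elasticsearch:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text that is not part of a tag.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return the text content of an HTML fragment, tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return stripper.text()


def build_document(html):
    """Build a document body holding both versions of the text.

    "message_html" and "message_text" are hypothetical field names.
    """
    return {"message_html": html, "message_text": strip_tags(html)}


doc = build_document(
    "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
)
print(doc["message_text"])  # trying out Elasticsearch, This is an html test
```

With a document shaped like this, excerpts could be requested from the
stripped field while the original markup stays available in the other
field. As in the html_strip case, the word "html" inside the content
survives; only the tags are removed.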


On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:
>
> The tokens that appear in the analyze API are the ones that are put into
> the inverted index. When you search for one of the terms that is not an
> HTML tag, there will be a match. What I don't understand, after reading
> your original post in detail, is exactly what behavior you are expecting.
>
> You indexed the phrase
> <html>trying out <b>Elasticsearch</b>, This is an html test</html>
>
> but you expected a query for the term "html" to not match. However, the
> word "html" is clearly in the content. The html stripper will not remove
> the contents in between the tags, just the tags themselves. The analyze API
> should show you the correct terms.
>
> Lucene has more control over what information you can retrieve, but the 
> only way to get the analyzed token stream back from Elasticsearch is to use 
> the analyze API on the field. Most people do not want an analyzed token 
> stream, just the original field.
>
> -- 
> Ivan
>
>
> On Fri, Aug 8, 2014 at 12:01 PM, IronMike <sabda...@gmail.com> wrote:
>
>> Also, here is a link to someone who had the same problem; I am not sure
>> if there was a final answer to that one:
>> http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
>>
>> I have to admit that I am a bit confused about this topic now. I
>> understand that analyzers tokenize the sentence and, in the case of
>> html_strip, strip the HTML, and _analyze works fine using the analyzer.
>> What I am failing to understand is how I can get at the results of these
>> tokens. Isn't the whole idea to be able to search for those tokens
>> eventually?
>>
>> If not, what's the solution for what I would think is a common scenario:
>> having to index HTML documents, where the HTML tags don't need to be
>> indexed, while keeping the original HTML for presentational purposes? Any
>> ideas (besides having to strip the HTML tags manually before indexing)?
>>
>>
>> On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:
>>>
>>> Thanks for explaining. So, is there a way to get the non-HTML text
>>> from the index? I thought I read that it was possible to index without the
>>> HTML tags while keeping the source intact. So, how would I get at the
>>> indexed text without the HTML tags, if you will?
>>>
>>> On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:
>>>>
>>>> The field is derived from the source and not generated from the tokens.
>>>>
>>>> If we indexed the sentence "The quick brown foxes jumped over the lazy 
>>>> dogs" with the english analyzer, the tokens would be
>>>>
>>>> http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english
>>>>
>>>> quick brown fox jump over lazi dog
>>>>
>>>> After applying stopwords and stemming, the tokens do not form a 
>>>> sentence that looks like the original.
>>>>
>>>> -- 
>>>> Ivan
>>>>
>>>>
>>>> On Fri, Aug 8, 2014 at 9:42 AM, IronMike <sabda...@gmail.com> wrote:
>>>>
>>>>> Ivan,
>>>>>
>>>>> The search results I am showing are for the field "title", not for the
>>>>> source. I thought I could query the field rather than the source and see
>>>>> it with no HTML while the source stayed intact. Did I misunderstand?
>>>>>
>>>>>
>>>>> On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:
>>>>>
>>>>>> The analyzers control how text is parsed/tokenized and how terms are 
>>>>>> indexed in the inverted index. The source document remains untouched.
>>>>>>
>>>>>> -- 
>>>>>> Ivan
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 8, 2014 at 9:24 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>>
>>>>>>> I also used Clint's example and tried to map it to a document and
>>>>>>> search the field, but I am still getting HTML in the query results.
>>>>>>> Here is my code. I appreciate the help.
>>>>>>>
>>>>>>> //Tokenizer
>>>>>>>
>>>>>>> PUT /foo/
>>>>>>> {
>>>>>>>  "settings": {
>>>>>>>    "index" : {
>>>>>>>       "analysis" : {
>>>>>>>          "analyzer" : {
>>>>>>>             "test_1" : {
>>>>>>>                "char_filter" : [
>>>>>>>                   "html_strip"
>>>>>>>                ],
>>>>>>>                "tokenizer" : "standard"
>>>>>>>             }
>>>>>>>          }
>>>>>>>       }
>>>>>>>    }
>>>>>>>  }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Mapping
>>>>>>> PUT /foo/foo_type/_mapping
>>>>>>> {
>>>>>>>   "foo_type":{ 
>>>>>>>          "properties" : {
>>>>>>>                    "title": {
>>>>>>>                          "type":"string",
>>>>>>>                          "index": "analyzed", 
>>>>>>>                          "analyzer":"test_1"
>>>>>>>                          }
>>>>>>>                        }
>>>>>>>            }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> GET /foo/foo_type/_mapping
>>>>>>> {
>>>>>>>    "foo": {
>>>>>>>       "mappings": {
>>>>>>>          "foo_type": {
>>>>>>>             "properties": {
>>>>>>>                "date": {
>>>>>>>                   "type": "date",
>>>>>>>                   "format": "dateOptionalTime"
>>>>>>>                },
>>>>>>>                "title": {
>>>>>>>                   "type": "string",
>>>>>>>                   "analyzer": "test_1"
>>>>>>>                }
>>>>>>>             }
>>>>>>>          }
>>>>>>>       }
>>>>>>>    }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> ////Index/////////////
>>>>>>> PUT /foo/foo_type/1
>>>>>>> {
>>>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>>>     "title" : "The quick & <b>brown</b> fox"
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Search //////////
>>>>>>> GET /foo/_search?pretty=true
>>>>>>> {
>>>>>>>    "fields": ["title"], 
>>>>>>>     "query": {
>>>>>>>         "query_string": {
>>>>>>>             "query": "brown",
>>>>>>>             "analyzer": "test_1"
>>>>>>>         }
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Results showing html tags still//////
>>>>>>> "hits": [
>>>>>>>          {
>>>>>>>             "_index": "foo",
>>>>>>>             "_type": "foo_type",
>>>>>>>             "_id": "1",
>>>>>>>             "_score": 0.076713204,
>>>>>>>             "fields": {
>>>>>>>                "title": [
>>>>>>>                   "The quick & <b>brown</b> fox" 
>>>>>>>                ]
>>>>>>>             }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:
>>>>>>>
>>>>>>>> Have you checked Clint's example?
>>>>>>>>
>>>>>>>> https://gist.github.com/clintongormley/780895
>>>>>>>>  
>>>>>>>> Jörg
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 7, 2014 at 8:23 PM, IronMike <sabda...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I would like to strip HTML tags for indexing. Here is a simple
>>>>>>>>> example I have tried so far, but it doesn't seem to strip the HTML
>>>>>>>>> tags. Any ideas what's missing?
>>>>>>>>>
>>>>>>>>> //settings & Mappings
>>>>>>>>> POST twitter
>>>>>>>>> {
>>>>>>>>>   "mappings": {
>>>>>>>>>     "tweet" : {
>>>>>>>>>       "properties" : {
>>>>>>>>>         "message" : {
>>>>>>>>>           "type" :    "string",
>>>>>>>>>           "analyzer": "strip_html_analyzer"
>>>>>>>>>         },
>>>>>>>>>         "date" : {
>>>>>>>>>           "type" :   "date"
>>>>>>>>>         },
>>>>>>>>>         "name" : {
>>>>>>>>>           "type" :   "string"
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   },
>>>>>>>>>   "settings": {
>>>>>>>>>     "analysis": {
>>>>>>>>>       "analyzer": {
>>>>>>>>>         "strip_html_analyzer":{
>>>>>>>>>             "type":"custom",
>>>>>>>>>             "tokenizer":"standard",
>>>>>>>>>             "filter":"standard",
>>>>>>>>>             "char_filter":"my_html"
>>>>>>>>>         }
>>>>>>>>>       },
>>>>>>>>>       "char_filter": {
>>>>>>>>>           "my_html":{
>>>>>>>>>               "type":"html_strip"
>>>>>>>>>           }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> //Index a document
>>>>>>>>> PUT /twitter/tweet/1
>>>>>>>>> {
>>>>>>>>>     "name" : "mike",
>>>>>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>>>>>     "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> //query result for "html", I expect the query to return nothing
>>>>>>>>> //since it is supposed to strip the tag?
>>>>>>>>> "hits": {
>>>>>>>>>       "total": 1,
>>>>>>>>>       "max_score": 0.11626227,
>>>>>>>>>       "hits": [
>>>>>>>>>          {
>>>>>>>>>             "_index": "twitter",
>>>>>>>>>             "_type": "tweet",
>>>>>>>>>             "_id": "1",
>>>>>>>>>             "_score": 0.11626227,
>>>>>>>>>             "fields": {
>>>>>>>>>                "message": [
>>>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>                ]
>>>>>>>>>             },
>>>>>>>>>             "highlight": {
>>>>>>>>>                "message": [
>>>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an <em>html</em> test</html>"
>>>>>>>>>                ]
>>>>>>>>>             }
>>>>>>>>>          }
>>>>>>>>>       ]
>>>>>>>>>    }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "elasticsearch" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>>
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b3
>>>>>>>>> 8-4646-bc8f-a27896454515%40googlegroups.com 
>>>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
>

