Re: how to get char_filter to work?

IronMike Fri, 08 Aug 2014 12:01:27 -0700

Also, here is a link to a thread from someone who had the same problem. I am 
not sure whether it ever reached a final answer: 
http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
I have to admit that I am a bit confused about this topic now. I understand 
that analyzers tokenize the sentence and, in the case of html_strip, strip 
the HTML, and _analyze works fine with the analyzer. What I fail to 
understand is how I can get at the resulting tokens. Isn't the whole idea to 
be able to search on those tokens eventually?
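
To make sure I am reading Ivan's replies (quoted below) correctly, here is a 
toy model of what I think is going on. It is only a sketch in plain Python, 
not Elasticsearch code: html_strip, analyze and ToyIndex are names I made up, 
and the regex-based stripping is a crude stand-in for the real char filter.

```python
import re
from html import unescape


def html_strip(text):
    """Toy stand-in for a char filter: drop tags, decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def analyze(text):
    """Toy analyzer: strip HTML, then split into word tokens."""
    return re.findall(r"\w+", html_strip(text))


class ToyIndex:
    def __init__(self):
        self.sources = {}   # doc id -> original document, never modified
        self.postings = {}  # term -> set of doc ids (the inverted index)

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc
        for term in analyze(doc["title"]):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        ids = set()
        for term in analyze(query):
            ids |= self.postings.get(term, set())
        # Hits are rendered from the stored source, not from the tokens.
        return [self.sources[i] for i in sorted(ids)]


idx = ToyIndex()
idx.index(1, {"title": "The quick & <b>brown</b> fox"})
print(idx.search("brown"))  # the hit still shows the <b> tags
print(idx.search("b"))      # no hit: the tag name was never indexed
```

If this model is right, the tokens only decide which documents match, and 
whatever comes back in a hit is the original text.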


If not, what is the solution for what I would think is a common scenario: 
indexing HTML documents where the HTML tags don't need to be indexed, while 
keeping the original HTML for presentation purposes? Any ideas (besides 
having to strip the HTML tags manually before indexing)?
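
For comparison, the manual fallback I am hoping to avoid would look roughly 
like the sketch below. It uses Python's standard html.parser module. The 
title_plain field name and the helper names are hypothetical, made up for 
illustration and not taken from any Elasticsearch API.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with the tags removed."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def with_plain_title(doc):
    """Copy doc and add a hypothetical 'title_plain' field without tags."""
    out = dict(doc)
    out["title_plain"] = strip_tags(doc["title"])
    return out


doc = {"date": "2009-11-15T14:12:12", "title": "The quick & <b>brown</b> fox"}
print(with_plain_title(doc))
```

The original HTML stays in title for display and the tag-free copy goes into 
the extra field, at the cost of storing the text twice.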

On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:
>
> Thanks for explaining. So, is there a way to get the text without HTML from 
> the index? I thought I read that it was possible to index without the HTML 
> tags while keeping the source intact. So how would I get at the indexed, 
> tag-free text, if you will?
>
> On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:
>>
>> The field is derived from the source and not generated from the tokens.
>>
>> If we indexed the sentence "The quick brown foxes jumped over the lazy 
>> dogs" with the english analyzer, the tokens would be
>>
>>
>> http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english
>>
>> quick brown fox jump over lazi dog
>>
>> After applying stopwords and stemming, the tokens do not form a sentence 
>> that looks like the original.
>>
>> -- 
>> Ivan
>>
>>
>> On Fri, Aug 8, 2014 at 9:42 AM, IronMike <sabda...@gmail.com> wrote:
>>
>>> Ivan,
>>>
>>> The search results I am showing are for the field "title", not for the 
>>> source. I thought I could query the field rather than the source and see 
>>> it without HTML while the source stayed intact. Did I misunderstand?
>>>
>>>
>>> On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:
>>>
>>>> The analyzers control how text is parsed/tokenized and how terms are 
>>>> indexed in the inverted index. The source document remains untouched.
>>>>
>>>> -- 
>>>> Ivan
>>>>
>>>>
>>>> On Fri, Aug 8, 2014 at 9:24 AM, IronMike <sabda...@gmail.com> wrote:
>>>>
>>>>>  I also used Clint's example and tried to map it to a document and 
>>>>> search the field, but I am still getting HTML in the query results. Here 
>>>>> is my code. I appreciate the help.
>>>>>
>>>>> //Tokenizer
>>>>>
>>>>> PUT /foo/
>>>>> {
>>>>>  "settings": {
>>>>>    "index" : {
>>>>>       "analysis" : {
>>>>>          "analyzer" : {
>>>>>             "test_1" : {
>>>>>                "char_filter" : [
>>>>>                   "html_strip"
>>>>>                ],
>>>>>                "tokenizer" : "standard"
>>>>>             }
>>>>>          }
>>>>>       }
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>>
>>>>> //Mapping
>>>>> PUT /foo/foo_type/_mapping
>>>>> {
>>>>>   "foo_type":{ 
>>>>>          "properties" : {
>>>>>                    "title": {
>>>>>                          "type":"string",
>>>>>                          "index": "analyzed", 
>>>>>                          "analyzer":"test_1"
>>>>>                          }
>>>>>                        }
>>>>>            }
>>>>> }
>>>>>
>>>>>
>>>>> GET /foo/foo_type/_mapping
>>>>> {
>>>>>    "foo": {
>>>>>       "mappings": {
>>>>>          "foo_type": {
>>>>>             "properties": {
>>>>>                "date": {
>>>>>                   "type": "date",
>>>>>                   "format": "dateOptionalTime"
>>>>>                },
>>>>>                "title": {
>>>>>                   "type": "string",
>>>>>                   "analyzer": "test_1"
>>>>>                }
>>>>>             }
>>>>>          }
>>>>>       }
>>>>>    }
>>>>> }
>>>>>
>>>>>
>>>>> ////Index/////////////
>>>>> PUT /foo/foo_type/1
>>>>> {
>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>     "title" : "The quick & <b>brown</b> fox"
>>>>> }
>>>>>
>>>>>
>>>>> //Search //////////
>>>>> GET /foo/_search?pretty=true
>>>>> {
>>>>>    "fields": ["title"], 
>>>>>     "query": {
>>>>>         "query_string": {
>>>>>             "query": "brown",
>>>>>             "analyzer": "test_1"
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> //Results showing html tags still//////
>>>>> "hits": [
>>>>>          {
>>>>>             "_index": "foo",
>>>>>             "_type": "foo_type",
>>>>>             "_id": "1",
>>>>>             "_score": 0.076713204,
>>>>>             "fields": {
>>>>>                "title": [
>>>>>                   "The quick & <b>brown</b> fox" 
>>>>>                ]
>>>>>             }
>>>>>
>>>>>
>>>>>
>>>>> On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:
>>>>>
>>>>>> Have you checked Clint's example?
>>>>>>
>>>>>> https://gist.github.com/clintongormley/780895
>>>>>>
>>>>>> Jörg
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 7, 2014 at 8:23 PM, IronMike <sabda...@gmail.com> wrote:
>>>>>>
>>>>>>>  I would like to strip HTML tags for indexing. Here is a simple 
>>>>>>> example I have tried so far, but it doesn't seem to strip the HTML 
>>>>>>> tags. Any ideas what's missing?
>>>>>>>
>>>>>>> //settings & Mappings
>>>>>>> POST twitter
>>>>>>> {
>>>>>>>   "mappings": {
>>>>>>>     "tweet" : {
>>>>>>>       "properties" : {
>>>>>>>         "message" : {
>>>>>>>           "type" :    "string",
>>>>>>>           "analyzer": "strip_html_analyzer"
>>>>>>>         },
>>>>>>>         "date" : {
>>>>>>>           "type" :   "date"
>>>>>>>         },
>>>>>>>         "name" : {
>>>>>>>           "type" :   "string"
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   },
>>>>>>>   "settings": {
>>>>>>>     "analysis": {
>>>>>>>       "analyzer": {
>>>>>>>         "strip_html_analyzer":{
>>>>>>>             "type":"custom",
>>>>>>>             "tokenizer":"standard",
>>>>>>>             "filter":"standard",
>>>>>>>             "char_filter":"my_html"
>>>>>>>         }
>>>>>>>       },
>>>>>>>       "char_filter": {
>>>>>>>           "my_html":{
>>>>>>>               "type":"html_strip"
>>>>>>>           }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Index a document
>>>>>>> PUT /twitter/tweet/1
>>>>>>> {
>>>>>>>     "name" : "mike",
>>>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>>>     "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //query result for "html", I expect the query to return nothing 
>>>>>>> since it is supposed to strip the tag?
>>>>>>> "hits": {
>>>>>>>       "total": 1,
>>>>>>>       "max_score": 0.11626227,
>>>>>>>       "hits": [
>>>>>>>          {
>>>>>>>             "_index": "twitter",
>>>>>>>             "_type": "tweet",
>>>>>>>             "_id": "1",
>>>>>>>             "_score": 0.11626227,
>>>>>>>             "fields": {
>>>>>>>                "message": [
>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>                ]
>>>>>>>             },
>>>>>>>             "highlight": {
>>>>>>>                "message": [
>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an <em>html</em> test</html>"
>>>>>>>                ]
>>>>>>>             }
>>>>>>>          }
>>>>>>>       ]
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
