Re: how to get char_filter to work?

IronMan2014 Wed, 13 Aug 2014 08:17:50 -0700

Ivan,

A follow-up question. As I mentioned earlier, storing html and applying a 
char_filter doesn't really work for me, especially because highlighted 
fields come back with odd html fragments in the display. 
So I am thinking of stripping the html before indexing, so that there is no 
html in the index or the source, and adding an extra field such as 
"html_content" that is meant to store the html version without being 
indexed. 
Do you see any problems with this approach? One I can see is a larger index 
size. What would you recommend as an ideal solution? I am still confused, 
as I thought this would be a common problem.
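
A minimal sketch of the strip-before-indexing idea, assuming the stripping 
is done on the client side with Python's standard-library html.parser. The 
field name "content" and the helper names are hypothetical; "html_content" 
is the display-only field described above, which would presumably be mapped 
with "index": "no" so that only the plain text is searchable.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self.parts).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_document(html):
    # "content" (hypothetical name) is the plain text to search and highlight;
    # "html_content" keeps the original markup for display only.
    return {"content": strip_html(html), "html_content": html}


doc = build_document(
    "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
)
print(doc["content"])
```

Highlighting on "content" would then return excerpts with no stray tags, at 
the cost of storing the text twice in the source.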


On Friday, August 8, 2014 8:16:09 PM UTC-4, IronMan wrote:
>
> Thanks again. I wasn't expecting it to remove what's between the tags. I 
> believe I understand the behavior now; maybe I was being greedy in 
> expecting Elasticsearch to do it all.
> Here is the scenario I was looking for: assume I want an excerpt of text 
> (extracted text from a document). An Elasticsearch query will give me the 
> excerpt with html tags, but the tags are out of context, so I would have 
> liked to be able to display this excerpt with no html tags. I know I can 
> probably strip the tags after the fact, but that's what I was trying to 
> avoid. In other words, in a perfect world, I would have liked two versions 
> of the document: the original html one and a stripped one. When I need 
> things like excerpts, I would query the stripped one, and when I need the 
> html, I would query the source. Hopefully I didn't make this more 
> confusing.
>
> On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:
>>
>> The tokens that appear in the analyze API are the ones that are put into 
>> the inverted index. When you search for one of the terms that is not an 
>> HTML tag, there will be a match. What I don't understand, after reading 
>> your original post in detail, is exactly what behavior you are expecting.
>>
>> You indexed the phrase
>> <html>trying out <b>Elasticsearch</b>, This is an html test</html>
>>
>> but you expected a query for the term "html" to not match. However, the 
>> word "html" is clearly in the content. The html stripper will not remove 
>> the contents in between the tags, just the tags themselves. The analyze 
>> API should show you the correct terms.
>>
>> Lucene has more control over what information you can retrieve, but the 
>> only way to get the analyzed token stream back from Elasticsearch is to use 
>> the analyze API on the field. Most people do not want an analyzed token 
>> stream, just the original field.
>>
>> -- 
>> Ivan
>>
>>
>> On Fri, Aug 8, 2014 at 12:01 PM, IronMike <sabda...@gmail.com> wrote:
>>
>>> Also, here is a link to someone who had the same problem; I am not sure 
>>> if there was a final answer to that one: 
>>> http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
>>>
>>> I have to admit that I am a bit confused about this topic now. I 
>>> understand that analyzers tokenize the sentence and, in the case of 
>>> html_strip, strip the html, and _analyze works fine using the analyzer. 
>>> What I am failing to understand is how I can get at the results of these 
>>> tokens. Isn't the whole idea to be able to search for those tokens 
>>> eventually?
>>>
>>> If not, what's the solution for what I would think is a common scenario: 
>>> indexing html documents where the html tags don't need to be indexed, 
>>> while keeping the original html for presentational purposes? Any ideas 
>>> (besides having to strip html tags manually before indexing)?
>>>
>>>
>>> On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:
>>>>
>>>> Thanks for explaining. So, is there a way to get the non-html text 
>>>> from the index? I thought I read that it was possible to index without 
>>>> the html tags while keeping the source intact. So how would I get at 
>>>> the indexed version without the html tags, if you will?
>>>>
>>>> On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:
>>>>>
>>>>> The field is derived from the source and not generated from the tokens.
>>>>>
>>>>> If we indexed the sentence "The quick brown foxes jumped over the lazy 
>>>>> dogs" with the english analyzer, the tokens would be
>>>>>
>>>>> http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english
>>>>>
>>>>> quick brown fox jump over lazi dog
>>>>>
>>>>> After applying stopwords and stemming, the tokens do not form a 
>>>>> sentence that looks like the original.
>>>>>
>>>>> -- 
>>>>> Ivan
>>>>>
>>>>>
>>>>> On Fri, Aug 8, 2014 at 9:42 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>
>>>>>> Ivan,
>>>>>>
>>>>>> The search results I am showing are for the field "title", not for 
>>>>>> the source. I thought I could query the field rather than the source 
>>>>>> and see it with no html while the source stayed intact. Did I 
>>>>>> misunderstand?
>>>>>>
>>>>>>
>>>>>> On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:
>>>>>>
>>>>>>> The analyzers control how text is parsed/tokenized and how terms are 
>>>>>>> indexed in the inverted index. The source document remains untouched.
>>>>>>>
>>>>>>> -- 
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 8, 2014 at 9:24 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>>>
>>>>>>>>  I also used Clint's example and tried to map it to a document and 
>>>>>>>> search the field, but I am still getting html in the query results. 
>>>>>>>> Here is my code. I appreciate the help.
>>>>>>>>
>>>>>>>> //Analyzer settings
>>>>>>>>
>>>>>>>> PUT /foo/
>>>>>>>> {
>>>>>>>>  "settings": {
>>>>>>>>    "index" : {
>>>>>>>>       "analysis" : {
>>>>>>>>          "analyzer" : {
>>>>>>>>             "test_1" : {
>>>>>>>>                "char_filter" : [
>>>>>>>>                   "html_strip"
>>>>>>>>                ],
>>>>>>>>                "tokenizer" : "standard"
>>>>>>>>             }
>>>>>>>>          }
>>>>>>>>       }
>>>>>>>>    }
>>>>>>>>  }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Mapping
>>>>>>>> PUT /foo/foo_type/_mapping
>>>>>>>> {
>>>>>>>>   "foo_type": {
>>>>>>>>     "properties": {
>>>>>>>>       "title": {
>>>>>>>>         "type": "string",
>>>>>>>>         "index": "analyzed",
>>>>>>>>         "analyzer": "test_1"
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> GET /foo/foo_type/_mapping
>>>>>>>> {
>>>>>>>>    "foo": {
>>>>>>>>       "mappings": {
>>>>>>>>          "foo_type": {
>>>>>>>>             "properties": {
>>>>>>>>                "date": {
>>>>>>>>                   "type": "date",
>>>>>>>>                   "format": "dateOptionalTime"
>>>>>>>>                },
>>>>>>>>                "title": {
>>>>>>>>                   "type": "string",
>>>>>>>>                   "analyzer": "test_1"
>>>>>>>>                }
>>>>>>>>             }
>>>>>>>>          }
>>>>>>>>       }
>>>>>>>>    }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> ////Index/////////////
>>>>>>>> PUT /foo/foo_type/1
>>>>>>>> {
>>>>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>>>>     "title" : "The quick & <b>brown</b> fox"
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Search //////////
>>>>>>>> GET /foo/_search?pretty=true
>>>>>>>> {
>>>>>>>>    "fields": ["title"], 
>>>>>>>>     "query": {
>>>>>>>>         "query_string": {
>>>>>>>>             "query": "brown",
>>>>>>>>             "analyzer": "test_1"
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Results showing html tags still//////
>>>>>>>> "hits": [
>>>>>>>>          {
>>>>>>>>             "_index": "foo",
>>>>>>>>             "_type": "foo_type",
>>>>>>>>             "_id": "1",
>>>>>>>>             "_score": 0.076713204,
>>>>>>>>             "fields": {
>>>>>>>>                "title": [
>>>>>>>>                   "The quick & <b>brown</b> fox" 
>>>>>>>>                ]
>>>>>>>>             }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:
>>>>>>>>
>>>>>>>>> Have you checked Clint's example?
>>>>>>>>>
>>>>>>>>> https://gist.github.com/clintongormley/780895
>>>>>>>>>  
>>>>>>>>> Jörg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 7, 2014 at 8:23 PM, IronMike <sabda...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>  I would like to strip html tags for indexing. Here is a simple 
>>>>>>>>>> example I have tried so far, but it doesn't seem to strip the html 
>>>>>>>>>> tags. Any ideas what's missing?
>>>>>>>>>>
>>>>>>>>>> //settings & Mappings
>>>>>>>>>> POST twitter
>>>>>>>>>> {
>>>>>>>>>>   "mappings": {
>>>>>>>>>>     "tweet" : {
>>>>>>>>>>       "properties" : {
>>>>>>>>>>         "message" : {
>>>>>>>>>>           "type" :    "string",
>>>>>>>>>>           "analyzer": "strip_html_analyzer"
>>>>>>>>>>         },
>>>>>>>>>>         "date" : {
>>>>>>>>>>           "type" :   "date"
>>>>>>>>>>         },
>>>>>>>>>>         "name" : {
>>>>>>>>>>           "type" :   "string"
>>>>>>>>>>         }
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   },
>>>>>>>>>>   "settings": {
>>>>>>>>>>     "analysis": {
>>>>>>>>>>       "analyzer": {
>>>>>>>>>>         "strip_html_analyzer":{
>>>>>>>>>>             "type":"custom",
>>>>>>>>>>             "tokenizer":"standard",
>>>>>>>>>>             "filter":"standard",
>>>>>>>>>>             "char_filter":"my_html"
>>>>>>>>>>         }
>>>>>>>>>>       },
>>>>>>>>>>       "char_filter": {
>>>>>>>>>>           "my_html":{
>>>>>>>>>>               "type":"html_strip"
>>>>>>>>>>           }
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Index a document
>>>>>>>>>> PUT /twitter/tweet/1
>>>>>>>>>> {
>>>>>>>>>>     "name" : "mike",
>>>>>>>>>>     "date" : "2009-11-15T14:12:12",
>>>>>>>>>>     "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //query result for "html"; I expected the query to return nothing, 
>>>>>>>>>> //since it is supposed to strip the tag?
>>>>>>>>>> "hits": {
>>>>>>>>>>       "total": 1,
>>>>>>>>>>       "max_score": 0.11626227,
>>>>>>>>>>       "hits": [
>>>>>>>>>>          {
>>>>>>>>>>             "_index": "twitter",
>>>>>>>>>>             "_type": "tweet",
>>>>>>>>>>             "_id": "1",
>>>>>>>>>>             "_score": 0.11626227,
>>>>>>>>>>             "fields": {
>>>>>>>>>>                "message": [
>>>>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>>                ]
>>>>>>>>>>             },
>>>>>>>>>>             "highlight": {
>>>>>>>>>>                "message": [
>>>>>>>>>>                   "<html>trying out <b>Elasticsearch</b>, This is an <em>html</em> test</html>"
>>>>>>>>>>                ]
>>>>>>>>>>             }
>>>>>>>>>>          }
>>>>>>>>>>       ]
>>>>>>>>>>    }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  -- 
>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>> Google Groups "elasticsearch" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>>>
>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b3
>>>>>>>>>> 8-4646-bc8f-a27896454515%40googlegroups.com 
>>>>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
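
For reference, Ivan's point quoted above (the stripper removes the tags but 
keeps the text between them, so the word "html" in the sentence is still 
indexed) can be reproduced outside Elasticsearch with a rough sketch. The 
regex and the \w+ split are crude stand-ins chosen for illustration, not 
Elasticsearch's html_strip or standard tokenizer, and the function name is 
hypothetical.

```python
import re


def approximate_tokens(html):
    # Crude approximation only: drop anything that looks like a tag, then
    # split the remaining text into word-like tokens.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"\w+", text)


tokens = approximate_tokens(
    "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
)
print(tokens)
# The tags are gone, but the word "html" from "an html test" remains a token.
print("html" in tokens)
```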

