Thanks again. I wasn't expecting it to remove what's between the tags. I believe I understand the behavior now, and maybe it's a case of me being greedy and expecting Elasticsearch to do it all.

Here is the scenario I had in mind: assume I want an excerpt of text (extracted from a document). An Elasticsearch query will give me the excerpt with html tags, but the tags are out of context, so I would have liked to be able to display the excerpt with no html tags. I know I can probably strip the tags after the fact, but that's what I was trying to avoid.

In other words, in a perfect world I would have liked two versions of the document: the original html one and a stripped one. When I need things like excerpts, I would query the stripped one, and when I need the html, I would query the source.

Hopefully I didn't make this more confusing.
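
To make the "two versions" idea concrete, here is a minimal sketch, assuming the tags are stripped on the client before indexing rather than by Elasticsearch. The field names `title_html` and `title_text` and the helpers `strip_tags` and `build_document` are hypothetical, not part of any Elasticsearch API; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes; the tags themselves are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the text content of an html fragment, without tags."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_document(title_html):
    # Hypothetical field names: keep the original markup next to a
    # plain-text copy, so either one can be requested at query time.
    return {"title_html": title_html, "title_text": strip_tags(title_html)}


doc = build_document("The quick & <b>brown</b> fox")
print(doc["title_text"])  # The quick & brown fox
```

A query for excerpts would then ask for `title_text`, while `title_html` keeps the markup for presentation. Words between the tags are kept; only the tags are dropped.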

On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:
>
> The tokens that appear in the analyze API are the ones that are put into
> the inverted index. When you search for one of the terms that is not an
> HTML tag, there will be a match. What I don't understand, after reading
> your original post in detail, is exactly what behavior you are expecting.
>
> You indexed the phrase
> <html>trying out <b>Elasticsearch</b>, This is an html test</html>
>
> but you expected a query for the term "html" to not match. However, the
> word "html" is clearly in the content. The html stripper will not remove
> the contents in between the tags, just the tags themselves. The analyze API
> should show you the correct terms.
>
> Lucene has more control over what information you can retrieve, but the
> only way to get the analyzed token stream back from Elasticsearch is to use
> the analyze API on the field. Most people do not want an analyzed token
> stream, just the original field.
>
> --
> Ivan
>
>
> On Fri, Aug 8, 2014 at 12:01 PM, IronMike <sabda...@gmail.com> wrote:
>
>> Also, here is a link for someone who had the same problem; I am not sure
>> if there was a final answer to that one.
>> http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
>> I have to admit that I am a bit confused now about this topic. I
>> understand analyzers will tokenize the sentence and strip html in the case
>> of html_strip, and _analyze works fine using the analyzer. What I am
>> failing to understand is how I can get the results of these tokens. Isn't
>> the whole idea to be able to search for those tokens eventually?
>>
>> If not, what's the solution for what I would think is a common scenario:
>> having to index html documents, where html tags don't need to be indexed,
>> while keeping the original html for presentational purposes? Any ideas
>> (besides having to strip html tags manually before indexing)?
>>
>>
>> On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:
>>>
>>> Thanks for explaining. So, is there a way to get the non-html text
>>> from the index? I thought I read that it was possible to index without the
>>> html tags while keeping the source intact. So, how would I get at the
>>> indexed text without html tags, if you will?
>>>
>>> On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:
>>>>
>>>> The field is derived from the source and not generated from the tokens.
>>>>
>>>> If we indexed the sentence "The quick brown foxes jumped over the lazy
>>>> dogs" with the english analyzer, the tokens would be
>>>>
>>>> http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english
>>>>
>>>> quick brown fox jump over lazi dog
>>>>
>>>> After applying stopwords and stemming, the tokens do not form a
>>>> sentence that looks like the original.
>>>>
>>>> --
>>>> Ivan
>>>>
>>>>
>>>> On Fri, Aug 8, 2014 at 9:42 AM, IronMike <sabda...@gmail.com> wrote:
>>>>
>>>>> Ivan,
>>>>>
>>>>> The search results I am showing are for the field "title", not for the
>>>>> source. I thought I could query the field, not the source, and look at it
>>>>> with no html while the source was intact. Did I misunderstand?
>>>>>
>>>>>
>>>>> On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:
>>>>>
>>>>>> The analyzers control how text is parsed/tokenized and how terms are
>>>>>> indexed in the inverted index. The source document remains untouched.
>>>>>>
>>>>>> --
>>>>>> Ivan
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 8, 2014 at 9:24 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>>
>>>>>>> I also used Clint's example and tried to map it to a document and
>>>>>>> search the field, but I am still getting html in the query results.
>>>>>>> Here is my code. I appreciate the help.
>>>>>>>
>>>>>>> //Tokenizer
>>>>>>>
>>>>>>> PUT /foo/
>>>>>>> {
>>>>>>>   "settings": {
>>>>>>>     "index" : {
>>>>>>>       "analysis" : {
>>>>>>>         "analyzer" : {
>>>>>>>           "test_1" : {
>>>>>>>             "char_filter" : [
>>>>>>>               "html_strip"
>>>>>>>             ],
>>>>>>>             "tokenizer" : "standard"
>>>>>>>           }
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Mapping
>>>>>>> PUT /foo/foo_type/_mapping
>>>>>>> {
>>>>>>>   "foo_type": {
>>>>>>>     "properties" : {
>>>>>>>       "title": {
>>>>>>>         "type": "string",
>>>>>>>         "index": "analyzed",
>>>>>>>         "analyzer": "test_1"
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> GET /foo/foo_type/_mapping
>>>>>>> {
>>>>>>>   "foo": {
>>>>>>>     "mappings": {
>>>>>>>       "foo_type": {
>>>>>>>         "properties": {
>>>>>>>           "date": {
>>>>>>>             "type": "date",
>>>>>>>             "format": "dateOptionalTime"
>>>>>>>           },
>>>>>>>           "title": {
>>>>>>>             "type": "string",
>>>>>>>             "analyzer": "test_1"
>>>>>>>           }
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Index
>>>>>>> PUT /foo/foo_type/1
>>>>>>> {
>>>>>>>   "date" : "2009-11-15T14:12:12",
>>>>>>>   "title" : "The quick & <b>brown</b> fox"
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Search
>>>>>>> GET /foo/_search?pretty=true
>>>>>>> {
>>>>>>>   "fields": ["title"],
>>>>>>>   "query": {
>>>>>>>     "query_string": {
>>>>>>>       "query": "brown",
>>>>>>>       "analyzer": "test_1"
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //Results still showing html tags
>>>>>>> "hits": [
>>>>>>>   {
>>>>>>>     "_index": "foo",
>>>>>>>     "_type": "foo_type",
>>>>>>>     "_id": "1",
>>>>>>>     "_score": 0.076713204,
>>>>>>>     "fields": {
>>>>>>>       "title": [
>>>>>>>         "The quick & <b>brown</b> fox"
>>>>>>>       ]
>>>>>>>     }
>>>>>>>   }
>>>>>>> ]
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:
>>>>>>>
>>>>>>>> Have you checked Clint's example?
>>>>>>>>
>>>>>>>> https://gist.github.com/clintongormley/780895
>>>>>>>>
>>>>>>>> Jörg
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 7, 2014 at 8:23 PM, IronMike <sabda...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I would like to strip html tags for indexing. Here is a simple
>>>>>>>>> example I tried so far, but it doesn't seem to strip html tags. Any
>>>>>>>>> ideas what's missing?
>>>>>>>>>
>>>>>>>>> //Settings & mappings
>>>>>>>>> POST twitter
>>>>>>>>> {
>>>>>>>>>   "mappings": {
>>>>>>>>>     "tweet" : {
>>>>>>>>>       "properties" : {
>>>>>>>>>         "message" : {
>>>>>>>>>           "type" : "string",
>>>>>>>>>           "analyzer": "strip_html_analyzer"
>>>>>>>>>         },
>>>>>>>>>         "date" : {
>>>>>>>>>           "type" : "date"
>>>>>>>>>         },
>>>>>>>>>         "name" : {
>>>>>>>>>           "type" : "string"
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   },
>>>>>>>>>   "settings": {
>>>>>>>>>     "analysis": {
>>>>>>>>>       "analyzer": {
>>>>>>>>>         "strip_html_analyzer": {
>>>>>>>>>           "type": "custom",
>>>>>>>>>           "tokenizer": "standard",
>>>>>>>>>           "filter": "standard",
>>>>>>>>>           "char_filter": "my_html"
>>>>>>>>>         }
>>>>>>>>>       },
>>>>>>>>>       "char_filter": {
>>>>>>>>>         "my_html": {
>>>>>>>>>           "type": "html_strip"
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> //Index a document
>>>>>>>>> PUT /twitter/tweet/1
>>>>>>>>> {
>>>>>>>>>   "name" : "mike",
>>>>>>>>>   "date" : "2009-11-15T14:12:12",
>>>>>>>>>   "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> //Query result for "html". I expected the query to return nothing,
>>>>>>>>> since it is supposed to strip the tag?
>>>>>>>>> "hits": {
>>>>>>>>>   "total": 1,
>>>>>>>>>   "max_score": 0.11626227,
>>>>>>>>>   "hits": [
>>>>>>>>>     {
>>>>>>>>>       "_index": "twitter",
>>>>>>>>>       "_type": "tweet",
>>>>>>>>>       "_id": "1",
>>>>>>>>>       "_score": 0.11626227,
>>>>>>>>>       "fields": {
>>>>>>>>>         "message": [
>>>>>>>>>           "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>         ]
>>>>>>>>>       },
>>>>>>>>>       "highlight": {
>>>>>>>>>         "message": [
>>>>>>>>>           "<html>trying out <b>Elasticsearch</b>, This is an <em>html</em> test</html>"
>>>>>>>>>         ]
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   ]
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "elasticsearch" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.