Ivan,

A follow-up question. As I mentioned earlier, storing the HTML and applying a char filter doesn't really work for me, especially since highlighted fields come back with HTML fragments that display badly. So I am thinking of stripping the HTML before indexing, so that there is no HTML in the index or the source, and adding an extra field such as "html_content" that is meant to store the HTML version and not be indexed.

Do you see any problems with my approach? One I can see is a bigger index size. What do you recommend as an ideal solution? I am still confused, as I thought this would be a common problem.
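To make the idea concrete, here is a minimal sketch of what I have in mind on the indexing side, assuming a Python client and only the standard library for the stripping. The field names "content" and "html_content", the helper names, and the mapping are my own assumptions for illustration (the mapping uses the 1.x-style "index": "no" on a string field); this is not something taken from the Elasticsearch docs.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes entities
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string, with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def build_document(html):
    """Build the document to index: plain text to search, raw HTML to display.

    "content" and "html_content" are hypothetical field names.
    """
    return {"content": strip_html(html), "html_content": html}


# Hypothetical mapping: "content" is analyzed as usual, while "html_content"
# is kept only for retrieval and never put into the inverted index.
MAPPING = {
    "properties": {
        "content": {"type": "string"},
        "html_content": {"type": "string", "index": "no"},
    }
}

doc = build_document(
    "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
)
print(doc["content"])
```

With this, queries and highlighting would run against "content" only, and "html_content" would be read back when the original markup is needed. One caveat I am aware of: joining the text nodes directly can glue words together when tags are the only separator (for example adjacent paragraphs), so a real version would probably need to insert whitespace for block-level tags.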
On Friday, August 8, 2014 8:16:09 PM UTC-4, IronMan wrote:
>
> Thanks again. I wasn't expecting it to remove what's between the tags. I
> believe I understand the behavior and maybe it's the case where I was greedy
> and expecting ElasticSearch to do it all.
> Here is a scenario that I was looking for: Assume I am looking to get an
> excerpt of text (Extracted text from a document), Elastic Search query will
> give me excerpt with html tags, but the tags are out of context, so I would
> have liked to be able to display this excerpt with no html tags, I know I can
> probably strip the tags after the fact, but that's what I was trying to
> avoid. In other words, in a perfect world, I would have liked 2 versions
> of the document, the original html one and another stripped one. When I
> need to query things like excerpts, I would query the stripped one, and
> when I needed the html, I would query the source. Hopefully I didn't make
> this more confusing.
>
> On Friday, August 8, 2014 4:58:03 PM UTC-4, Ivan Brusic wrote:
>>
>> The tokens that appear in the analyze API are the ones that are put into
>> the inverted index. When you search for one of the terms that is not an
>> HTML tag, there will be a match. What I don't understand after reading in
>> detail your original, is exactly what behavior you are expecting.
>>
>> You indexed the phrase
>> <html>trying out <b>Elasticsearch</b>, This is an html test</html>
>>
>> but you expected a query for the term "html" to not match. However, the
>> word "html" is clearly in the content. The html stripper will not remove
>> the contents in between the tags, just the tags themselves. The analyze API
>> should show you the correct term.
>>
>> Lucene has more control over what information you can retrieve, but the
>> only way to get the analyzed token stream back from Elasticsearch is to use
>> the analyze API on the field. Most people do not want an analyzed token
>> stream, just the original field.
>>
>> --
>> Ivan
>>
>>
>> On Fri, Aug 8, 2014 at 12:01 PM, IronMike <sabda...@gmail.com> wrote:
>>
>>> Also, Here is a link for someone who had the same problem, I am not sure
>>> if there was a final answer to that one.
>>> http://grokbase.com/t/gg/elasticsearch/126r4kv8tx/problem-with-standard-html-strip
>>> ,
>>> I have to admit that I am a bit confused now about this topic. I
>>> understand analyzers will tokenize the sentence and strip html in the case
>>> of the html_strip, and _analyze works fine using the analyzer, what I am
>>> failing to understand, is how can I get the results of these tokens. Isn't
>>> the whole idea to be able to search for those tokens eventually?
>>>
>>> If not, what's the solution of what I would think is a common scenario,
>>> having to index html documents, where html tags don't need to be indexed,
>>> while keeping the original html for presentational purpose? Any ideas
>>> (Besides having to strip html tags manually before indexing)?
>>>
>>>
>>> On Friday, August 8, 2014 1:02:07 PM UTC-4, IronMike wrote:
>>>>
>>>> Thanks for explaining. So, is there a way to be able to get non html
>>>> from the index? I thought I read that it was possible to index without the
>>>> html tags while keeping source intact. So, how would I get at the index
>>>> with non html tags if you will?
>>>>
>>>> On Friday, August 8, 2014 12:52:37 PM UTC-4, Ivan Brusic wrote:
>>>>>
>>>>> The field is derived from the source and not generated from the tokens.
>>>>>
>>>>> If we indexed the sentence "The quick brown foxes jumped over the lazy
>>>>> dogs" with the english analyzer, the tokens would be
>>>>>
>>>>> http://localhost:9200/_analyze?text=The%20quick%20brown%20foxes%20jumped%20over%20the%20lazy%20dogs&analyzer=english
>>>>>
>>>>> quick brown fox jump over lazi dog
>>>>>
>>>>> After applying stopwords and stemming, the tokens do not form a
>>>>> sentence that looks like the original.
>>>>>
>>>>> --
>>>>> Ivan
>>>>>
>>>>>
>>>>> On Fri, Aug 8, 2014 at 9:42 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>
>>>>>> Ivan,
>>>>>>
>>>>>> The search results I am showing are for the field "title" not for the
>>>>>> source. I thought I could query the field not the source and look at it
>>>>>> with no html while the source was intact. Did I misunderstand?
>>>>>>
>>>>>>
>>>>>> On Friday, August 8, 2014 12:36:16 PM UTC-4, Ivan Brusic wrote:
>>>>>>
>>>>>>> The analyzers control how text is parsed/tokenized and how terms are
>>>>>>> indexed in the inverted index. The source document remains untouched.
>>>>>>>
>>>>>>> --
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 8, 2014 at 9:24 AM, IronMike <sabda...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I also used Clint's example and tried to map it to a document and
>>>>>>>> search the field, but still getting html in query results... Here is
>>>>>>>> my code. I appreciate the help.
>>>>>>>>
>>>>>>>> //Tokenizer
>>>>>>>>
>>>>>>>> PUT /foo/
>>>>>>>> {
>>>>>>>>   "settings": {
>>>>>>>>     "index" : {
>>>>>>>>       "analysis" : {
>>>>>>>>         "analyzer" : {
>>>>>>>>           "test_1" : {
>>>>>>>>             "char_filter" : [
>>>>>>>>               "html_strip"
>>>>>>>>             ],
>>>>>>>>             "tokenizer" : "standard"
>>>>>>>>           }
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Mapping
>>>>>>>> PUT /foo/foo_type/_mapping
>>>>>>>> {
>>>>>>>>   "foo_type":{
>>>>>>>>     "properties" : {
>>>>>>>>       "title": {
>>>>>>>>         "type":"string",
>>>>>>>>         "index": "analyzed",
>>>>>>>>         "analyzer":"test_1"
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> GET /foo/foo_type/_mapping
>>>>>>>> {
>>>>>>>>   "foo": {
>>>>>>>>     "mappings": {
>>>>>>>>       "foo_type": {
>>>>>>>>         "properties": {
>>>>>>>>           "date": {
>>>>>>>>             "type": "date",
>>>>>>>>             "format": "dateOptionalTime"
>>>>>>>>           },
>>>>>>>>           "title": {
>>>>>>>>             "type": "string",
>>>>>>>>             "analyzer": "test_1"
>>>>>>>>           }
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> ////Index/////////////
>>>>>>>> PUT /foo/foo_type/1
>>>>>>>> {
>>>>>>>>   "date" : "2009-11-15T14:12:12",
>>>>>>>>   "title" : "The quick & <b>brown</b> fox"
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Search //////////
>>>>>>>> GET /foo/_search?pretty=true
>>>>>>>> {
>>>>>>>>   "fields": ["title"],
>>>>>>>>   "query": {
>>>>>>>>     "query_string": {
>>>>>>>>       "query": "brown",
>>>>>>>>       "analyzer": "test_1"
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //Results showing html tags still//////
>>>>>>>> "hits": [
>>>>>>>>   {
>>>>>>>>     "_index": "foo",
>>>>>>>>     "_type": "foo_type",
>>>>>>>>     "_id": "1",
>>>>>>>>     "_score": 0.076713204,
>>>>>>>>     "fields": {
>>>>>>>>       "title": [
>>>>>>>>         "The quick & <b>brown</b> fox"
>>>>>>>>       ]
>>>>>>>>     }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, August 7, 2014 6:06:56 PM UTC-4, Jörg Prante wrote:
>>>>>>>>
>>>>>>>>> Have you checked Clint's example?
>>>>>>>>>
>>>>>>>>> https://gist.github.com/clintongormley/780895
>>>>>>>>>
>>>>>>>>> Jörg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 7, 2014 at 8:23 PM, IronMike <sabda...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I would like to strip html tags for indexing. Here is a simple
>>>>>>>>>> example I tried so far, but doesn't seem to strip html tags. Any
>>>>>>>>>> ideas what's missing?
>>>>>>>>>>
>>>>>>>>>> //settings & Mappings
>>>>>>>>>> POST twitter
>>>>>>>>>> {
>>>>>>>>>>   "mappings": {
>>>>>>>>>>     "tweet" : {
>>>>>>>>>>       "properties" : {
>>>>>>>>>>         "message" : {
>>>>>>>>>>           "type" : "string",
>>>>>>>>>>           "analyzer": "strip_html_analyzer"
>>>>>>>>>>         },
>>>>>>>>>>         "date" : {
>>>>>>>>>>           "type" : "date"
>>>>>>>>>>         },
>>>>>>>>>>         "name" : {
>>>>>>>>>>           "type" : "string"
>>>>>>>>>>         }
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   },
>>>>>>>>>>   "settings": {
>>>>>>>>>>     "analysis": {
>>>>>>>>>>       "analyzer": {
>>>>>>>>>>         "strip_html_analyzer":{
>>>>>>>>>>           "type":"custom",
>>>>>>>>>>           "tokenizer":"standard",
>>>>>>>>>>           "filter":"standard",
>>>>>>>>>>           "char_filter":"my_html"
>>>>>>>>>>         }
>>>>>>>>>>       },
>>>>>>>>>>       "char_filter": {
>>>>>>>>>>         "my_html":{
>>>>>>>>>>           "type":"html_strip"
>>>>>>>>>>         }
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Index a document
>>>>>>>>>> PUT /twitter/tweet/1
>>>>>>>>>> {
>>>>>>>>>>   "name" : "mike",
>>>>>>>>>>   "date" : "2009-11-15T14:12:12",
>>>>>>>>>>   "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //query result for "html", I expect the query to return nothing
>>>>>>>>>> since it is supposed to strip the tag?
>>>>>>>>>> "hits": {
>>>>>>>>>>   "total": 1,
>>>>>>>>>>   "max_score": 0.11626227,
>>>>>>>>>>   "hits": [
>>>>>>>>>>     {
>>>>>>>>>>       "_index": "twitter",
>>>>>>>>>>       "_type": "tweet",
>>>>>>>>>>       "_id": "1",
>>>>>>>>>>       "_score": 0.11626227,
>>>>>>>>>>       "fields": {
>>>>>>>>>>         "message": [
>>>>>>>>>>           "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
>>>>>>>>>>         ]
>>>>>>>>>>       },
>>>>>>>>>>       "highlight": {
>>>>>>>>>>         "message": [
>>>>>>>>>>           "<html>trying out <b>Elasticsearch</b>, This is an <em>html</em> test</html>"
>>>>>>>>>>         ]
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   ]
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "elasticsearch" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>>>>
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com
>>>>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/517fe8b8-0b38-4646-bc8f-a27896454515%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "elasticsearch" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/a831f6f4-b47c-4c35-a40b-058e3c1b1043%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/a831f6f4-b47c-4c35-a40b-058e3c1b1043%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/28cbd510-d31c-4ab1-bd4a-6a87eade7953%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.