Re: problem indexing with my analyzer
Additional information: my note_source field contains pictures (.jpg, .png, ...) encoded in base64, as well as text. For the mapping I used: type = string, analyzer = reuters (the name of my analyzer). Any ideas?

-- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5d93217c-bded-40fa-8fd2-fdac576c57ee%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: problem indexing with my analyzer
Does that mean you are applying the reuters analyzer to your base64-encoded pictures? I guess it generates a really huge number of tokens for each entry because of your nGram filter (with a max of 250).

Cédric Hourcade c...@wal.fr
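To put rough numbers on the token explosion, here is a small sketch that counts the n-grams produced from a single token. The helper name countNgrams and the 200-character token are made up for illustration, and the sketch assumes the filter emits every substring whose length lies between min_gram and max_gram; it is not taken from the Elasticsearch source.

```php
<?php
// Hypothetical helper (not part of Elasticsearch or its PHP client):
// counts the substrings of length $minGram..$maxGram in a token of
// $tokenLength characters, assuming every such substring is emitted.
function countNgrams(int $tokenLength, int $minGram, int $maxGram): int
{
    $total = 0;
    $longest = min($maxGram, $tokenLength);
    for ($n = $minGram; $n <= $longest; $n++) {
        // A token of length L contains L - n + 1 substrings of length n.
        $total += $tokenLength - $n + 1;
    }
    return $total;
}

// An ordinary 8-letter word with the 3..250 range from the thread.
echo countNgrams(8, 3, 250), "\n";    // 21
// An assumed 200-character run of base64 with the same range.
echo countNgrams(200, 3, 250), "\n";  // 19701
// The same token with narrower ranges.
echo countNgrams(200, 3, 20), "\n";   // 3411
echo countNgrams(200, 3, 4), "\n";    // 395
```

Under these assumptions a single long base64 token costs about as much as a thousand ordinary words, which fits the guess above.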
Re: problem indexing with my analyzer
Yes, I am applying reuters to my whole document (composed of text and pictures). My goal is to search the text of the document for any word or part of a word. Yes, the problem is my nGram filter. How do I solve it? Decrease the nGram max? Switch to another analyzer that still meets my goal?
Re: problem indexing with my analyzer
I set max_gram=20. It's better, but towards the end I get this many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm ] [ik-test2] [gc][young][528][263] duration [2s], collections [1]/[2.1s], total [2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] [22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] [513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G. I think that's enough. Is something wrong?
Re: problem indexing with my analyzer
The user copies and pastes the content of an HTML page, and I index that information. I take the entire document, images included. I can't change this behavior.

I set max_gram=20. It's better, but towards the end I get this many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm ] [ik-test2] [gc][young][528][263] duration [2s], collections [1]/[2.1s], total [2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] [22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] [513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G. I think that's enough. Is something wrong?

On Friday, 20 June 2014 11:45:22 UTC+2, Cédric Hourcade wrote:

If you are only searching the text, you should index the images in a separate field, with no analyzer (index: not_analyzed) or, even better, with index: no (not indexed). If you need to retrieve the image data, it's still in the _source. But to be honest, I wouldn't even store this kind of information in ES: your index is going to be bigger, merges are going to be slower... I'd keep the binary files stored elsewhere.

Cédric Hourcade c...@wal.fr
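As a rough illustration of the separate-field advice, the sketch below builds a mapping in the same array style as the settings in the original post. The field names note_text and note_images and the index name my_index are invented for this example, and it assumes the string type and index option of the Elasticsearch 1.x mapping format discussed in this thread.

```php
<?php
// Sketch only: builds the parameter array for a put-mapping request.
// The field names note_text and note_images are hypothetical.
function buildNoteMapping(string $index, string $type): array
{
    return array(
        'index' => $index,
        'type'  => $type,
        'body'  => array(
            $type => array(
                'properties' => array(
                    // Searchable text, analyzed with the custom analyzer.
                    'note_text' => array(
                        'type'     => 'string',
                        'analyzer' => 'reuters',
                    ),
                    // Base64 image data: still returned in _source, never indexed.
                    'note_images' => array(
                        'type'  => 'string',
                        'index' => 'no',
                    ),
                ),
            ),
        ),
    );
}

$params = buildNoteMapping('my_index', 'source');
// With an elasticsearch-php client this array would be passed to
// $client->indices()->putMapping($params);
echo json_encode($params['body'], JSON_PRETTY_PRINT), "\n";
```

Splitting the text from the images has to happen before indexing, since the mapping alone cannot separate the two parts of one string.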
Re: problem indexing with my analyzer
If your base64 strings are long, the standard tokenizer is going to split them into a lot of tokens. These tokens will often be much longer than ordinary words, so your nGram filter will generate even more tokens, far more than with ordinary text. That may be your problem. You should really try to strip the encoded images from your documents with a simple regex before indexing them. If you need to keep the source, put the raw text in an unindexed field and the cleaned text in another.
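A minimal sketch of such a regex, assuming the pasted HTML embeds its pictures as data URIs of the form data:image/<type>;base64,<payload>. That form is an assumption about the data, not something stated in the thread, and the function name stripInlineImages and the field name note_source_raw are made up for the example.

```php
<?php
// Hypothetical pre-processing step: removes inline base64 images written as
// data:image/<type>;base64,<payload> before the text is sent for indexing.
function stripInlineImages(string $html): string
{
    // Payload: base64 alphabet followed by optional '=' padding.
    $pattern = '~data:image/[a-z0-9.+-]+;base64,[A-Za-z0-9+/]+=*~i';
    $cleaned = preg_replace($pattern, '', $html);
    // preg_replace returns null on error; fall back to the input in that case.
    return $cleaned ?? $html;
}

$raw = '<p>Meeting notes</p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg==">';

$doc = array(
    'note_source_raw' => $raw,                    // keep as-is, unindexed
    'note_source'     => stripInlineImages($raw), // analyze this one
);

echo $doc['note_source'], "\n"; // <p>Meeting notes</p><img src="">
```

If the images are embedded some other way, the pattern would need to be adapted to that format.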
Re: problem indexing with my analyzer
Thank you, Cédric Hourcade!
Re: problem indexing with my analyzer
You seriously don't want n-grams of length 3 to 250. That's ENORMOUS. Typically you set min and max to 3 or 4, and that's it.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
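For illustration, here is how the settings from the original post might look with a narrow range. This is only a sketch: the filter name short_ngram and the index name my_index are invented, the token_chars entry is left out, and the remaining values are the ones posted in the thread.

```php
<?php
// Sketch of the index settings from the original post with a narrow n-gram
// range. The filter name "short_ngram" is hypothetical; the analyzer name
// "reuters" and the shard/replica values come from the thread.
function buildIndexSettings(string $index, int $minGram = 3, int $maxGram = 4): array
{
    return array(
        'index' => $index,
        'body'  => array(
            'settings' => array(
                'number_of_shards'   => 5,
                'number_of_replicas' => 0,
                'analysis' => array(
                    'filter' => array(
                        'short_ngram' => array(
                            'type'     => 'nGram',
                            'min_gram' => $minGram,
                            'max_gram' => $maxGram,
                        ),
                    ),
                    'analyzer' => array(
                        'reuters' => array(
                            'type'      => 'custom',
                            'tokenizer' => 'standard',
                            'filter'    => array('lowercase', 'asciifolding', 'short_ngram'),
                        ),
                    ),
                ),
            ),
        ),
    );
}

$params = buildIndexSettings('my_index');
// With an elasticsearch-php client: $client->indices()->create($params);
echo json_encode($params['body']['settings']['analysis']['filter']), "\n";
```

Analysis settings are fixed when the index is created, so changing the range on existing data would generally mean recreating the index and reindexing.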
problem indexing with my analyzer
Hello,

I have an issue when I index one particular field, note_source (SQL longtext). I use the same analyzer for every field (except date_source and id_source), but for note_source I get a monitor.jvm warning. When I remove note_source, everything is fine. If I don't use an analyzer on note_source, everything is fine, but if I use my analyzer on note_source I get crashes. I think I have enough memory; I have set ES_HEAP_SIZE. Maybe my problem is with accents (ASCII, UTF-8)? Can you help me with this?

My Setting

    public function createSetting($pf){
        $params = array(
            'index' => $pf,
            'body' => array(
                'settings' => array(
                    'number_of_shards' => 5,
                    'number_of_replicas' => 0,
                    'analysis' => array(
                        'filter' => array(
                            'nGram' => array(
                                'token_chars' => array(),
                                'type' => 'nGram',
                                'min_gram' => 3,
                                'max_gram' => 250
                            )
                        ),
                        'analyzer' => array(
                            'reuters' => array(
                                'type' => 'custom',
                                'tokenizer' => 'standard',
                                'filter' => array('lowercase', 'asciifolding', 'nGram')
                            )
                        )
                    )
                )
            )
        );
        $this->elasticsearchClient->indices()->create($params);
        return;
    }

My Indexing

    public function indexTable($pf, $typeElement){
        $params = array(
            'index' => '_river',
            'type' => $typeElement,
            'id' => '_meta',
            'body' => array(
                'type' => 'jdbc',
                'jdbc' => array(
                    'url' => 'jdbc:mysql://ip/name',
                    'user' => 'root',
                    'password' => 'mdp',
                    'index' => $pf,
                    'type' => $typeElement,
                    'sql' => 'select id_source as _id, id_sous_theme, titre_source, desc_source, note_source, adresse_source, type_source, date_source from source',
                    'max_bulk_requests' => 5,
                )
            )
        );
        $this->elasticsearchClient->index($params);
    }

Thanks in advance.