Re: problem indexing with my analyzer

2014-06-25 Thread Tanguy Bernard
Yes, I did not know how nGram worked!
I found a perfect solution for my picture (base64) problem: add 'char_filter' => array('html_strip') to the analyzer.


public function createSetting($pf){
    $params = array('index' => $pf, 'body' => array(
        'settings' => array(
            'number_of_shards' => 5,
            'number_of_replicas' => 0,
            'analysis' => array(
                'filter' => array(
                    'MYnGram' => array(
                        "token_chars" => array(),
                        "type" => "nGram",
                        "min_gram" => 3,
                        "max_gram"  => 20   // reduced from the original 250
                    )
                ),
                'analyzer' => array(
                    'reuters' => array(
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => array('lowercase', 'asciifolding', 'MYnGram'),
                        // html_strip removes <img src="data:...;base64,..."> tags
                        // before tokenization, so the base64 data is never analyzed
                        'char_filter' => array('html_strip'),
                    ),
                )
            )
        )
    ));
    $this->elasticsearchClient->indices()->create($params);
}
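
A quick way to confirm the char_filter is actually picked up is the _analyze API. A minimal sketch, assuming the 1.x PHP client accepts 'analyzer' and 'text' as parameters; the index name and the sample HTML are placeholders, not values from this thread:

// Run a small HTML snippet containing a base64 image through the analyzer.
$result = $this->elasticsearchClient->indices()->analyze(array(
    'index'    => 'my_index',     // an index created with the settings above (placeholder name)
    'analyzer' => 'reuters',
    'text'     => 'Hello <img src="data:image/png;base64,iVBORw0KGgo..."/> world'
));
// With html_strip in place the tokens come only from "Hello" and "world";
// the base64 payload is removed before tokenization, so no giant tokens reach MYnGram.
print_r(array_map(function ($t) { return $t['token']; }, $result['tokens']));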

Thanks to all of you!





Re: problem indexing with my analyzer

2014-06-20 Thread Clinton Gormley
You seriously don't want 3..250-length ngrams. That's ENORMOUS.

Typically set min/max to 3 or 4, and that's it.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
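
In the PHP settings array used earlier in this thread, that suggestion would look roughly like this (a sketch; the filter name is arbitrary):

'filter' => array(
    'my_ngram' => array(
        'type'     => 'nGram',
        'min_gram' => 3,
        'max_gram' => 4
    )
)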





Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Thank you, Cédric Hourcade!




Re: problem indexing with my analyzer

2014-06-20 Thread Cédric Hourcade
If your base64 encodes are long, they are going to be split into a lot
of tokens by the standard tokenizer.

These tokens are often going to be much longer than standard words,
so your nGram filter will generate even more tokens, a lot more than
with standard text. That may be your problem there.

You should really try to strip the encoded images from your documents
with a simple regex before indexing them. If you need to keep the
source, put the raw text in an unindexed field, and the cleaned one in
another.
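
For illustration, a minimal sketch of that idea; the regex and the $row variable are assumptions, not Cédric's actual code, and a real pattern would also need to handle single-quoted or unquoted src attributes:

// Remove <img> tags whose src is an inline data: URI (base64-encoded images).
function stripBase64Images($html) {
    return preg_replace('/<img[^>]*src="data:[^"]*"[^>]*>/i', '', $html);
}

$raw     = $row['note_source'];      // keep this in an unindexed field if you need the source
$cleaned = stripBase64Images($raw);  // index and analyze this one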



Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
The user copy/pastes the content of an HTML page, and I index that
information. I take the entire document, images included. I can't change
this behavior.

I set max_gram=20. It's better, but at the end I still get this message many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm  ] [ik-test2] 
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total 
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] 
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] 
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G, which I think should be enough.
Is something wrong?


Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
I set max_gram=20. It's better, but at the end I still get this message many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm  ] [ik-test2] 
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total 
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] 
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] 
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G, which I think should be enough.
Is something wrong?
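
One thing worth double-checking: the GC line above reports a maximum heap of roughly 1 GB (the [1015.6mb] figure), so the 2G setting may not have been picked up by the node. A sketch to verify it through the nodes stats API, assuming the 1.x PHP client:

$stats = $this->elasticsearchClient->nodes()->stats();
foreach ($stats['nodes'] as $id => $node) {
    // heap_max_in_bytes should be close to 2 GB if ES_HEAP_SIZE=2g took effect
    printf("%s heap max: %d MB\n", $id, $node['jvm']['mem']['heap_max_in_bytes'] / 1048576);
}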




Re: problem indexing with my analyzer

2014-06-20 Thread Cédric Hourcade
If you are only searching the text, you should index the images in
another field, with no analyzer ("index": "not_analyzed") or, even better,
not indexed at all ("index": "no"). If you need to retrieve the image
data, it is still in the _source.

But to be honest I wouldn't even store this kind of information in ES:
your index is going to be bigger, merges are going to be slower... I'd
keep the binary files stored elsewhere.
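
As an illustration of that split, a sketch only; the field, index, and type names are made up, and it assumes the 1.x PHP client's putMapping call:

$this->elasticsearchClient->indices()->putMapping(array(
    'index' => 'my_index',
    'type'  => 'source',
    'body'  => array(
        'source' => array(
            'properties' => array(
                // cleaned text, searchable through the custom analyzer
                'note_text' => array('type' => 'string', 'analyzer' => 'reuters'),
                // raw HTML with the base64 images: kept in _source but not indexed
                'note_raw'  => array('type' => 'string', 'index' => 'no')
            )
        )
    )
));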

Cédric Hourcade
c...@wal.fr



Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Yes, I am applying "reuters" to my document (composed of text and pictures).
My goal is to be able to search the text of the document for any word or
part of a word.

Yes, the problem is my nGram filter.
How do I solve it? Decrease the nGram max? Switch to another analyzer that
still satisfies my goal?
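
For reference, searching an nGram-analyzed field for a word fragment is just a match query; a sketch, with placeholder index/type names and an arbitrary fragment:

$results = $this->elasticsearchClient->search(array(
    'index' => 'my_index',
    'type'  => 'source',
    'body'  => array(
        'query' => array(
            // "tion" matches documents whose note_source contains e.g. "information",
            // because the nGram filter indexed that fragment as its own token
            'match' => array('note_source' => 'tion')
        )
    )
));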




Re: problem indexing with my analyzer

2014-06-20 Thread Cédric Hourcade
Does it mean you're applying the "reuters" analyzer to your base64
encoded pictures?

I guess it generates a really huge number of tokens for each entry
because of your nGram filter (with a max at 250).
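
To get a feel for the numbers: a single unbroken token of length L run through an nGram filter with min 3 and max 250 produces roughly the sum of (L - n + 1) grams for n from 3 to 250. A quick sketch of that arithmetic, with 1000 characters as an example token length:

$length = 1000;   // one base64 "word" of 1000 characters (example value)
$grams  = 0;
for ($n = 3; $n <= 250; $n++) {
    if ($length >= $n) {
        $grams += $length - $n + 1;
    }
}
echo $grams;      // 216876 grams from a single token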

Cédric Hourcade
c...@wal.fr





Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Information:
My "note_source" contains pictures (.jpg, .png, ...) in base64, plus text.

For my mapping I have used:
"type" => "string"
"analyzer" => "reuters" (the name of my analyzer)


Any idea?




problem indexing with my analyzer

2014-06-19 Thread Tanguy Bernard
Hello,
I have an issue when I index one particular field, "note_source" (a SQL
longtext).
I use the same analyzer for every field (except date_source and id_source),
but for "note_source" I get a "warn monitor.jvm".
When I remove "note_source", everything is fine. If I don't use the
analyzer on "note_source", everything is fine, but if I use my analyzer on
"note_source" I have some crashes.

I think I have enough memory; I have set ES_HEAP_SIZE.
Maybe my problem is with accents (ASCII, UTF-8)?

Can you help me with this?



*My Setting*

public function createSetting($pf){
    $params = array('index' => $pf, 'body' => array(
        'settings' => array(
            'number_of_shards' => 5,
            'number_of_replicas' => 0,
            'analysis' => array(
                'filter' => array(
                    'nGram' => array(
                        "token_chars" => array(),
                        "type" => "nGram",
                        "min_gram" => 3,
                        "max_gram"  => 250
                    )
                ),
                'analyzer' => array(
                    'reuters' => array(
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => array('lowercase', 'asciifolding', 'nGram')
                    )
                )
            )
        )
    ));
    $this->elasticsearchClient->indices()->create($params);
    return;
}


*My Indexing*

public function indexTable($pf,$typeElement){

    $params = array(
        "index" => '_river',
        "type"  => $typeElement,
        "id"    => "_meta",
        "body"  => array(
            "type" => "jdbc",
            "jdbc" => array(
                "url"      => "jdbc:mysql://ip/name",
                "user"     => 'root',
                "password" => 'mdp',
                "index"    => $pf,
                "type"     => $typeElement,
                "sql"      => "select id_source as _id, id_sous_theme, titre_source, desc_source, note_source, adresse_source, type_source, date_source from source",
                "max_bulk_requests" => 5,
            )
        )
    );

    $this->elasticsearchClient->index($params);
}

Thanks in advance.
