Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Information:
My note_source field contains pictures (.jpg, .png, ...) encoded in base64, along with text.

For my mapping I have used:
type = string
analyzer = reuters (the name of my analyzer)


Any ideas?

On Thursday, June 19, 2014 at 17:57:46 UTC+2, Tanguy Bernard wrote:

 [quoted message trimmed; see the original post at the end of this thread]


-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/5d93217c-bded-40fa-8fd2-fdac576c57ee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: problem indexing with my analyzer

2014-06-20 Thread Cédric Hourcade
Does that mean you're applying the reuters analyzer to your base64-encoded
pictures?

I guess it generates a huge number of tokens for each entry
because of your nGram filter (with max_gram at 250).
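
For a sense of scale, here is a rough back-of-the-envelope sketch (my own illustration, not from the thread): a token of length L yields L − n + 1 n-grams for each length n, so with min_gram = 3 and max_gram = 250 even a single long base64 "word" explodes into hundreds of thousands of tokens.

```php
<?php
// Back-of-the-envelope count of n-grams emitted for one token of length
// $len by an nGram filter with the given min/max (illustrative only).
function ngramCount(int $len, int $min = 3, int $max = 250): int
{
    $total = 0;
    for ($n = $min; $n <= $max; $n++) {
        $total += max(0, $len - $n + 1); // substrings of length $n
    }
    return $total;
}

// One 1000-character base64 chunk (a single "word" to the standard tokenizer):
echo ngramCount(1000), "\n";       // 216876 tokens with min 3 / max 250
echo ngramCount(1000, 3, 4), "\n"; // 1995 tokens with min 3 / max 4
?>
```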

Cédric Hourcade
c...@wal.fr


On Fri, Jun 20, 2014 at 9:09 AM, Tanguy Bernard
bernardtanguy1...@gmail.com wrote:
 [quoted message trimmed]



Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Yes, I am applying reuters to my documents (composed of text and pictures).
My goal is to be able to search the text of a document for any word or
part of a word.

And yes, the problem is my nGram filter.
How do I solve it? Decrease the nGram max? Switch to another analyzer
that still satisfies my goal?

On Friday, June 20, 2014 at 10:58:49 UTC+2, Cédric Hourcade wrote:

 [quoted message trimmed]




Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
I set max_gram = 20. It's better, but at the end I still see this many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm  ] [ik-test2] 
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total 
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] 
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] 
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G. I think that's enough.
Is something wrong?


On Thursday, June 19, 2014 at 17:57:46 UTC+2, Tanguy Bernard wrote:

 [quoted message trimmed]




Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
The user copies and pastes the content of an HTML page, and I index this
information. I take the entire document, images included. I can't change
this behavior.

I set max_gram = 20. It's better, but at the end I still see this many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm  ] [ik-test2] 
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total 
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] 
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] 
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G. I think that's enough.
Is something wrong?

On Friday, June 20, 2014 at 11:45:22 UTC+2, Cédric Hourcade wrote:

 If you are only searching in the text, you should index the images in 
 another field, with no analyzer (index: not_analyzed), or even better 
 not indexed at all (index: no). If you need to retrieve the image 
 data, it's still in the _source. 

 But to be honest I wouldn't even store this kind of information in ES: 
 your index is going to be bigger, merges are going to be slower... I'd 
 keep the binary files stored elsewhere. 

 Cédric Hourcade 
 c...@wal.fr 


 On Fri, Jun 20, 2014 at 11:25 AM, Tanguy Bernard bernardt...@gmail.com wrote: 
 [quoted message trimmed]
 

Re: problem indexing with my analyzer

2014-06-20 Thread Cédric Hourcade
If your base64 strings are long, they are going to be split into a lot
of tokens by the standard tokenizer.

These tokens are often much longer than ordinary words, so your nGram
filter will generate even more tokens, far more than with ordinary
text. That may well be your problem.

You should really try to strip the encoded images from your documents
with a simple regex before indexing them. If you need to keep the
source, put the raw text in an unindexed field, and the cleaned text in
another.
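
A minimal sketch of that idea in PHP (the helper name and the regex are my own, and the pattern only targets <img> tags with inline data URIs; adjust it to the HTML you actually receive):

```php
<?php
// Strip inline base64 images (data-URI <img> tags) from pasted HTML
// before indexing. Hypothetical helper; the regex is a simple heuristic.
function stripInlineImages(string $html): string
{
    $pattern = '/<img[^>]*src\s*=\s*["\']data:image\/[^"\']*["\'][^>]*>/i';
    return preg_replace($pattern, '', $html);
}

$note = 'Some text <img src="data:image/png;base64,iVBORw0KGgo=" /> more text';
echo stripInlineImages($note), "\n"; // Some text  more text
?>
```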



Re: problem indexing with my analyzer

2014-06-20 Thread Tanguy Bernard
Thank you, Cédric Hourcade!

On Friday, June 20, 2014 at 15:32:29 UTC+2, Cédric Hourcade wrote:

 [quoted message trimmed]




Re: problem indexing with my analyzer

2014-06-20 Thread Clinton Gormley
You seriously don't want ngrams of length 3..250. That's ENORMOUS.

Typically you set min/max to 3 or 4, and that's it.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
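
Applied to the settings from this thread, that advice would look something like the sketch below (same analyzer name as in the thread, only min_gram/max_gram changed; the filter name `trigrams` is my own):

```php
<?php
// Sketch: the thread's analysis settings with a sane n-gram range.
// Only min_gram/max_gram differ from the original; names are illustrative.
$settings = array(
    'analysis' => array(
        'filter' => array(
            'trigrams' => array(
                'type'     => 'nGram',
                'min_gram' => 3,
                'max_gram' => 4, // instead of 250
            ),
        ),
        'analyzer' => array(
            'reuters' => array(
                'type'      => 'custom',
                'tokenizer' => 'standard',
                'filter'    => array('lowercase', 'asciifolding', 'trigrams'),
            ),
        ),
    ),
);
?>
```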


On 20 June 2014 16:05, Tanguy Bernard bernardtanguy1...@gmail.com wrote:

 [quoted message trimmed]




problem indexing with my analyzer

2014-06-19 Thread Tanguy Bernard
Hello
I have an issue when I index one particular field, note_source (SQL longtext).
I use the same analyzer for every field (except date_source and id_source),
but with note_source I get a monitor.jvm warning.
When I remove note_source everything is fine, and if I don't use an analyzer
on note_source everything is fine, but if I apply my analyzer to note_source
I get crashes.

I think I have enough memory; I have set ES_HEAP_SIZE.
Maybe my problem is with accents (ASCII, UTF-8)?

Can you help me with this?



*My Setting*

 public function createSetting($pf){
     $params = array('index' => $pf, 'body' => array(
         'settings' => array(
             'number_of_shards' => 5,
             'number_of_replicas' => 0,
             'analysis' => array(
                 'filter' => array(
                     'nGram' => array(
                         'token_chars' => array(),
                         'type' => 'nGram',
                         'min_gram' => 3,
                         'max_gram' => 250
                     )
                 ),
                 'analyzer' => array(
                     'reuters' => array(
                         'type' => 'custom',
                         'tokenizer' => 'standard',
                         'filter' => array('lowercase', 'asciifolding', 'nGram')
                     )
                 )
             )
         )
     ));
     $this->elasticsearchClient->indices()->create($params);
     return;
 }


*My Indexing*

 public function indexTable($pf, $typeElement){
     $params = array(
         'index' => '_river',
         'type'  => $typeElement,
         'id'    => '_meta',
         'body'  => array(
             'type' => 'jdbc',
             'jdbc' => array(
                 'url'      => 'jdbc:mysql://ip/name',
                 'user'     => 'root',
                 'password' => 'mdp',
                 'index'    => $pf,
                 'type'     => $typeElement,
                 'sql'      => 'select id_source as _id, id_sous_theme, titre_source, desc_source, note_source, adresse_source, type_source, date_source from source',
                 'max_bulk_requests' => 5,
             )
         )
     );
     $this->elasticsearchClient->index($params);
 }

Thanks in advance.
