Yes, I did not know how nGram worked!
I found a perfect solution to my picture (base64) problem: use
*'char_filter' => array('html_strip')*


public function createSetting($pf)
{
    // Create the index with a custom analyzer. The html_strip char_filter
    // removes HTML markup (including inline base64 <img> data) before
    // the text is tokenized.
    $params = array(
        'index' => $pf,
        'body'  => array(
            'settings' => array(
                'number_of_shards'   => 5,
                'number_of_replicas' => 0,
                'analysis' => array(
                    'filter' => array(
                        'MYnGram' => array(
                            // Note: 'token_chars' is a tokenizer option and is
                            // ignored on a token filter, so it is dropped here.
                            'type'     => 'nGram',
                            'min_gram' => 3,
                            'max_gram' => 20
                        )
                    ),
                    'analyzer' => array(
                        'reuters' => array(
                            'type'        => 'custom',
                            'tokenizer'   => 'standard',
                            'char_filter' => array('html_strip'),
                            'filter'      => array('lowercase', 'asciifolding', 'MYnGram')
                        )
                    )
                )
            )
        )
    );
    $this->elasticsearchClient->indices()->create($params);
}
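As the replies below point out, a wide nGram window (here 3..20) multiplies the token count dramatically; a narrow window is usually enough for partial matching. An illustrative fragment of the same filter with tighter bounds (values are a suggestion, not tested against this index):

```php
// Illustrative alternative for the MYnGram filter definition above:
// a narrow 3..4 window keeps the index size manageable while still
// supporting partial matching.
'filter' => array(
    'MYnGram' => array(
        'type'     => 'nGram',
        'min_gram' => 3,
        'max_gram' => 4
    )
),
```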

Thanks to all of you!


On Saturday, 21 June 2014 at 00:35:39 UTC+2, Clinton Gormley wrote:
>
> You seriously don't want 3..250 length ngrams!!!! That's ENORMOUS
>
> Typically set min/max to 3 or 4, and that's it
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
>
>
> On 20 June 2014 16:05, Tanguy Bernard <bernardt...@gmail.com> wrote:
>
>> Thank you Cédric Hourcade !
>>
>> On Friday, 20 June 2014 at 15:32:29 UTC+2, Cédric Hourcade wrote:
>>
>>> If your base64 encodes are long, they are going to be split into a lot 
>>> of tokens by the standard tokenizer. 
>>>
>>> These tokens are often going to be a lot longer than standard words, 
>>> so your nGram filter will generate even more tokens, far more than 
>>> with standard text. That may be your problem there. 
>>>
>>> You should really try to strip the encoded images with a simple regex 
>>> from your documents before indexing them. If you need to keep the 
>>> source, put the raw text in an unindexed field, and the cleaned one in 
>>> another. 
>>>
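Cédric's suggestion above can be sketched as a small pre-indexing step. The function name and the regex are illustrative assumptions (they only target `<img>` tags with `data:` URIs, not every markup variant):

```php
<?php
// Hypothetical helper: strip inline base64 images from a document
// before sending it to Elasticsearch. Keep the raw source in an
// unindexed field if you still need it.
function stripBase64Images($html)
{
    // Remove <img> tags whose src attribute is a data: URI
    // (i.e. an inline base64-encoded image).
    return preg_replace(
        '#<img[^>]*src\s*=\s*["\']data:image/[^"\']*["\'][^>]*>#i',
        '',
        $html
    );
}

$raw = 'Hello <img src="data:image/png;base64,iVBORw0KGgo=" alt="x"> world';
echo stripBase64Images($raw); // the <img> tag is removed
```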

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/2bdd5f30-8e97-43e0-8478-08cc26a03ed9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
