Re: Sanitize a text for indexing

2015-03-12 Thread Itamar Syn-Hershko
See
http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html
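
For illustration, here is a minimal sketch of index settings that wire a length token filter into a custom analyzer, built as a Python dict and printed as JSON. The names `drop_long_terms` and `sanitized_text` and the `max` value of 5000 are assumptions made for this example, not something taken from the thread. As far as I know, `min` and `max` are measured in characters rather than UTF-8 bytes, so treat the limit as approximate.

```python
import json

# Hypothetical analysis settings: "drop_long_terms" and "sanitized_text"
# are illustrative names, not names used anywhere in this thread.
settings = {
    "analysis": {
        "filter": {
            # The "length" token filter drops tokens outside [min, max].
            "drop_long_terms": {"type": "length", "min": 0, "max": 5000}
        },
        "analyzer": {
            "sanitized_text": {
                "type": "custom",
                "tokenizer": "standard",
                # Filters run in order, so the length check here applies
                # to the lowercased token.
                "filter": ["lowercase", "drop_long_terms"],
            }
        },
    }
}

# Request body that could be sent when creating an index.
body = json.dumps({"settings": settings}, indent=2)
print(body)
```

The position of the length filter in the chain matters: any filter placed after it can still change a token's length.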

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Thu, Mar 12, 2015 at 10:52 AM, Bernhard Berger <bernhardberger3...@gmail.com> wrote:

 Hi,

 While indexing various comments from Facebook, I sometimes get exceptions:

 IllegalArgumentException: Document contains at least one immense term...

 Is it possible to sanitize text before indexing it in Elasticsearch so that it
 doesn't throw these exceptions? Maybe there is a filter that removes overly long
 Unicode terms?

 For details about the failing documents, see my (unanswered) Stack Overflow
 question:
 http://stackoverflow.com/questions/28941570/remove-long-unicode-terms-from-string-in-java
 (I'm afraid of breaking another Elasticsearch-based mailing-list crawler, so I'd
 better not post the failing document text here ;-) )

  --
 You received this message because you are subscribed to the Google Groups
 elasticsearch group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to elasticsearch+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/elasticsearch/93a5ed0d-6486-48b4-a228-1aff47d14ce0%40googlegroups.com.
 For more options, visit https://groups.google.com/d/optout.




Re: Sanitize a text for indexing

2015-03-12 Thread Bernhard Berger

On 12.03.15 10:03, Itamar Syn-Hershko wrote:
 See
 http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html


Unfortunately, the length token filter doesn't filter out these immense
terms either.
See my example at https://gist.github.com/Hocdoc/68b5fcf8819a51816b53:
I created a length filter for terms longer than 5000 (characters? bytes?),
but I still get the exception when using the icu_normalizer:


IllegalArgumentException: Document contains at least one immense term
in field=message (whose UTF8 encoding is longer than the max length 32766),

(The length of this message value is 3728 bytes, UTF-8 encoded.)
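
As a client-side workaround, text could be sanitized before it is sent for indexing. The sketch below is only that, under stated assumptions: it splits on whitespace (the real tokenizer may split differently), it uses Python's NFKC normalization as a rough stand-in for what icu_normalizer does, and the helper names `term_bytes` and `sanitize` are made up for this example. It measures each token after normalization because compatibility normalization can expand some characters into several, which might explain how a short value ends up as a very long term. That explanation is a guess and has not been checked against the gist.

```python
import unicodedata

# Limit quoted in the exception message above.
MAX_TERM_BYTES = 32766


def term_bytes(token: str, normalize: bool = True) -> int:
    """Return the UTF-8 size of a token, optionally after NFKC normalization.

    NFKC is an assumption here: it only approximates the ICU normalizer.
    """
    if normalize:
        token = unicodedata.normalize("NFKC", token)
    return len(token.encode("utf-8"))


def sanitize(text: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Drop whitespace-separated tokens whose UTF-8 size exceeds max_bytes.

    Runs of whitespace are collapsed to single spaces as a side effect.
    """
    kept = [tok for tok in text.split() if term_bytes(tok) <= max_bytes]
    return " ".join(kept)


if __name__ == "__main__":
    immense = "a" * 40000
    print(sanitize(f"hello {immense} world"))
```

Dropping the token is the simplest policy. Truncating it to the byte limit would be an alternative, but it needs care not to cut a multi-byte character in half.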


