RE: customizing standard tokenizer

2012-02-20 Thread Torsten Krah
Thanks, I will use the custom tokenizer. It's less error-prone than the
"workarounds" mentioned.




RE: customizing standard tokenizer

2012-02-17 Thread Steven A Rowe
Hi Torsten,

The Lucene StandardTokenizer is written in JFlex (http://jflex.de) - you can 
see the version 3.X specification at: 



You can make changes to this file and then run "ant jflex-StandardAnalyzer" from 
the checked-out branch_3x sources or a source release (in the lucene/core/ 
directory in branch_3x, and in the lucene/ directory in a pre-3.6 source 
release) to regenerate the corresponding Java source code at:

  
lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java

However, I recommend a simpler strategy: use a MappingCharFilter[1] in front of 
your tokenizer to map the tokens you want left intact to strings that will not 
be broken up by the tokenizer.  For example, Lucene-Core could be mapped to 
Lucene_Core, because UAX#29[2], upon which StandardTokenizer is based, 
considers the underscore to be a "word" character, and so will leave 
Lucene_Core as a single token.  You would need to use this strategy at both 
index-time and query-time.
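
For concreteness, here is a minimal sketch of that approach against the Lucene 3.x 
API (the sample text, the mapping, and the class wiring below are illustrative, not 
a drop-in recipe); in Solr the same idea is normally configured with a 
MappingCharFilterFactory and a mapping file in the field type's analyzer, at both 
index time and query time:

  import java.io.Reader;
  import java.io.StringReader;

  import org.apache.lucene.analysis.MappingCharFilter;
  import org.apache.lucene.analysis.NormalizeCharMap;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.util.Version;

  public class MappingCharFilterSketch {
    public static void main(String[] args) throws Exception {
      // Rewrite "Lucene-Core" to "Lucene_Core" before tokenization; UAX#29
      // treats '_' as a word character, so StandardTokenizer keeps it whole.
      NormalizeCharMap map = new NormalizeCharMap();
      map.add("Lucene-Core", "Lucene_Core");

      Reader filtered = new MappingCharFilter(map,
          new StringReader("We bundle Lucene-Core with the application"));
      TokenStream ts = new StandardTokenizer(Version.LUCENE_35, filtered);
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());   // ..., Lucene_Core, ...
      }
      ts.end();
      ts.close();
    }
  }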

(I was going to add that if you wanted your indexed tokens to be the same as 
their original form, you could add a MappingTokenFilter after your tokenizer to 
do the reverse mapping, but such a thing does not yet exist :( - however, there 
is a JIRA issue for this idea: 
.)

Steve

[1] 


[2] http://unicode.org/reports/tr29/

> -----Original Message-----
> From: Torsten Krah [mailto:tk...@fachschaft.imn.htwk-leipzig.de]
> Sent: Friday, February 17, 2012 9:15 AM
> To: solr-user@lucene.apache.org
> Subject: customizing standard tokenizer
> 
> Hi,
> 
> is it possible to extend the standard tokenizer, or use a custom one
> (possibly by extending the standard one), so that some "custom" tokens
> like Lucene-Core are treated as "one" token?
> 
> regards


Re: customizing standard tokenizer

2012-02-17 Thread Em
Hi Torsten,

did you have a look at WordDelimiterFilter?

Sounds like it fits your needs.

Regards,
Em

Am 17.02.2012 15:14, schrieb Torsten Krah:
> Hi,
> 
> is it possible to extend the standard tokenizer, or use a custom one
> (possibly by extending the standard one), so that some "custom" tokens
> like Lucene-Core are treated as "one" token?
> 
> regards