Hi,

It is part of the analyzers-common module; it is not included in Lucene's core. Lucene's core module ships only a single analyzer (StandardAnalyzer) and some helper classes, but not the full set of multi-purpose and language-specific ones.
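Concretely, with the Maven setup quoted further down that only lists lucene-core and lucene-queryparser, the analyzers-common artifact would additionally be needed for `org.apache.lucene.analysis.util.CharTokenizer` to resolve; for the 7.1.0 version in question, that is:

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>7.1.0</version>
</dependency>
```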
Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Armins Stepanjans [mailto:armins.bagr...@gmail.com]
> Sent: Monday, January 8, 2018 2:09 PM
> To: java-user@lucene.apache.org
> Subject: Re: Looking For Tokenizer With Custom Delimiter
>
> Thanks for the solution; however, I am unable to access the CharTokenizer
> class when I import it using:
>
> import org.apache.lucene.analysis.util.*;
>
> although I am able to access classes directly under analysis (or
> analysis.standard) just fine with the import statement:
>
> import org.apache.lucene.analysis.*;
>
> Does this look like a Lucene-specific problem?
>
> P.S. I'm using Maven to manage my dependencies, with the following two
> entries for Lucene:
>
> <dependency>
>   <groupId>org.apache.lucene</groupId>
>   <artifactId>lucene-core</artifactId>
>   <version>7.1.0</version>
> </dependency>
>
> <dependency>
>   <groupId>org.apache.lucene</groupId>
>   <artifactId>lucene-queryparser</artifactId>
>   <version>7.1.0</version>
> </dependency>
>
> Regards,
> Armīns
>
> On Mon, Jan 8, 2018 at 12:53 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > Moin,
> >
> > This is plain easy to customize with lambdas!
> > E.g., an elegant way to create a tokenizer which behaves exactly like
> > WhitespaceTokenizer plus LowerCaseFilter is:
> >
> > Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
> >     Character::isWhitespace, Character::toLowerCase);
> >
> > Adjust the lambdas and you can create a tokenizer based on any
> > character check, so to split on whitespace or underscore:
> >
> > Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
> >     ch -> Character.isWhitespace(ch) || ch == '_');
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -----Original Message-----
> > > From: Armins Stepanjans [mailto:armins.bagr...@gmail.com]
> > > Sent: Monday, January 8, 2018 11:30 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Looking For Tokenizer With Custom Delimiter
> > >
> > > Hi,
> > >
> > > I am looking for a tokenizer where I could specify the delimiters by
> > > which the words are tokenized; for example, if I choose the delimiters
> > > ' ' and '_', the following string:
> > > "foo__bar doo"
> > > would be tokenized into:
> > > "foo", "", "bar", "doo"
> > > (The analyzer could further filter out empty tokens, since having the
> > > empty string token is not critical.)
> > >
> > > Is such functionality built into Lucene (I'm working with 7.1.0), and
> > > does this seem like the correct approach to the problem?
> > >
> > > Regards,
> > > Armīns
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
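As an aside, the splitting behavior described in the original question can be sketched in plain Java, without Lucene on the classpath. The class and method names below are made up for illustration; the sketch reproduces the literal character-by-character split (including the empty token between the two underscores), which a downstream filter could then discard:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class SeparatorSplitDemo {

    // Split text at every character matching the separator predicate.
    // Consecutive separators produce empty tokens, matching the output
    // described in the question.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isSeparator.test(ch)) {
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Delimiters: whitespace or underscore, as in the question.
        IntPredicate sep = ch -> Character.isWhitespace(ch) || ch == '_';
        System.out.println(split("foo__bar doo", sep));
        // prints: [foo, , bar, doo]
    }
}
```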