This pattern splits tokens *only* where whitespace adjoins a parenthesis, and
keeps the parentheses with the tokens:
(?<=\))\s+|\s+(?=\()
So you'll get this kind of behavior:
Tottenham Hotspur (London)
F.C. Internationale (milan)
FC Midtjylland (Herning) (Ikast)
to
Tottenham Hotspur
(London)
F.C. Internationale
(milan)
FC Midtjylland
(Herning)
(Ikast)
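
If you want to sanity-check the regex outside Solr, here is a minimal sketch in
Python. Python is only my choice for illustration: Solr uses Java regexes, but
this pattern should behave the same way in both. The helper name
split_on_parens is made up for the example.

```python
import re

# Same pattern as above. Python's re module is used here only to exercise
# the regex itself; this is not Solr's tokenizer.
SPLIT_PATTERN = re.compile(r"(?<=\))\s+|\s+(?=\()")


def split_on_parens(text):
    """Split text on whitespace that follows ')' or precedes '('."""
    return SPLIT_PATTERN.split(text)


if __name__ == "__main__":
    samples = [
        "Tottenham Hotspur (London)",
        "F.C. Internationale (milan)",
        "FC Midtjylland (Herning) (Ikast)",
        "blah ( stuff )",
    ]
    for sample in samples:
        # Print one token per line, as in the listing above.
        for token in split_on_parens(sample):
            print(token)
```

Note that whitespace *inside* the parentheses is not a split point, so the
"blah ( stuff )" case comes out as "blah" and "( stuff )". In Solr you would
pass the pattern to PatternTokenizerFactory; as far as I know, group="-1"
tells it to treat the pattern as a delimiter, but check the javadoc for your
version.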
Steve
> -----Original Message-----
> From: Erick Erickson [mailto:[email protected]]
> Sent: Friday, April 15, 2011 1:51 PM
> To: [email protected]
> Subject: Re: Split token
>
> What you've shown would be handled with WhitespaceTokenizer, but you'd have to
> prevent filters from stripping the parens. If you have to handle things like
> blah ( stuff )
> WhitespaceTokenizer wouldn't work.
>
> PatternTokenizerFactory might work for you, see:
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternTokenizerFactory.html
>
> Best
> Erick
>
> On Tue, Apr 12, 2011 at 6:02 AM, roySolr <[email protected]> wrote:
>
> > Hello,
> >
> > I want to split my string when it contains "(". Example:
> >
> > spurs (London)
> > Internationale (milan)
> >
> > to
> >
> > spurs
> > (london)
> > Internationale
> > (milan)
> >
> > What tokenizer can I use to fix this problem?
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Split-token-tp2810772p2810772.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >