Hi Sudha,

There is such a tokenizer, named NewStandardTokenizer, in the most recent patch 
on the following JIRA issue: 

   https://issues.apache.org/jira/browse/LUCENE-2167

It keeps URLs (HTTP(S), FTP, and FILE schemes) together as single tokens, and 
e-mail addresses too, in accordance with the relevant IETF RFCs.
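
To illustrate the general idea only, here is a minimal sketch in plain Java. 
It is not the code or API from the LUCENE-2167 patch: the class name 
UrlAwareTokenizerSketch and its tokenize method are hypothetical, and the 
regular expressions are deliberately simplified rather than RFC-compliant 
(for instance, punctuation directly after a URL would be kept as part of the 
token). It shows how trying URL and e-mail patterns before the plain-word 
pattern keeps them together as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Lucene or of the LUCENE-2167 patch.
public class UrlAwareTokenizerSketch {

    // Alternation order matters: at each position the URL and e-mail
    // alternatives are tried before the plain-word fallback.
    private static final Pattern TOKEN = Pattern.compile(
        "(?:https?|ftp|file)://[^\\s<>\"]+"        // simplified URL (assumption)
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"      // simplified e-mail (assumption)
        + "|\\w+");                                // plain word

    // Returns the tokens found in the text, in order of appearance.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "See http://lucene.apache.org/java/docs/ or mail dev@example.org today";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

A grammar-based tokenizer such as the one in the patch handles many cases 
that this sketch does not; the sketch only shows why pattern priority keeps 
a URL from being split at "://", "/", and ".".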

Steve

> -----Original Message-----
> From: Sudha Verma [mailto:verma.su...@gmail.com]
> Sent: Wednesday, June 23, 2010 2:07 PM
> To: java-user@lucene.apache.org
> Subject: URL Tokenization
> 
> Hi,
> 
> I am new to Lucene and I am using Lucene 3.0.2.
> 
> I am using Lucene to parse text which may contain URLs. I noticed the
> StandardTokenizer keeps the email addresses in one token, but not the
> URLs.
> I also looked at Solr wiki pages, and even though the wiki page for
> solr.StandardTokenizerFactory says it keeps track of the URL token type -
> it does not seem to be the case.
> 
> Is there an Analyzer implementation that can keep the URLs intact as one
> token? Or does anyone have an example of that for Solr or Lucene?
> 
> Thanks much,
> Sudha
