Hi Sudha,

There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following JIRA issue:

https://issues.apache.org/jira/browse/LUCENE-2167

It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF RFCs.

Steve

> -----Original Message-----
> From: Sudha Verma [mailto:verma.su...@gmail.com]
> Sent: Wednesday, June 23, 2010 2:07 PM
> To: java-user@lucene.apache.org
> Subject: URL Tokenization
>
> Hi,
>
> I am new to lucene and I am using Lucene 3.0.2.
>
> I am using Lucene to parse text which may contain URLs. I noticed the
> StandardTokenizer keeps the email addresses in one token, but not the
> URLs.
> I also looked at Solr wiki pages, and even though the wiki page for
> solr.StandardTokenizerFactory says it keeps track of the URL token type -
> it does not seem to be the case.
>
> Is there an Analyzer implementation that can keep the URLs intact into one
> token? or does anyone have an example of that for Solr or Lucene?
>
> Thanks much,
> Sudha
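
To illustrate what "keeping URLs together as single tokens" means in practice, here is a rough sketch that uses only java.util.regex. It is not the NewStandardTokenizer grammar from the patch and does not use any Lucene API: the class name UrlAwareSplitter and both patterns are hypothetical, and they are far looser than the RFC rules (for example, punctuation directly after a URL would be swallowed into the token).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or the LUCENE-2167 patch.
class UrlAwareSplitter {
    // Alternation order matters: try a URL first, then an e-mail, then a plain word.
    private static final Pattern TOKEN = Pattern.compile(
        "(?:https?|ftp|file)://[^\\s<>\"]+"       // simplified URL: scheme, then anything up to whitespace
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"     // simplified e-mail: local part, '@', dotted domain
        + "|\\w+");                                // fallback: a run of word characters

    // Returns the tokens in order of appearance; text between matches is skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this sketch, UrlAwareSplitter.tokenize("See https://issues.apache.org/jira/browse/LUCENE-2167 now") yields three tokens, with the whole URL as the second one. A real Lucene Tokenizer would also need to report offsets and token types, which this sketch leaves out.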