Thanks, that worked from the Lucene API.
Because the code has not been fully released, I ran into a few compile errors when building it. Nothing big. The path for some of the analysis classes has changed to standard/ or core/..., but a lot of the import statements in the Solr source on that trunk still point to analysis, e.g.:

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

-Sudha

On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Sudha,
>
> There is such a tokenizer, named NewStandardTokenizer, in the most recent
> patch on the following JIRA issue:
>
> https://issues.apache.org/jira/browse/LUCENE-2167
>
> It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> e-mails too, in accordance with the relevant IETF RFCs.
>
> Steve
>
> > -----Original Message-----
> > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > Sent: Wednesday, June 23, 2010 2:07 PM
> > To: java-user@lucene.apache.org
> > Subject: URL Tokenization
> >
> > Hi,
> >
> > I am new to lucene and I am using Lucene 3.0.2.
> >
> > I am using Lucene to parse text which may contain URLs. I noticed the
> > StandardTokenizer keeps the email addresses in one token, but not the
> > URLs.
> > I also looked at Solr wiki pages, and even though the wiki page for
> > solr.StandardTokenizerFactory says it keeps track of the URL token type -
> > it does not seem to be the case.
> >
> > Is there an Analyzer implementation that can keep the URLs intact into one
> > token? or does anyone have an example of that for Solr or Lucene?
> >
> > Thanks much,
> > Sudha
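
For anyone who cannot build the LUCENE-2167 patch yet, the behaviour asked about in this thread (URLs and e-mail addresses kept as single tokens) can be roughly approximated outside Lucene with a regular expression. The sketch below is plain Java and uses no Lucene classes; it is not the NewStandardTokenizer implementation. The class name UrlAwareTokenizer and the pattern are illustrative assumptions, and the pattern is much looser than the IETF RFCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
public class UrlAwareTokenizer {

    // Alternation order matters: try URL first, then e-mail, then a plain word.
    // Assumption: a URL runs until the next whitespace, '<', '>' or '"',
    // so trailing punctuation such as a full stop stays attached to it.
    private static final Pattern TOKEN = Pattern.compile(
        "(?:https?|ftp|file)://[^\\s<>\"]+"        // HTTP(S), FTP and FILE URLs
        + "|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"      // simplified e-mail address
        + "|\\w+");                                // any other run of word characters

    // Returns the tokens of the input in order of appearance.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "See https://issues.apache.org/jira/browse/LUCENE-2167"
            + " or mail user@example.com today";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

This only shows the idea of matching whole URLs before falling back to word splitting. It does no lowercasing, no token typing and no offset tracking, so it is not a replacement for a real Analyzer.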