Hi Sudha,

Sorry, I should have mentioned that the existing patch is intended for use only against the trunk version (i.e., version 4.0-dev).
Instructions for checking out a working copy from Subversion are here:

http://wiki.apache.org/lucene-java/HowToContribute

Once you've done that, change directory to the root of the checked-out working copy and apply the patch as you did previously.

Steve

> -----Original Message-----
> From: Sudha Verma [mailto:verma.su...@gmail.com]
> Sent: Thursday, June 24, 2010 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: URL Tokenization
>
> Hi Steve,
>
> Thanks for the quick reply and for implementing support for URL
> tokenization. Another newbie question about applying this patch.
>
> I have the Lucene 3.0.2 source. I downloaded the patch and tried to
> apply it:
>
> lucene-3.0.2> patch -p0 < LUCENE-2167.patch
>
> It comes back with the error message:
>
> ....(output truncated)
> can't find file to patch at input line 13106
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
>
> Looking at that line, it seems to be trying to find
> modules/analysis/common/build.xml -- which is not part of the official
> 3.0.2 source release. Thinking about it, maybe I need to use the latest
> source (or a nightly build), but I couldn't figure out how to get that.
> The Hudson link for nightly builds on the Apache Lucene site seems to be
> broken. Or maybe I have a different problem.
>
> I'd appreciate any help.
>
> Thanks,
> Sudha
>
>
> On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sar...@syr.edu> wrote:
>
> > Hi Sudha,
> >
> > There is such a tokenizer, named NewStandardTokenizer, in the most
> > recent patch on the following JIRA issue:
> >
> > https://issues.apache.org/jira/browse/LUCENE-2167
> >
> > It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> > e-mails too, in accordance with the relevant IETF RFCs.
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > > Sent: Wednesday, June 23, 2010 2:07 PM
> > > To: java-user@lucene.apache.org
> > > Subject: URL Tokenization
> > >
> > > Hi,
> > >
> > > I am new to Lucene and I am using Lucene 3.0.2.
> > >
> > > I am using Lucene to parse text which may contain URLs. I noticed
> > > that the StandardTokenizer keeps e-mail addresses in one token, but
> > > not URLs. I also looked at the Solr wiki pages, and even though the
> > > page for solr.StandardTokenizerFactory says it keeps track of the
> > > URL token type, that does not seem to be the case.
> > >
> > > Is there an Analyzer implementation that can keep URLs intact in one
> > > token? Or does anyone have an example of that for Solr or Lucene?
> > >
> > > Thanks much,
> > > Sudha
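[Editor's note: the "can't find file to patch" / "wrong -p or --strip option" error above comes from how `patch -p0` resolves paths. A self-contained sketch, using hypothetical file contents that merely mimic the trunk's `modules/` layout: `-p0` strips nothing from the paths in the patch header, so the command must be run from the directory those relative paths are rooted at.]

```shell
# Build a tiny tree shaped like the trunk layout (paths illustrative only)
mkdir -p demo/modules/analysis/common
printf 'old\n' > demo/modules/analysis/common/build.xml

# A minimal unified diff whose header uses the full relative path,
# the way patches generated from the working-copy root do
cat > demo/fix.patch <<'EOF'
--- modules/analysis/common/build.xml
+++ modules/analysis/common/build.xml
@@ -1 +1 @@
-old
+new
EOF

# -p0 keeps the header paths verbatim, so invoke patch from the tree root
cd demo
patch -p0 < fix.patch
cat modules/analysis/common/build.xml   # prints: new
```

Running the same `patch -p0` one directory up would reproduce Sudha's "can't find file to patch" error, because `modules/analysis/common/build.xml` no longer resolves relative to the current directory.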