Hi Sudha,

Sorry, I should have mentioned that the existing patch is intended for use only against the trunk version (i.e., version 4.0-dev).
Instructions for checking out a working copy from Subversion are here:

http://wiki.apache.org/lucene-java/HowToContribute

Once you've done that, change directory to the root of the checked-out working copy and apply the patch as you did previously.

Steve

> -----Original Message-----
> From: Sudha Verma [mailto:verma.su...@gmail.com]
> Sent: Thursday, June 24, 2010 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: URL Tokenization
>
> Hi Steve,
>
> Thanks for the quick reply and for implementing support for URL
> tokenization. Another newbie question about applying this patch.
>
> I have the Lucene 3.0.2 source. I downloaded the patch and tried to
> apply it:
>
> lucene-3.0.2> patch -p0 < LUCENE-2167.patch
>
> It comes back with the error message:
>
> ....(output truncated)
> can't find file to patch at input line 13106
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
>
> Looking at that line, it seems to be trying to find
> modules/analysis/common/build.xml -- which is not part of the official
> 3.0.2 source release. Thinking about it, maybe I need to use the latest
> source (or a nightly build), but I couldn't figure out how to get that.
> The Hudson link for nightly builds on the Apache Lucene site seems to be
> broken. Or maybe I have a different problem.
>
> I'd appreciate any help.
>
> Thanks,
> Sudha
>
>
> On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sar...@syr.edu> wrote:
>
> > Hi Sudha,
> >
> > There is such a tokenizer, named NewStandardTokenizer, in the most
> > recent patch on the following JIRA issue:
> >
> > https://issues.apache.org/jira/browse/LUCENE-2167
> >
> > It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> > e-mails too, in accordance with the relevant IETF RFCs.
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > > Sent: Wednesday, June 23, 2010 2:07 PM
> > > To: java-user@lucene.apache.org
> > > Subject: URL Tokenization
> > >
> > > Hi,
> > >
> > > I am new to Lucene and I am using Lucene 3.0.2.
> > >
> > > I am using Lucene to parse text which may contain URLs. I noticed
> > > that the StandardTokenizer keeps e-mail addresses in one token, but
> > > not URLs. I also looked at the Solr wiki pages, and even though the
> > > page for solr.StandardTokenizerFactory says it keeps track of the
> > > URL token type, that does not seem to be the case.
> > >
> > > Is there an Analyzer implementation that can keep URLs intact in one
> > > token? Or does anyone have an example of that for Solr or Lucene?
> > >
> > > Thanks much,
> > > Sudha
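[Editor's note: the "can't find file to patch" / "wrong -p or --strip option" error above comes from how `patch -p0` resolves paths. A self-contained sketch, using hypothetical file contents that merely mimic the trunk's `modules/` layout: `-p0` strips nothing from the paths in the patch header, so the command must be run from the directory those relative paths are rooted at.]

```shell
# Build a tiny tree shaped like the trunk layout (paths illustrative only)
mkdir -p demo/modules/analysis/common
printf 'old\n' > demo/modules/analysis/common/build.xml

# A minimal unified diff whose header uses the full relative path,
# the way patches generated from the working-copy root do
cat > demo/fix.patch <<'EOF'
--- modules/analysis/common/build.xml
+++ modules/analysis/common/build.xml
@@ -1 +1 @@
-old
+new
EOF

# -p0 keeps the header paths verbatim, so invoke patch from the tree root
cd demo
patch -p0 < fix.patch
cat modules/analysis/common/build.xml   # prints: new
```

Running the same `patch -p0` one directory up would reproduce Sudha's "can't find file to patch" error, because `modules/analysis/common/build.xml` no longer resolves relative to the current directory.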