Re: URL Tokenization

Sudha Verma Fri, 25 Jun 2010 20:35:00 -0700

Thanks,

That worked from the Lucene API.


Because the code is not fully released yet, some of it had build errors, but
nothing big. I ran into a few compile errors because the path for some of the
analysis classes was changed to standard/ or core/. A lot of the import
statements in the Solr source from that trunk still point to analysis (e.g.
import org.apache.lucene.analysis.Tokenizer; import
org.apache.lucene.analysis.tokenattributes.OffsetAttribute;).

-Sudha


On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Sudha,
>
> There is such a tokenizer, named NewStandardTokenizer, in the most recent
> patch on the following JIRA issue:
>
>   https://issues.apache.org/jira/browse/LUCENE-2167
>
> It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> e-mails too, in accordance with the relevant IETF RFCs.
>
> Steve
>
> > -----Original Message-----
> > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > Sent: Wednesday, June 23, 2010 2:07 PM
> > To: java-user@lucene.apache.org
> > Subject: URL Tokenization
> >
> > Hi,
> >
> > I am new to Lucene and I am using Lucene 3.0.2.
> >
> > I am using Lucene to parse text which may contain URLs. I noticed the
> > StandardTokenizer keeps the email addresses in one token, but not the
> > URLs.
> > I also looked at Solr wiki pages, and even though the wiki page for
> > solr.StandardTokenizerFactory says it keeps track of the URL token type -
> > it does not seem to be the case.
> >
> > Is there an Analyzer implementation that can keep the URLs intact as
> > one token? Or does anyone have an example of that for Solr or Lucene?
> >
> > Thanks much,
> > Sudha
>
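
For readers who want to see the behaviour discussed in this thread (URLs and
e-mail addresses kept together as single tokens) without building the
LUCENE-2167 patch, here is a minimal sketch in plain Java. It is an
illustration only: it uses java.util.regex rather than the Lucene API, the
class name UrlAwareTokenizerSketch and its tokenize method are hypothetical,
and the patterns are a loose approximation, not the grammar used by
NewStandardTokenizer and nowhere near full conformance with the IETF RFCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene or Solr.
class UrlAwareTokenizerSketch {

    // Alternatives are tried in order, so URLs and e-mail addresses win
    // over plain words. These patterns are simplified assumptions.
    private static final Pattern TOKEN = Pattern.compile(
        "((?:https?|ftp|file)://[^\\s<>\"]+)"                    // group 1: URL
        + "|([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,})"   // group 2: e-mail
        + "|([\\p{L}\\p{N}]+)");                                 // group 3: word

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (m.group(1) != null) {
                // Drop sentence punctuation that trails a URL, e.g. "...html,"
                token = token.replaceAll("[.,;:!?)]+$", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Docs at http://lucene.apache.org/java/docs/index.html,"
            + " or mail java-user@lucene.apache.org.";
        // Print one token per line.
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

With the sample sentence in main, the URL and the list address each come out
as one token, while the surrounding words are split as usual. A real analyzer
would also need offsets, token types and lowercasing, which this sketch
leaves out.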
