RE: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Delbru, Renaud
A StandardTokenizer variant that keeps URIs intact is available here: ://github.com/rdelbru/lucene-uri-preserving-standard-tokenizer -- Renaud Delbru

Re: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Sudha Verma
Thanks. I was hoping Lucene would already have a solution for this, since it seems like a common problem. I am new to the Lucene API. If I were to implement something from scratch, would my option be to extend the Tokenizer to recognize URLs with a regex and then pass the remaining text to StandardTokenizer?
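The approach described above, matching URLs with a regex first and only word-splitting the remaining text, can be sketched outside of Lucene using plain `java.util.regex`. This is an illustrative stand-in, not Lucene API: the class name, the simplified URL pattern, and the whitespace-based fallback (standing in for StandardTokenizer) are all assumptions for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative pre-tokenizer: emits each URL as a single token and
 * word-splits everything else (a crude stand-in for StandardTokenizer).
 */
public class UrlAwareTokenizer {
    // Deliberately simplified URL pattern; a production pattern needs more care.
    private static final Pattern URL = Pattern.compile("https?://[^\\s]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = URL.matcher(text);
        int last = 0;
        while (m.find()) {
            splitPlain(text.substring(last, m.start()), tokens);
            tokens.add(m.group()); // the URL is kept intact as one token
            last = m.end();
        }
        splitPlain(text.substring(last), tokens);
        return tokens;
    }

    // Word-split non-URL text; empty fragments are dropped.
    private static void splitPlain(String chunk, List<String> out) {
        for (String t : chunk.split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
    }

    public static void main(String[] args) {
        // → [See, http://lucene.apache.org, for, docs]
        System.out.println(tokenize("See http://lucene.apache.org for docs"));
    }
}
```

In real Lucene code the same idea would live inside a Tokenizer (or a CharFilter in front of one) so the URL token flows through the analysis chain like any other term.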

RE: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Steven A Rowe
Hi Sudha, In the past, I've built regexes to recognize URLs using the information here: http://www.foad.org/~abigail/Perl/url2.html The above, however, is currently a dead link. Here's the Internet Archive's Wayback Machine's cache of this page from August 2007:
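As a rough illustration of the kind of URL-recognizing regex being discussed, here is a deliberately simplified pattern in Java. The pattern and class name are assumptions for the sketch; a faithful URL grammar (RFC 3986, or the page cited above) is considerably larger.

```java
import java.util.regex.Pattern;

public class UrlRegexDemo {
    // Simplified sketch: scheme, host, optional port, optional path.
    // Real URL grammars admit many more forms than this.
    static final Pattern URL = Pattern.compile(
        "(?:https?|ftp)://" +   // scheme
        "[A-Za-z0-9.-]+" +      // host
        "(?::\\d+)?" +          // optional port
        "(?:/[^\\s]*)?");       // optional path/query

    public static void main(String[] args) {
        String[] samples = {
            "http://www.foad.org/~abigail/Perl/url2.html",
            "ftp://ftp.example.com:21/pub/file.txt",
            "not-a-url"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + URL.matcher(s).matches());
        }
    }
}
```

A pattern like this could feed the pre-tokenization step discussed earlier in the thread, so that any text matching it is emitted as a single token.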