Modifying idf()?

2010-07-30 Thread Pablo Mendes
Hi all, I'd like to do a very simple change to the idf computation, but I can't seem to wrap my head around it. There are very useful hints in the javadocs for "Changing Similarity" for new tf() and lengthNorm() behavior, but it was a little bit blurrier for idf() http://lucene.apache.org/java/3_0

Re: Modifying IDF

2010-02-01 Thread Franz Allan Valencia See
Hmm. My Analyzer is a dictionary-based Analyzer, so it only recognizes tokens in its dictionary. Adding every URL (or domain) is not a viable solution. So how could I include that in my analyzer? A Lucene Filter? A FilterReader? Thanks, -- Franz Allan Valencia See | Java Software Engineer

Re: Modifying IDF

2010-01-30 Thread Ian Lea
Are you asking how to get lucene.apache.org out of http://lucene.apache.org/ or how to get apache.org out of lucene.apache.org? The getHost() method of java.net.URL will give you the former. Or use a regexp. I don't know an easy way to do the latter, but depending on your requirements you could s
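A minimal sketch of the two cases distinguished above. The first uses `java.net.URL.getHost()` as suggested in the message. The second is only an assumption for illustration: a naive "keep the last two labels" heuristic in a hypothetical helper `naiveDomainOf`, which is not a method anyone in the thread proposed and which gives wrong answers for hosts under multi-part suffixes such as `example.co.uk`.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostExtractor {

    // Returns the host part of a URL, e.g. "lucene.apache.org".
    static String hostOf(String url) throws MalformedURLException {
        return new URL(url).getHost();
    }

    // Hypothetical helper (an assumption, not from the thread):
    // keeps only the last two dot-separated labels of a host.
    // This is wrong for suffixes like "co.uk".
    static String naiveDomainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) throws MalformedURLException {
        String host = hostOf("http://lucene.apache.org/");
        System.out.println(host);                 // lucene.apache.org
        System.out.println(naiveDomainOf(host));  // apache.org
    }
}
```

The extracted value could then be indexed in its own field and searched directly, which avoids touching tf/idf at all.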

Re: Modifying IDF

2010-01-29 Thread Franz Allan Valencia See
How should I go about identifying the domain? Thanks, -- Franz Allan Valencia See | Java Software Engineer franz@gmail.com LinkedIn: http://www.linkedin.com/in/franzsee Twitter: http://www.twitter.com/franz_see On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea wrote: > Instead of playing around wi

Re: Modifying IDF

2010-01-29 Thread Ian Lea
Instead of playing around with tf/idf, how about just indexing and searching the domain? -- Ian. On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See wrote: > Good day, > > I am currently using Lucene for my searches. And one of the problems that I'm > facing is when the keyword is a URL. The

Modifying IDF

2010-01-28 Thread Franz Allan Valencia See
Good day, I am currently using Lucene for my searches. One of the problems that I'm facing is when the keyword is a URL. Tokens such as http, https, ://, index, html, etc. seem to be messing up our search results. The focus was supposed to be only on the URL domain. The idea that I have i