Re: Eliminating duplicate result

2003-11-26 Thread Dragan Jotanovic
 When you are doing two searches are you searching for two different terms?
 

No, I am searching for the same term.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: unexpected results from query

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 10:45  PM, marc wrote:
Hi,

assume a field has the following text

Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) 

the following searches all return this document

AMP
"AMP"
AMP;

Can someone explain this to me? I figured that only the first query
would be successful.
This depends on the Analyzer you're using.  I'm assuming you're using
the QueryParser and an analyzer that rips off special characters - so
essentially the TermQuery underneath is always for "AMP".

Have a look at my first java.net article which shows the analysis 
process.  Run your sample text through the code provided there to see 
the effect first-hand.
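For illustration, here is a minimal sketch along the lines of that article's demo (not its exact code; the class name ShowTokens is invented), assuming the Lucene 1.3-era API. It prints the tokens an analyzer actually produces for the text above:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        String text = "Adenylate kinase (mitochondrial GTP:AMP phosphotransferase)";
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        // Each bracketed term printed here is what ends up in the index,
        // which is why "AMP" and "AMP;" match the same document.
        for (Token token = stream.next(); token != null; token = stream.next())
            System.out.print("[" + token.termText() + "] ");
    }
}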

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
woah, that seems like an awfully complex answer to the question of
"how to tokenize at a comma rather than a space"!  %-)

On Tuesday, November 25, 2003, at 11:48  AM, MOYSE Gilles (Cetelem) 
wrote:

Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance:
	time_out
	expert_system
	...
You can use any character to specify the expression link. Here, I use the
underscore (_).

Then, you have to build an expression loader. You can store expressions in
recursive HashMaps.
Such a HashMap must be built so that HashMap.get(word1) = HashMap, and
(HashMap.get(word1)).get(word2) = null, if you want to code the
expression word1_word2.
In other words, 'HashMap.get(a_word)' returns a HashMap containing all the
successors of the word 'a_word'.

So, if your expression file looks like this:
	time_out
	expert_system
	expert_in_information
you'll have to build a loader which returns a HashMap H so that:
	H.keySet() = {time, expert}
	((HashMap)H.get(time)).keySet() = {out}
	((HashMap)H.get(time)).get(out) = null	// null indicates the end of the expression
	((HashMap)H.get(expert)).keySet() = {system, in}
	((HashMap)H.get(expert)).get(system) = null
	((HashMap)((HashMap)H.get(expert)).get(in)).keySet() = {information}
	((HashMap)((HashMap)H.get(expert)).get(in)).get(information) = null
These recursive HashMaps encode the following tree:
	time ---- out ---- null
	expert -+-- system ---- null
	        |-- in ---- information ---- null
Such an expression loader may be designed this way:

public static HashMap getExpressionMap(File wordfile) {
	HashMap result = new HashMap();

	try
	{
		String line = null;
		LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
		HashMap hashToAdd = null;

		while ((line = in.readLine()) != null)
		{
			if (line.startsWith(FILE_COMMENT_CHARACTER))
				continue;
			if (line.trim().length() == 0)
				continue;
			StringTokenizer stok = new StringTokenizer(line, " \t_");
			String curTok = "";
			HashMap currentHash = result;

			// Test whether the expression contains at least 2 words
			if (stok.countTokens() < 2)
			{
				System.err.println("Warning: '" + line + "' in file '"
					+ wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
					+ " is not an expression.\n\tA valid expression contains at least 2 words.");
				continue;
			}

			while (stok.hasMoreTokens())
			{
				curTok = stok.nextToken();
				// If a comment starts at the end of the line, stop
				if (curTok.startsWith(FILE_COMMENT_CHARACTER))
					break;
				if (stok.hasMoreTokens())
					hashToAdd = new HashMap(6);
				else
					hashToAdd = null;	// null marks the end of the expression

				if (!currentHash.containsKey(curTok))
					currentHash.put(curTok, hashToAdd);

				currentHash = (HashMap) currentHash.get(curTok);
			}
		}
		return result;
	}
	// On error, return an empty table
	catch (Exception e)
	{
		System.err.println("While processing '" + wordfile.getAbsolutePath()
			+ "': " + e.getMessage());
		e.printStackTrace();
		return new HashMap();
	}
}

Then, you must build a filter with two FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the expression loader above.

When you receive a token, you check whether it is null or not:
	If it is null, you check whether the default stack is empty or not.
		If it is not empty, you pop a token from the default stack and
return it.
		If it is empty, you return null.
	If it is not null, you check whether it is contained in the current
map or not (curMap.containsKey(token)).
		If it is not contained and you were building an expression,
you pop all the terms from the expression stack and push them onto the
default stack (so as not to lose information).
		If it is not contained and the default stack is empty, you
return the token.
		If it is not contained and the default stack is not empty,
you return the popped token from the default stack and you push the
current token.
	If the token is contained in curMap, then the token MAY be the
first element of an expression.
		You push the token onto the expression stack, and you dive
into the next level of your expression tree (curMap = curMap.get(token)).
		If the next level (now curMap) is null, then you have
completed your expression. You can pop all the tokens from the expression
stack, concatenate them separated by underscores, and push the resulting
String as a token onto the default stack (so as to
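A condensed sketch of the filter described above, assuming the Lucene 1.3-era TokenStream API and the getExpressionMap() loader; it folds the two stacks into a single pending queue, and unlike the full design it does not re-examine flushed tokens for new expression starts:

import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ExpressionFilter extends TokenFilter {
    private final HashMap expressions;                   // root of the expression tree
    private final LinkedList pending = new LinkedList(); // tokens waiting to be emitted

    public ExpressionFilter(TokenStream in, HashMap expressions) {
        super(in);
        this.expressions = expressions;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty())
            return (Token) pending.removeFirst();
        Token token = input.next();
        if (token == null || !expressions.containsKey(token.termText()))
            return token;

        // Possible start of an expression: collect tokens while the tree matches.
        LinkedList matched = new LinkedList();
        matched.add(token);
        HashMap curMap = (HashMap) expressions.get(token.termText());
        while (curMap != null) {
            Token next = input.next();
            if (next == null || !curMap.containsKey(next.termText())) {
                // No complete expression: emit the collected tokens one by one.
                pending.addAll(matched.subList(1, matched.size()));
                if (next != null)
                    pending.add(next);
                return (Token) matched.getFirst();
            }
            matched.add(next);
            curMap = (HashMap) curMap.get(next.termText()); // null marks the end
        }

        // Complete expression: join the words with underscores into one token.
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < matched.size(); i++) {
            if (i > 0)
                sb.append('_');
            sb.append(((Token) matched.get(i)).termText());
        }
        Token first = (Token) matched.getFirst();
        Token last = (Token) matched.getLast();
        return new Token(sb.toString(), first.startOffset(), last.endOffset());
    }
}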

Re: Tokenizing text custom way

2003-11-26 Thread Dragan Jotanovic
 You will need to write a custom analyzer.  Don't worry, though it's
 quite straightforward.  You will also need to write a Tokenizer, but
 Lucene helps you a lot here.

Wouldn't I achieve the same result if I indexed "time out" as "time_out"
using StandardAnalyzer? Later, if I searched for "time out" (inside quotes)
I should get a proper result, but if I searched for "time" I shouldn't get a
result. Is this right?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Chinese input.

2003-11-26 Thread Otis Gospodnetic
Maybe this will help?

  http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23545

Otis

--- Tun Lin [EMAIL PROTECTED] wrote:
 Hi,
 
 May I know how I can analyse Chinese input from Chinese text in
 Lucene?
 
 Do I use the Analyzer feature in Lucene? If so, how do I go about
 using it?
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote:
 You will need to write a custom analyzer.  Don't worry, though it's
 quite straightforward.  You will also need to write a Tokenizer, but
 Lucene helps you a lot here.
 Wouldn't I achieve the same result if I indexed "time out" as "time_out"
 using StandardAnalyzer? Later, if I searched for "time out" (inside quotes)
 I should get a proper result, but if I searched for "time" I shouldn't get a
 result. Is this right?
I'm confused about what you are planning to do.  Are you going to replace
all spaces with an underscore before handing the text to the analyzer?
StandardAnalyzer will still split at the underscores, though.

If you have special tokenization needs, why try to hack it somehow 
rather than address it cleanly in the way Lucene was designed to work?
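For what it's worth, a minimal sketch of that clean route, assuming the Lucene 1.3-era API: a CharTokenizer that treats the underscore as a token character, so an indexed time_out survives as a single term (the class name UnderscoreAnalyzer is invented):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class UnderscoreAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Keep letters, digits and underscores together as one token.
        TokenStream stream = new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return Character.isLetterOrDigit(c) || c == '_';
            }
        };
        return new LowerCaseFilter(stream);
    }
}

The same analyzer must be used at both index and query time, or the terms won't line up.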

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Tokenizing text custom way

2003-11-26 Thread MOYSE Gilles (Cetelem)
Do you want to define expressions, i.e. a set of terms that must be
interpreted as a whole?
For instance, when the Analyzer catches "time" followed by "out", it returns
"time_out"?


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 12:12 PM
To: Lucene Users List
Subject: Re: Tokenizing text custom way


 You will need to write a custom analyzer.  Don't worry, though it's
 quite straightforward.  You will also need to write a Tokenizer, but
 Lucene helps you a lot here.

Wouldn't I achieve the same result if I indexed "time out" as "time_out"
using StandardAnalyzer? Later, if I searched for "time out" (inside quotes)
I should get a proper result, but if I searched for "time" I shouldn't get a
result. Is this right?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: log4j.properties

2003-11-26 Thread Stephane Vaucher
As I've said previously, it's a log4j problem and not a Lucene problem;
you should post there.

sv

On Wed, 26 Nov 2003, Tun Lin wrote:

 I have created the following log4j.properties and put it in my classpath, but
 I still get that error. Can anyone help?
 
 log4j.rootCategory=stdout
 
 log4j.appender.stdout=org.apache.log4j.ConsoleAppender
 log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
 log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: log4j.properties

2003-11-26 Thread Tun Lin
I have integrated Lucene and PDFBox and tried the following command to index
files:

java -Dlog4j.configuration=log4j.xml org.pdfbox.searchengine.lucene.IndexFiles
-create -index c:\\index ..

But I get the following error message:
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Can anyone help?
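For what it's worth, two details look inconsistent here: the command points log4j at log4j.xml while the file created earlier is log4j.properties, and the root category line is missing a level. A minimal log4j.properties along these lines (log4j 1.2 syntax) should satisfy the warning:

# The root category needs a level *and* an appender name.
log4j.rootCategory=INFO, stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n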

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 26, 2003 5:19 PM
To: Lucene Users List
Subject: Re: log4j.properties

What does this have to do with Lucene?


On Wednesday, November 26, 2003, at 01:04  AM, Tun Lin wrote:

 I have created the following log4j.properties and put it in my
 classpath, but I still get that error. Can anyone help?

 log4j.rootCategory=stdout

 log4j.appender.stdout=org.apache.log4j.ConsoleAppender
 log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
 log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Question - not returning desired results

2003-11-26 Thread Pleasant, Tracy
Thanks this helps a lot :)

 



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 4:58 AM
To: Lucene Users List
Subject: Re: Search Question - not returning desired results


On Tuesday, November 25, 2003, at 12:11  PM, Pleasant, Tracy wrote:

 The documents I have indexed also contain information regarding file
 names.

 For instance 'return_results.pl' or something like that may be in the 
 document fields.

 I am not understanding Lucene's way of searching:

 1. If I search for 'return_results', the search does not return 
 anything
 2. If I search for 'results' or 'return', the search does not return 
 anything
 3. If I search for 'results.pl', the search does return the document 
 containing 'return_results.pl'
 4. If I search for 'results~', the search does return the document 
 containing 'return_results.pl'
 5. If I search for 'return_results~', the search does not return 
 anything

 What is going on?

 I want it to return the document in all of the situations.

 I also don't want to have to use '~' all the time.

We sure do have a recurring theme lately :)  Analysis!

Please refer to my article at java.net:

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Look at the AnalysisDemo code.  Copy it over and try it out on the text 
you're using and the Analyzer you're using.  The bracketed text that
comes out shows the tokens that you can search on.  It is very, very
important to understand this process and to really know what terms come 
out of text you hand it - otherwise it is a mystery why some things can 
be found and some things cannot despite your expectations to the 
contrary.

A follow-up to analysis is querying - and QueryParser has its own
set of quirks and caveats related to how things are tokenized/analyzed.
And, I've got just the follow-up article for you handy...


http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
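As a companion sketch (not code from the article; the class name ShowQuery is invented), the Lucene 1.3-era API lets you print what QueryParser actually built from a query string, which makes those quirks visible:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ShowQuery {
    public static void main(String[] args) throws Exception {
        // Parse the query the same way a search would, then print the
        // analyzed form (e.g. a TermQuery or PhraseQuery) for inspection.
        Query query = QueryParser.parse("return_results.pl", "contents",
                                        new StandardAnalyzer());
        System.out.println(query.toString("contents"));
    }
}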

If you digest both of these articles (analysis one first please) then I 
think a lot of questions that get asked on this list will be implicitly 
answered.  Understanding analysis is key.

Erik



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Question - not returning desired results

2003-11-26 Thread Pleasant, Tracy
Erik,

I think there may be a typo on the website.

When I run the AnalyzerDemo:

Analyzing "xyz corporation - [EMAIL PROTECTED]"
org.apache.lucene.analysis.standard.StandardAnalyzer:
[xyz] [corporation] [EMAIL PROTECTED]

Your website says:

org.apache.lucene.analysis.standard.StandardAnalyzer:
[xyz] [corporation] [EMAIL PROTECTED] [com]

When I run it, it keeps the entire email address '[EMAIL PROTECTED]',
but according to your website it separates '[EMAIL PROTECTED]' from
'com'.

Is there a difference between the versions of Lucene? I'm using 1.3rc2.

Plus, I think what I want is a StandardAnalyzer with a little tweaking.
The simple one was fine until I realized that it doesn't handle numbers,
which I need as part of my search, since numbers are important for what
I'm doing. The Standard one does numbers, but I need it to be a little
different, of course. Thanks for the site.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Eliminating duplicate result

2003-11-26 Thread Pleasant, Tracy
If you are searching for the same term and searching the same index twice, it
will return the same results...

I don't get what you are asking.


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 3:19 AM
To: Lucene Users List
Subject: Re: Eliminating duplicate result


 When you are doing two searches are you searching for two different terms?
 

No, I am searching for the same term.


What is the easiest way to eliminate duplicate documents if one is doing two searches
on the same index?

Has anybody done something similar?
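One simple approach, sketched here under the assumption that both searches run against the same IndexSearcher (the class and method names are invented): track Lucene's internal document ids in a Set and skip any document already seen.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.search.Hits;

public class MergeHits {
    // Merge two Hits objects from the same index, dropping duplicates;
    // hits.id(i) is the index-internal document number.
    public static List mergeWithoutDuplicates(Hits first, Hits second)
            throws IOException {
        Set seen = new HashSet();
        List docs = new ArrayList();
        Hits[] all = { first, second };
        for (int h = 0; h < all.length; h++) {
            for (int i = 0; i < all[h].length(); i++) {
                Integer id = new Integer(all[h].id(i));
                if (seen.add(id))               // false if already present
                    docs.add(all[h].doc(i));
            }
        }
        return docs;
    }
}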




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 11:33  AM, Pleasant, Tracy wrote:
Your website says:

org.apache.lucene.analysis.standard.StandardAnalyzer:
[xyz] [corporation] [EMAIL PROTECTED] [com]
When I run it, it keeps the entire email address '[EMAIL PROTECTED]',
but according to your website it separates '[EMAIL PROTECTED]' from
'com'.
Is there a difference between the versions of Lucene? I'm using 1.3rc2.
Yes, I fixed the bug in the StandardTokenizer that caused e-mail 
addresses to get split, but fixed it after the article was written.  
Good eye!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Collaborative Filtering API

2003-11-26 Thread Steven J. Owens
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:
 Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

 I've actually worked on a system that bundled GroupLens.  I think
it was Vignette StoryServer.  The Vignette docs were incredibly dense
with MarketingNewSpeak, so I could never quite figure out what they
said GroupLens actually *did* (not at a web-capable terminal right
now, or I'd just google it).

 Collaborative filtering in general is a topic I'm interested in,
and is why I first got into Lucene.  I wanted and still want to build
a collaborative filtering search engine for mailing lists and the
like.

 I do remember that FireFly's engine was supposed to graph all of
the users' ratings on a topic in an N-dimensional space, then find
users close to a given user in that N-dimensional space, and
suggest topics those neighbours had liked but that the current user
hadn't rated.
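As a toy sketch of that idea (invented data, not FireFly's actual algorithm): treat each user's ratings as a vector, score users by cosine similarity, and recommend what the closest neighbour liked that the target user hasn't rated.

public class NeighbourSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] me = { 5, 0, 3 };                     // my ratings; 0 = unrated
        double[][] others = { { 4, 5, 3 }, { 1, 5, 0 } };

        // Find the user closest to me in rating space.
        int best = 0;
        for (int u = 1; u < others.length; u++)
            if (cosine(me, others[u]) > cosine(me, others[best]))
                best = u;

        // Suggest topics the nearest neighbour liked that I haven't rated.
        for (int t = 0; t < me.length; t++)
            if (me[t] == 0 && others[best][t] >= 4)
                System.out.println("Suggest topic " + t);
    }
}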

 I'm interested in more of a free-market sort of approach than
in statistical analysis; I want to build a system that helps users
express their opinions, then nurtures an emerging consensus.  My
experience has been that systems and technologies that try to
facilitate the way users already do things, instead of replacing them
with new ways of doing things, tend to work better.

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - Me at http://darksleep.com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Dates and others

2003-11-26 Thread Dion Almaer
Hi guys -

So I am getting happier with search, and just pushed the lucene version live at:

http://www.theserverside.com (on the leftbar) and:
http://www.theserverside.com/home/search/index.jsp

The only real item that I still want to tweak more is getting recent results higher in 
the list.

I was wondering if something like this could work (or if there is a better solution).

At index time, I have the date of the content.  I could do some math where the more
recent the date (based on the time_t version or whatever), the larger the value passed
to setBoost(metric). Or, for every month in the past, apply a larger penalty via
setBoost()... or something like that.

Would something like this make sense?
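A rough sketch of the first idea, assuming the 1.3-era Document.setBoost mentioned in this thread; the decay constant and method name are invented for illustration. Note that boosts must stay positive, so "older" is better expressed as a boost below 1.0 than as a negative number:

import java.util.Date;
import org.apache.lucene.document.Document;

public class RecencyBoost {
    // Decay the whole document's boost by roughly 10% per month of age.
    // The 0.9 constant is an arbitrary example, not a tuned value.
    public static void applyRecencyBoost(Document doc, Date contentDate) {
        long ageMillis = System.currentTimeMillis() - contentDate.getTime();
        double ageMonths = ageMillis / (1000.0 * 60 * 60 * 24 * 30);
        doc.setBoost((float) Math.pow(0.9, Math.max(0.0, ageMonths)));
    }
}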

Dion


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, November 23, 2003 3:52 PM
 To: Lucene Users List
 Subject: Re: Dates and others
 
 On Saturday, November 22, 2003, at 06:33  PM, Dion Almaer wrote:
  3. I have some fields such as title, owner, etc. as well as 
 the content 
  blob which I index and use as the default search field.  Is 
 there an 
  easy way to extend the QueryParser to merge it with a 
 MultiTermQuery 
  which can also search this meta data and give them certain 
 weights?  
  Or, if you go down this path do you have to leave the QueryParser 
  behind and build your own queries?  Any best practices 
 would be great.
 
 And Ype said:
 You can provide field weights at document indexing time
 (norms) and use a MultiTermQuery for searching multiple
 fields. At query time you can again use field weights.
 I don't know how the scoring of the MultiTermQuery is done;
 it might use the max score over the fields of a document, or
 combine the scores in the fields of a document.
 (end of Ype's reply)
 
 I'm a little confused with this question and Ype's reply.  
 MultiTermQuery is an abstract base class under Query, which 
 is the parent for WildcardQuery and FuzzyQuery.
 
 What I think you're after is using MultiFieldQueryParser, but
 you want to weight the fields differently.  You can add the
 boosts at indexing time using Field.setBoost.  Unfortunately,
 at the moment MultiFieldQueryParser is not very extensible -
 there are some open issues with its subclassability - but
 subclassing MFQP and overriding getFieldQuery will do the
 trick once those issues are resolved, allowing you
 to boost at query time.
 
 Making an educated guess at what you're doing with Lucene, 
 Dion, I'd venture to say that boosting at indexing time is 
 sufficient for your needs.
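To make the index-time option concrete, a small sketch using the Field.setBoost call mentioned above (the field names and the factor 3 are illustrative, not recommendations):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldBoostSketch {
    public static Document makeDoc(String title, String body) {
        Document doc = new Document();
        Field titleField = Field.Text("title", title);
        titleField.setBoost(3.0f);          // title matches count ~3x more
        doc.add(titleField);
        doc.add(Field.Text("contents", body));
        return doc;
    }
}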
 
   Erik
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]