Steve,

Thanks much for the link: very useful!

I looked at the index and found that it contains terms like

electricitythis -- from Doc 3
pain.electricity -- from Doc 1

sentence.he -- from Doc 1

It appears that there is some sort of issue with handling end-of-lines. What do 
I need to change at index time for this to work properly?
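
One possible cause, offered only as a guess because the file-reading code is not shown in this thread: if each file is read line by line with BufferedReader.readLine() and the lines are appended without a separator, the last word of one line is glued to the first word of the next before the analyzer ever sees the text. That would produce strings like "pain.electricity" and "electricitythis". The class and method names below are hypothetical and use only the JDK, not Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class: shows how dropping line terminators glues words together.
final class LineJoinDemo {

    // Assumed buggy pattern: readLine() strips the line terminator, and
    // nothing is appended in its place, so adjacent lines run together.
    public static String joinDroppingNewlines(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            BufferedReader br = new BufferedReader(in);
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    // Possible fix: put a newline back after every line so that word
    // boundaries at line ends survive until tokenization.
    public static String joinKeepingNewlines(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            BufferedReader br = new BufferedReader(in);
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tail of Doc 1 from this thread.
        String doc = "Let someone else get the pain.\n\nelectricity\n";
        // The first call ends in "pain.electricity"; the second keeps the lines apart.
        System.out.println(joinDroppingNewlines(new StringReader(doc)));
        System.out.println(joinKeepingNewlines(new StringReader(doc)));
    }
}
```

If the files are read this way, passing the Reader straight to the field, or re-adding a newline or space per line, may be enough. For the UTF-8 files, a reader built as new InputStreamReader(new FileInputStream(file), "UTF-8") makes the encoding explicit.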


Not sure whether this is relevant, but the text files have been saved as UTF-8 
even though they are ASCII. I need to handle foreign text, so I assume all files 
that I index are UTF-8.

I am using the standard analyzer for English text and other contributed 
analyzers for the respective foreign texts.


Thanks,

Ilya







-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: Monday, March 26, 2012 10:59 AM
To: java-user@lucene.apache.org
Subject: RE: can't find common words -- using Lucene 3.4.0 

Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the 
tokens it emits, in which case your index will have things like "sentence." and 
"sentence?" in it, so querying for "sentence" will not match.
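
To illustrate the effect described above, here is a minimal sketch that uses plain JDK string splitting as a stand-in for an analyzer. It is not Lucene's tokenizer, and the class and method names are hypothetical. It only shows why a term indexed as "sentence." or "sentence?" cannot be found by an exact lookup for "sentence".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for two kinds of analyzers; not Lucene code.
final class PunctuationTokenDemo {

    // Splits on whitespace only, so punctuation stays attached to the word.
    public static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Splits on any run of characters that are not letters or digits,
    // so trailing punctuation is discarded.
    public static List<String> wordTokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc1 = "Here is sentence.";
        String doc2 = "Is here sentence?";
        // Exact term lookup fails when punctuation is part of the token.
        System.out.println(whitespaceTokens(doc1).contains("sentence")); // false
        System.out.println(whitespaceTokens(doc2).contains("sentence")); // false
        // It succeeds once punctuation is stripped during tokenization.
        System.out.println(wordTokens(doc1).contains("sentence"));       // true
        System.out.println(wordTokens(doc2).contains("sentence"));       // true
    }
}
```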

Luke can tell you what's in your index: <http://code.google.com/p/luke/>

Steve

-----Original Message-----
From: Ilya Zavorin [mailto:izavo...@caci.com] 
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0 

I am writing a Lucene-based indexing and search app and testing it using some 
simple docs and queries. I have 3 simple docs, shown at the bottom of 
this email between pairs of "==================="s, and about a dozen terms. 
One of them is "electricity". As you can see, it appears in all three docs. 
However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or 
Doc 3. 

Why is this happening? 

Another query term that appears in all three docs but is found in only some is 
"sentence". I have a bunch of other queries that only appear in one of the 
three docs, and these are all found correctly. 

Is this an indication that I have either set parameters incorrectly when 
indexing or set up the queries incorrectly (or both)? 

Here's how I search:

String qstr = "sentence";
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya



Doc 1: 
===================
BALTIMORE - Ricky Williams sits alone.

Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an 
NFL career.
(US Presswire)
Inside the Baltimore Ravens' locker room the air is alive. Players argue about 
a bean-bag toss game they play after practices, then mock a teammate who has 
inexplicably decided to do an interview naked. Music thumps. Giant men laugh, 
and their laughter rattles off cinder block walls in the symphony of a football 
team that feels invincible.
Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to 
leave. It's a strange image, loaded with contrasts. He doesn't belong here, not 
with these men, many of whom are almost 10 years younger than him. And yet he 
feels very much at home. He isn't the star on this team, which is two wins from 
the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive 
bowling ball of a man who in the spirit of running backs relishes the chance to 
run the ball 25 times a game. Williams is an afterthought, a backup who has 
carried the ball more than 12 times in only one game this season. Often he 
might have the ball in his hands on only four or five plays, and this is fine 
with him. In fact he prefers it. His body has absorbed enough beatings for one 
lifetime. Let someone else get the pain.

electricity


===================

Doc 2:
===================
Dear Cecil:
This question has gnawed at me since I was a young boy. It is a question posed 
every day by countless thousands around the globe and yet I have never heard 
even one remotely legitimate answer. How much wood would a woodchuck chuck if a 
woodchuck could chuck wood?
- R.F.B., Arlington, Virginia
Cecil replies: Is here sentence?
Are you kidding? Everybody knows a woodchuck would chuck as much wood as a 
woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting 
to know why she sells seashells by the seashore.

common term is electricity


===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the 
right to hold the earliest presidential primary, fending off bigger states that 
claimed that the small New England state was too white to represent the 
nation's diverse population. Sentence is here.
In its defense, New Hampshire jokingly brags that its voters won't pick a 
presidential candidate until they've met at least three times face-to-face _ 
rather than seeing the person in television ads or at large events typical of 
bigger states. New Hampshire voters expect to shake hands with candidates at 
coffees that supporters host in their homes or at backyard barbecues.
That tradition paid off in 1976 for a little-known peanut farmer and former 
Georgia governor. Jimmy Carter won in New Hampshire and went on to become 
president.

word Hampshire by itself

this state has electricity

This is a state in the United states of America. Here is one term: United 
America. And Here's another one: States america. And here's yet another == 
UNITED STATES! Here we are dropping the middle stopword: United States          
    America. Finally, we get one word: united. Then the second one: STates. 
Then the final one: America.

===================


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

