Steve,

Thanks much for the link: very useful!

I looked at the index and found that it contains terms like

electricitythis -- from Doc 3
pain.electricity -- from Doc 1
sentence.he -- from Doc 1

It appears that there is some sort of issue with handling end-of-lines. What do I need to change at index time for this to work properly?

Not sure whether this is relevant, but the text files have been saved as UTF8 even though they are ASCII. I need to handle foreign text so I assume all files that I index are UTF8. I am using the standard analyzer for English text and other contributed analyzers for respective foreign texts.

Thanks,

Ilya

-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu]
Sent: Monday, March 26, 2012 10:59 AM
To: java-user@lucene.apache.org
Subject: RE: can't find common words -- using Lucene 3.4.0

Hi Ilya,

What analyzers are you using at index-time and query-time?

My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match.

Luke can tell you what's in your index: <http://code.google.com/p/luke/>

Steve

-----Original Message-----
From: Ilya Zavorin [mailto:izavo...@caci.com]
Sent: Monday, March 26, 2012 10:11 AM
To: java-user@lucene.apache.org
Subject: can't find common words -- using Lucene 3.4.0

I am writing a Lucene based indexing-search app and testing it using some simple docs and queries. I have 3 simple docs that are shown at the bottom of this email between pairs of "==================="s and about a dozen terms. One of them is "electricity". As you can see, it appears in all three docs. However, when I search for it, I only get a hit in Doc 2 but not in Doc 1 or Doc 3. Why is this happening? Another query that appears in all three but is found in only some is "sentence". I have a bunch of other queries that only appear in one of the three docs and these are all found correctly.
Is this an indication that I have either set parameters incorrectly when indexing or set up the queries incorrectly (or both)?

Here's how I search:

    String qstr = "sentence";
    Query query = parser.parse(qstr);
    TopDocs results = searcher.search(query, Integer.MAX_VALUE);
    ScoreDoc[] hits = results.scoreDocs;

I am using Lucene 3.4.0

Thanks much,

Ilya

Doc 1:
===================
BALTIMORE - Ricky Williams sits alone. Ricky Williams is one of 26 running backs to eclipse the 10,000-yard mark in an NFL career. (US Presswire) Inside the Baltimore Ravens' locker room the air is alive. Players argue about a bean-bag toss game they play after practices, then mock a teammate who has inexplicably decided to do an interview naked. Music thumps. Giant men laugh, and their laughter rattles off cinder block walls in the symphony of a football team that feels invincible. Only Ricky Williams sits alone. Here is sentence.
He is huddled on a stool in front of his locker, sweat clothes on, ready to leave. It's a strange image, loaded with contrasts. He doesn't belong here, not with these men, many of whom are almost 10 years younger than him. And yet he feels very much at home. He isn't the star on this team, which is two wins from the Super Bowl. The bulk of the offense is carried by Ray Rice, an effusive bowling ball of a man who in the spirit of running backs relishes the chance to run the ball 25 times a game. Williams is an afterthought, a backup who has carried the ball more than 12 times in only one game this season. Often he might have the ball in his hands on only four or five plays, and this is fine with him. In fact he prefers it. His body has absorbed enough beatings for one lifetime. Let someone else get the pain.
electricity
===================

Doc 2:
===================
Dear Cecil:

This question has gnawed at me since I was a young boy. It is a question posed every day by countless thousands around the globe and yet I have never heard even one remotely legitimate answer.
How much wood would a woodchuck chuck if a woodchuck could chuck wood?

- R.F.B., Arlington, Virginia

Cecil replies:

Is here sentence? Are you kidding? Everybody knows a woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood. Next you'll be wanting to know why she sells seashells by the seashore.
common term is electricity
===================

Doc 3:
===================
CONCORD, N.H. (AP) - For 60 years, New Hampshire has jealously guarded the right to hold the earliest presidential primary, fending off bigger states that claimed that the small New England state was too white to represent the nation's diverse population. Sentence is here. In its defense, New Hampshire jokingly brags that its voters won't pick a presidential candidate until they've met at least three times face-to-face _ rather than seeing the person in television ads or at large events typical of bigger states. New Hampshire voters expect to shake hands with candidates at coffees that supporters host in their homes or at backyard barbecues. That tradition paid off in 1976 for a little-known peanut farmer and former Georgia governor. Jimmy Carter won in New Hampshire and went on to become president.
word Hampshire by itself
this state has electricity
This is a state in the United states of America. Here is one term: United America. And Here's another one: States america. And here's yet another == UNITED STATES! Here we are dropping the middle stopword: United States America. Finally, we get one word: united. Then the second one: STates. Then the final one: America.
===================

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
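
A note on the fused terms reported at the top of this thread (electricitythis, pain.electricity, sentence.he): each one joins the last word of one line to the first word of the next. The indexing code is not shown anywhere in the thread, so the following is only a guess at the cause. It assumes the files are read with BufferedReader.readLine(), which strips the line terminator, and that the lines are then appended back to back before being handed to the analyzer. The class and method names (LineJoinDemo, joinLines, tokenize) are made up for illustration, and tokenize is a crude stand-in that splits on whitespace and trims surrounding punctuation. It is not Lucene's StandardAnalyzer, although as far as I understand that analyzer also does not split at a period sitting directly between two letters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LineJoinDemo {

    // Reads all lines and appends each one followed by `separator`.
    // readLine() drops the line terminator, so an empty separator glues
    // the end of one line to the start of the next.
    static String joinLines(Reader in, String separator) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(separator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Crude stand-in for an analyzer (an assumption, NOT StandardAnalyzer):
    // split on whitespace, trim leading/trailing punctuation, lowercase.
    // Punctuation inside a token, as in "pain.electricity", is kept.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("\\s+")) {
            String t = raw
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Last two lines of Doc 1, with the line break assumed from the
        // "pain.electricity" term found in the index.
        String doc = "Let someone else get the pain.\nelectricity\n";

        // Suspected bug: no separator between lines.
        System.out.println(tokenize(joinLines(new StringReader(doc), "")));

        // Possible fix: put the newline (or any whitespace) back.
        System.out.println(tokenize(joinLines(new StringReader(doc), "\n")));
    }
}
```

If this guess matches the actual indexing code, the fix would be either to re-append a newline or space after each line, or to pass a Reader over the file to the Field directly so the tokenizer sees the original whitespace.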