All, I'm investigating the use of Lucene as a search engine, and have been doing some 'proof-of-concept' coding today. I'm indexing about 650 text files, and then searching against them using QueryParser. Here's the indexing code snippet:
<snip>
public static void Result(IndexWriter indexWriter, File file) throws FileNotFoundException {
    Document document = null;
    String content = "";
    BufferedReader br = new BufferedReader(new FileReader(file));
    boolean EOF = false;
    try {
        while (!EOF) {
            String s = (String) br.readLine();
            if (null == s) {
                EOF = true;
            } else {
                if (!"".equals(s) && "CC>".equals(s.substring(0, 3))) {
                    // CC> control line: start a new document
                    document = new Document();
                    document.add(Field.Text("account", s.substring(3, 7)));
                    document.add(Field.Keyword("created", s.substring(s.indexOf("DC>") + 3, s.indexOf("DC>") + 11)));
                    content = new String();
                } else if (!"".equals(s) && "AN>".equals(s.substring(0, 3))) {
                    // AN> control line: fixed-width fields
                    document.add(Field.Keyword("lastname", s.substring(3, 28).trim().toLowerCase()));
                    document.add(Field.Keyword("firstname", s.substring(28, 43).trim().toLowerCase()));
                    document.add(Field.Text("name", s.substring(28, 43).trim() + " " + s.substring(3, 28).trim()));
                    document.add(Field.Keyword("controlnumber", s.substring(44, 52)));
                    document.add(Field.Keyword("status", s.substring(52, 53).trim()));
                    document.add(Field.Keyword("ssn", s.substring(53, 62)));
                    document.add(Field.Keyword("dob", s.substring(62, 70)));
                    document.add(Field.Keyword("collected", s.substring(137, 145)));
                } else if (!"".equals(s) && "<FF".equals(s.substring(0, 3))) {
                    // <FF line: add the accumulated content and write the document
                    document.add(Field.UnStored("content", content));
                    indexWriter.addDocument(document);
                } else {
                    // any other line is body text
                    content = content + s + "\n";
                }
            }
        }
        br.close();
    } catch (IOException ioe) {
        System.out.println(ioe.getClass() + " caught with message " + ioe.getMessage());
    }
}
</snip>

Each text file begins with two control lines, CC> and AN>. I extract particular fields from these lines and add them to my document. Everything (I think) indexes correctly. When I search against this index, though, I get some weird results, especially when the query ends with a '*' wildcard.
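A side note on the parsing itself, separate from the search question: `s.substring(0, 3)` throws a StringIndexOutOfBoundsException for any non-empty line shorter than three characters, and the fixed offsets do the same on a short AN> line. Below is a minimal sketch of a bounds-safe variant. It assumes a newer Java than I'm running (generics), and the class and method names (`ControlLineFields`, `hasTag`, `slice`, `parseAnLine`) are hypothetical helpers, not part of Lucene. The column offsets are the ones from the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): bounds-safe parsing of the AN> control line. */
final class ControlLineFields {

    private ControlLineFields() {}

    /** True when the line begins with the tag; safe for lines shorter than the tag. */
    static boolean hasTag(String line, String tag) {
        return line != null && line.startsWith(tag);
    }

    /** Returns the characters in [start, end), clamping both offsets to the line length. */
    static String slice(String line, int start, int end) {
        int from = Math.min(Math.max(start, 0), line.length());
        int to = Math.min(Math.max(end, from), line.length());
        return line.substring(from, to);
    }

    /** Extracts the AN> fields with the same offsets and trimming as the indexing snippet. */
    static Map<String, String> parseAnLine(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        String last = slice(line, 3, 28).trim();
        String first = slice(line, 28, 43).trim();
        fields.put("lastname", last.toLowerCase());
        fields.put("firstname", first.toLowerCase());
        fields.put("name", first + " " + last);
        fields.put("controlnumber", slice(line, 44, 52));
        fields.put("status", slice(line, 52, 53).trim());
        fields.put("ssn", slice(line, 53, 62));
        fields.put("dob", slice(line, 62, 70));
        // A line shorter than 145 characters yields an empty value instead of an exception.
        fields.put("collected", slice(line, 137, 145));
        return fields;
    }
}
```

In the indexing loop, `hasTag(s, "AN>")` would stand in for the `"AN>".equals(s.substring(0, 3))` test, and each map entry would be passed to the matching `Field.Keyword` or `Field.Text` call.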
Here's the search code snippet:

<snip>
public static void main(String[] args) {
    try {
        Searcher searcher = new IndexSearcher("c:\\ResultIndex");
        Analyzer analyzer = new StandardAnalyzer();
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        while (true) {
            System.out.println("Query: ");
            String s = br.readLine();
            if (null == s) {
                break;
            } else {
                // "content" is the default field for terms without a field prefix
                Query query = QueryParser.parse(s, "content", analyzer);
                System.out.println("Searching for: " + query.toString("content"));
                Hits hits = searcher.search(query);
                System.out.println("... Found " + hits.length() + " matching documents");
                System.out.println("");
                for (int i = 0; i < hits.length(); i++) {
                    Document document = hits.doc(i);
                    System.out.println("Hit " + i + ": Specimen = " + document.get("controlnumber")
                        + ", Account = " + document.get("account")
                        + ", Status = " + document.get("status")
                        + ", Name = " + document.get("name")
                        + ", SSN = " + document.get("ssn")
                        + ", DOB = " + document.get("dob")
                        + ", Collected = " + document.get("collected")
                        + ", Created = " + document.get("created"));
                    //System.out.println(document.get("content"));
                }
            }
        }
    } catch (Exception e) {
        System.out.println(e.getClass() + " caught with message " + e.getMessage());
    }
}
</snip>

When I run this with the query string lastname:mar* I get back the following:

Query:
lastname:mar*
Searching for: lastname:mar*
... Found 9 matching documents

Hit 0: Specimen = 40062720, Account = 0001, Status = N, Name = LOIS MARTIN, SSN = 536628498, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 1: Specimen = 38843845, Account = 4NEK, Status = N, Name = RENEE CAPPETTA, SSN = 585132901, DOB = 19010101, Collected = 20050117, Created = 20050119
Hit 2: Specimen = 39894441, Account = 3384, Status = N, Name = LINDA CANTU, SSN = 453539817, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 3: Specimen = 39894441, Account = 3384, Status = N, Name = LINDA CANTU, SSN = 453539817, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 4: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 5: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 6: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 7: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119
Hit 8: Specimen = 38247027, Account = 23SQ, Status = N, Name = ROBERT BASTOW, SSN = 528960058, DOB = 19010101, Collected = 20050118, Created = 20050119

I'm at a loss to explain why I'm getting hits 1 - 8: the last names don't start with mar! Several of the hits are also exact duplicates of each other. I suspect it is due to an incorrect use of Field.Keyword vs Field.Text in the indexer, but I can't seem to figure it out...

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.