worked like a charm! thx!
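For anyone finding this in the archive: below is a rough sketch of the recursive clause walk Uwe describes in the quoted thread. The classes Q, TermQ, PhraseQ and BoolQ are hypothetical stand-ins, not Lucene's, so the sketch runs without Lucene on the classpath. With Lucene 3.x the corresponding calls would be BooleanQuery.getClauses(), TermQuery.getTerm() and PhraseQuery.getTerms(), and QueryTermExtractor.getTerms(query) does a similar walk for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Lucene's Query classes (assumption: only the
// shape of the tree matters for this sketch, not the real API).
abstract class Q {}

class TermQ extends Q {
    final String field;
    final String text;
    TermQ(String field, String text) { this.field = field; this.text = text; }
}

class PhraseQ extends Q {
    final String field;
    final List<String> texts;
    PhraseQ(String field, String... texts) { this.field = field; this.texts = Arrays.asList(texts); }
}

class BoolQ extends Q {
    final List<Q> clauses;
    BoolQ(Q... clauses) { this.clauses = Arrays.asList(clauses); }
}

class QueryTermWalk {

    // Collects the term texts for one field, in clause order.
    static List<String> collectTerms(Q query, String field) {
        List<String> out = new ArrayList<String>();
        collect(query, field, out);
        return out;
    }

    // Recursive instanceof checks: descend into boolean clauses,
    // pick up terms from term and phrase queries.
    private static void collect(Q q, String field, List<String> out) {
        if (q instanceof BoolQ) {
            for (Q clause : ((BoolQ) q).clauses) {
                collect(clause, field, out);
            }
        } else if (q instanceof TermQ) {
            TermQ t = (TermQ) q;
            if (t.field.equals(field)) out.add(t.text);
        } else if (q instanceof PhraseQ) {
            PhraseQ p = (PhraseQ) q;
            if (p.field.equals(field)) out.addAll(p.texts);
        }
        // Any other query type is ignored in this sketch.
    }

    public static void main(String[] args) {
        // Same shape as the toString() output from the thread:
        // contents:800 (contents:555 contents:1212)
        Q nested = new BoolQ(
                new TermQ("contents", "800"),
                new BoolQ(new TermQ("contents", "555"), new TermQ("contents", "1212")));
        System.out.println(collectTerms(nested, "contents")); // prints [800, 555, 1212]
    }
}
```

Both the flat and the nested toString() shapes from the thread reduce to the same list of terms this way, so nothing has to be parsed out of the string form.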
________________________________________
From: Jack Krupansky [j...@basetechnology.com]
Sent: Thursday, June 14, 2012 3:30 PM
To: java-user@lucene.apache.org
Subject: Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Look at this code:

QueryTermExtractor.getTerms(Query query)

http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html

-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Thursday, June 14, 2012 2:36 PM
To: java-user@lucene.apache.org
Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Uwe, sorry but I am having trouble understanding this. Can you point me to a place in documentation that explains this in more detail (I've read http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html but still am confused) or some example code?

Thanks much,

Ilya

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, June 14, 2012 12:57 PM
To: java-user@lucene.apache.org
Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Just take the BooleanQuery returned by the QueryParser and get its clauses (sub-queries like TermQuery, PhraseQuery, other BooleanQuery...). By that you get all query components. In most cases some recursive instanceof checking for various Query subclasses can do this.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Ilya Zavorin [mailto:izavo...@caci.com]
> Sent: Thursday, June 14, 2012 6:49 PM
> To: java-user@lucene.apache.org
> Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
>
> OK, so I figured out what the problem was.
> It wasn't with the digits but rather with the various delimiters like "(" and "-" that I use.
>
> Essentially, the statement
>
> String[] subTerms = qstr.split("\\s+");
>
> does not split a query the same way the query parser would. And thanks, query.toString() helped me see that.
>
> My question now is this: is there an easy way to extract a sequence of substrings from the query to use in place of the subTerms array I get from split?
>
> I see that sometimes query.toString() returns things like
>
> "contents:800 contents:555 contents:1212"
>
> but other times it's something like
>
> "contents:800 (contents:555 contents:1212)"
>
> So instead of trying to guess what other formats query.toString can produce and trying to parse those, can I somehow extract the substrings of the query reliably?
>
> Thanks!
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Wednesday, June 13, 2012 11:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
>
> Try putting the phone number in quotes in the query:
>
> String qstr = "\"800-555-1212\"";
>
> And check query.toString to see how the query parser analyzed the term, both with and without quotes.
>
> And make sure you initialized the query parser with "contents" as the default field.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Ilya Zavorin
> Sent: Wednesday, June 13, 2012 10:52 PM
> To: java-user@lucene.apache.org
> Subject: need to find locations of query hits in doc: works fine for regular text but not for phone numbers
>
> Hello All,
>
> I am using Lucene 3.4. I need to find locations of query hits in a document. What I've implemented works fine for textual queries but does not work for phone numbers.
>
> Here's how I index my docs:
>
> String oc = "Joe dialed 800-555-1212 but got a busy signal";
> doc.add(new Field("contents", oc, Field.Store.NO,
>     Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> Now, here is how I find locations. I search for a query. If I get a hit, I split my query (in case it's multi-word) into words and search for each of them using TermFreqVector like this:
>
> //String qstr = "my multiword query"; // for queries like this it works fine...
> String qstr = "800-555-1212"; // ...but not for ones like this
> Query query = parser.parse(qstr);
> TopDocs results = searcher.search(query, Integer.MAX_VALUE);
> ScoreDoc[] hits = results.scoreDocs;
>
> String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
>
> for (int i = 0; i < hits.length; i++) {
>     int docId = hits[i].doc;
>     Document doc = searcher.doc(docId);
>
>     TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
>     TermPositionVector tpvector = (TermPositionVector) tfvector;
>
>     for (String subTerm : subTerms) {
>         String subq = subTerm.toLowerCase();
>         int termidx = tfvector.indexOf(subq); // get termidx = -1 here
>
>         TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
>         for (int j = 0; j < tvoffsetinfo.length; j++) {
>             int offsetStart = tvoffsetinfo[j].getStartOffset();
>             int offsetEnd = tvoffsetinfo[j].getEndOffset();
>             // ...
>
> For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing wrong?
>
> Thanks,
>
> Ilya Zavorin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
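
A minimal sketch of why the lookup in the original question comes back as -1, using only the JDK. Here roughTokens is a hypothetical, crude stand-in for whatever analyzer was used at index time (it simply lowercases and breaks on anything that is not a letter or digit), and the list lookup stands in for tfvector.indexOf. Neither is Lucene code, and a real analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PhoneTermLookup {

    // Hypothetical stand-in for an index-time analyzer: lowercases and breaks
    // on anything that is not a letter or digit. Not Lucene's analyzer.
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) { // split() can yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the candidates that are present in indexedTerms, skipping
    // misses instead of passing a -1 index on to later code.
    static List<String> found(List<String> indexedTerms, List<String> candidates) {
        List<String> hits = new ArrayList<String>();
        for (String c : candidates) {
            int idx = indexedTerms.indexOf(c); // -1 when the term was never indexed
            if (idx >= 0) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexed = roughTokens("Joe dialed 800-555-1212 but got a busy signal");
        String qstr = "800-555-1212";

        // Whitespace split keeps the phone number as one string, which was never indexed.
        List<String> bySpace = Arrays.asList(qstr.split("\\s+"));
        System.out.println(found(indexed, bySpace)); // prints []

        // Tokenizing the query the same way as the text gives terms that do exist.
        System.out.println(found(indexed, roughTokens(qstr))); // prints [800, 555, 1212]
    }
}
```

The point is only that the terms you look up have to come from the same tokenization as the indexed text, and that a -1 index should be skipped rather than handed to getOffsets.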