OK, so I figured out what the problem was. It wasn't with the digits but rather
with the various delimiters like "(" and "-" that I use.
Essentially, the statement
String[] subTerms = qstr.split("\\s+");
does not split the query the same way the query parser does. Thanks for the
query.toString() tip; it helped me see that.
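To make the mismatch concrete, here is a minimal standalone sketch in plain Java (no Lucene classes). The class name and the approximateTerms helper are hypothetical, and the helper only imitates the terms I saw in query.toString() for my phone-number examples by breaking on anything that is not a letter or digit. It is an assumption about the analyzer's behavior, not its real tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration only: no Lucene classes are used here.
class SplitMismatchSketch {

    // Hypothetical helper: lowercases the query and breaks it on any run of
    // characters that is not a letter or digit. This merely imitates the
    // terms seen in query.toString() for the phone-number examples; it is
    // NOT the analyzer's real tokenization.
    static List<String> approximateTerms(String qstr) {
        List<String> terms = new ArrayList<String>();
        for (String piece : qstr.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // a leading delimiter yields an empty first piece
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String qstr = "800-555-1212";
        // Whitespace split keeps the phone number as one piece.
        System.out.println(Arrays.toString(qstr.split("\\s+")));
        // Delimiter-aware split yields the three digit groups.
        System.out.println(approximateTerms(qstr));
    }
}
```

Pieces like these are presumably what tfvector.indexOf would need to be given, but a hand-rolled split like this can still drift from whatever the analyzer actually produces.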
My question now is this: is there a way of easily extracting a sequence of
substrings from query to use in place of the subTerms array I get from split?
I see that sometimes query.toString() returns things like
"contents:800 contents:555 contents:1212"
but other times it's something like
"contents:800 (contents:555 contents:1212)"
So instead of trying to guess what other formats query.toString can produce and
trying to parse those, can I somehow extract the substrings of the query
reliably?
Thanks!
-----Original Message-----
From: Jack Krupansky [mailto:[email protected]]
Sent: Wednesday, June 13, 2012 11:42 PM
To: [email protected]
Subject: Re: need to find locations of query hits in doc: works fine for
regular text but not for phone numbers
Try putting the phone number in quotes in the query:
String qstr = "\"800-555-1212\"";
And check query.toString to see how the query parser analyzed the term, both
with and without quotes.
And make sure you initialized the query parser with "contents" as the default
field.
-- Jack Krupansky
-----Original Message-----
From: Ilya Zavorin
Sent: Wednesday, June 13, 2012 10:52 PM
To: [email protected]
Subject: need to find locations of query hits in doc: works fine for regular
text but not for phone numbers
Hello All,
I am using 3.4. I need to find locations of query hits in a document. What I've
implemented works fine for textual queries but does not work for phone numbers.
Here's how I index my docs:
String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents", oc, Field.Store.NO, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS_OFFSETS));
Now, here's how I find locations. I search for a query. If I get a hit, I split
my query (in case it's multi-word) into words and search for each of them using
TermFreqVector like this:
//String qstr = "my multiword query"; // for queries like this it works fine...
String qstr = "800-555-1212"; // ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
String[] subTerms = qstr.split("\\s+"); // phone string stays intact here
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);
    TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
    TermPositionVector tpvector = (TermPositionVector) tfvector;
    for (String subTerm : subTerms) {
        String subq = subTerm.toLowerCase();
        int termidx = tfvector.indexOf(subq); // get termidx = -1 here
        TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
        for (int j = 0; j < tvoffsetinfo.length; j++) {
            int offsetStart = tvoffsetinfo[j].getStartOffset();
            int offsetEnd = tvoffsetinfo[j].getEndOffset(); // ...
For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing
wrong?
Thanks,
Ilya Zavorin
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]