Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Jack Krupansky Thu, 14 Jun 2012 12:30:57 -0700

Look at this code: QueryTermExtractor.getTerms(Query query)
http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html


-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Thursday, June 14, 2012 2:36 PM
To: [email protected]
Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Uwe, sorry, but I am having trouble understanding this. Can you point me
to a place in the documentation that explains this in more detail (I've read
http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html
but am still confused) or to some example code?


Thanks much,

Ilya


-----Original Message-----
From: Uwe Schindler [mailto:[email protected]]
Sent: Thursday, June 14, 2012 12:57 PM
To: [email protected]

Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Just take the BooleanQuery returned by the QueryParser and get its clauses (sub-queries like TermQuery, PhraseQuery, other BooleanQuery...). That gives you all the query components. In most cases, some recursive instanceof checking for the various Query subclasses can do this.
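
The recursive walk described above can be sketched without depending on Lucene itself. In the sketch below, TermNode, PhraseNode, and BoolNode are hypothetical stand-ins for TermQuery, PhraseQuery, and BooleanQuery, and QueryTermCollector is an illustrative name, not a Lucene class. With the real classes the same instanceof checks would apply, reading each piece through the accessors of the installed Lucene version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Lucene's Query subclasses, for illustration only.
abstract class QueryNode {}

class TermNode extends QueryNode {
    final String text;
    TermNode(String text) { this.text = text; }
}

class PhraseNode extends QueryNode {
    final List<String> terms;
    PhraseNode(String... terms) { this.terms = Arrays.asList(terms); }
}

class BoolNode extends QueryNode {
    final List<QueryNode> clauses;
    BoolNode(QueryNode... clauses) { this.clauses = Arrays.asList(clauses); }
}

class QueryTermCollector {

    // Recursively append term texts to out, descending into nested boolean nodes.
    static void collectInto(QueryNode q, List<String> out) {
        if (q instanceof TermNode) {
            out.add(((TermNode) q).text);
        } else if (q instanceof PhraseNode) {
            out.addAll(((PhraseNode) q).terms);
        } else if (q instanceof BoolNode) {
            for (QueryNode clause : ((BoolNode) q).clauses) {
                collectInto(clause, out);
            }
        }
        // Any other node type is ignored in this sketch.
    }

    // Convenience wrapper that returns the collected terms in query order.
    static List<String> collect(QueryNode q) {
        List<String> out = new ArrayList<String>();
        collectInto(q, out);
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the shape "contents:800 (contents:555 contents:1212)".
        QueryNode q = new BoolNode(
                new TermNode("800"),
                new BoolNode(new TermNode("555"), new TermNode("1212")));
        System.out.println(collect(q)); // [800, 555, 1212]
    }
}
```

Because the walk recurses on every boolean node, the flat and the nested forms of the query yield the same term list, so there is no need to parse the output of query.toString().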


Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

-----Original Message-----
From: Ilya Zavorin [mailto:[email protected]]
Sent: Thursday, June 14, 2012 6:49 PM
To: [email protected]
Subject: RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

OK, so I figured out what the problem was. It wasn't with the digits but
rather with the various delimiters like "(" and "-" that I use.

Essentially, the statement

String[] subTerms = qstr.split("\\s+");

does not split a query the same way the query parser does.
And thanks, query.toString() helped me see that.
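
The mismatch can be reproduced without Lucene. This is a minimal sketch, assuming the analyzer breaks tokens at any character that is not a letter or digit and lowercases them. That rule is only a rough stand-in for the real analyzer, chosen because it matches the "contents:800 contents:555 contents:1212" output seen in this thread. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: contrasts whitespace splitting
// with a rough approximation of analyzer-style tokenization.
class SplitVsAnalyzerSketch {

    // What the original code does: split on whitespace only.
    static List<String> whitespaceSplit(String q) {
        return Arrays.asList(q.trim().split("\\s+"));
    }

    // Assumption: tokens break at any non-letter, non-digit character and are
    // lowercased. A real analyzer applies more rules than this.
    static List<String> analyzerLikeSplit(String q) {
        List<String> tokens = new ArrayList<String>();
        for (String t : q.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) { // skip the empty piece before a leading delimiter
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String qstr = "(800) 555-1212";
        System.out.println(whitespaceSplit(qstr));   // [(800), 555-1212]
        System.out.println(analyzerLikeSplit(qstr)); // [800, 555, 1212]
    }
}
```

The whitespace split keeps "(" and "-" inside the pieces, so none of them can match a term stored in the term vector, which would explain an index of -1.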

My question now is this: is there an easy way to extract a sequence of
substrings from the query to use in place of the subTerms array I get
from split?


I see that sometimes query.toString() returns things like

"contents:800 contents:555 contents:1212"

but other times it's something like

"contents:800 (contents:555 contents:1212)"

So instead of trying to guess what other formats query.toString can
produce and trying to parse those, can I somehow extract the substrings
of the query reliably?

Thanks!


-----Original Message-----
From: Jack Krupansky [mailto:[email protected]]
Sent: Wednesday, June 13, 2012 11:42 PM
To: [email protected]
Subject: Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Try putting the phone number in quotes in the query:

String qstr = "\"800-555-1212\"";

And check query.toString to see how the query parser analyzed the term,
both with and without quotes.

And make sure you initialized the query parser with "contents" as the
default field.

-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Wednesday, June 13, 2012 10:52 PM
To: [email protected]
Subject: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Hello All,

I am using 3.4. I need to find locations of query hits in a document.
What I've implemented works fine for textual queries but does not work
for phone numbers.

Here's how I index my docs:

String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents", oc, Field.Store.NO,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));


Now, here's how I find locations. I search for a query. If I get a hit,
I split my query (in case it's multi-word) into words and search for
each of them using TermFreqVector like this:


//String qstr = "my multiword query"; // for queries like this it works fine...
String qstr = "800-555-1212"; // ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

String[] subTerms = qstr.split("\\s+"); // phone string stays intact here

for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);

    TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
    TermPositionVector tpvector = (TermPositionVector) tfvector;

    for (String subTerm : subTerms) {
        String subq = subTerm.toLowerCase();
        int termidx = tfvector.indexOf(subq); // get termidx = -1 here

        TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
        for (int j = 0; j < tvoffsetinfo.length; j++) {
            int offsetStart = tvoffsetinfo[j].getStartOffset();
            int offsetEnd = tvoffsetinfo[j].getEndOffset();
            // ...

For a query like "800-555-1212", tfvector.indexOf returns -1. What am I
doing wrong?

Thanks,

Ilya Zavorin


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

