RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin
OK, so I figured out what the problem was. It wasn't with the digits but rather 
with the various delimiters like ( and - that I use.

Essentially, the statement 

String[] subTerms = qstr.split("\\s+");

does not split a query the same way the query parser does. And thanks, 
query.toString() helped me see that.
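
To make the mismatch concrete, here is a minimal sketch in plain Java. It assumes the analyzer breaks tokens on punctuation such as ( and -, and it imitates that with a regular expression; the class and method names are made up for illustration, and the regex is only a rough stand-in, not Lucene's actual tokenizer.

```java
import java.util.Arrays;

final class SplitVersusAnalyzer {

    // What the code in this thread does: split the raw query on whitespace only.
    static String[] whitespaceSplit(String qstr) {
        return qstr.trim().split("\\s+");
    }

    // Rough stand-in (an assumption, not Lucene code) for an analyzer that
    // breaks on punctuation: lowercase the input, then split on every run of
    // characters that are neither letters nor digits, dropping empty pieces.
    static String[] punctuationSplit(String qstr) {
        return Arrays.stream(qstr.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String qstr = "800-555-1212";
        // One token: the whole phone number, which is not a term in the index.
        System.out.println(Arrays.toString(whitespaceSplit(qstr)));
        // Three tokens, matching the three terms shown by query.toString().
        System.out.println(Arrays.toString(punctuationSplit(qstr)));
    }
}
```

The whitespace split keeps the phone number as a single string, so a lookup of that string in the term vector finds nothing, while the punctuation-aware split yields the three pieces that the parsed query actually contains.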

My question now is this: is there a way of easily extracting a sequence of 
substrings from query to use in place of the subTerms array I get from split?

I see that sometimes query.toString() returns things like 

contents:800 contents:555 contents:1212 

but other times it's something like

contents:800 (contents:555 contents:1212)

So instead of trying to guess what other formats query.toString can produce and 
trying to parse those, can I somehow extract the substrings of the query 
reliably?

Thanks!



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 





RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Uwe Schindler
Just take the BooleanQuery returned by the QueryParser and get its clauses
(sub-queries like TermQuery, PhraseQuery, other BooleanQuery...). That way
you get all of the query's components. In most cases, some recursive instanceof
checking against the various Query subclasses can do this.
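
The recursive instanceof walk described above can be sketched roughly as follows. To keep the example runnable without Lucene on the classpath, it uses small hypothetical stand-in classes (Node, TermNode, PhraseNode, BoolNode) in place of Query, TermQuery, PhraseQuery and BooleanQuery. In Lucene 3.x the corresponding accessors would be BooleanQuery.getClauses() with BooleanClause.getQuery(), TermQuery.getTerm() and PhraseQuery.getTerms(); this is an illustration of the idea, not code from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QueryWalkSketch {

    // Hypothetical stand-ins for Lucene's Query, TermQuery, PhraseQuery and
    // BooleanQuery; they carry only what the walk needs.
    interface Node {}

    static final class TermNode implements Node {
        final String text;
        TermNode(String text) { this.text = text; }
    }

    static final class PhraseNode implements Node {
        final List<String> terms;
        PhraseNode(String... terms) { this.terms = Arrays.asList(terms); }
    }

    static final class BoolNode implements Node {
        final List<Node> clauses;
        BoolNode(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Recursively collect term texts, visiting clauses in order.
    static void collectTerms(Node node, List<String> out) {
        if (node instanceof TermNode) {
            out.add(((TermNode) node).text);
        } else if (node instanceof PhraseNode) {
            out.addAll(((PhraseNode) node).terms);
        } else if (node instanceof BoolNode) {
            for (Node clause : ((BoolNode) node).clauses) {
                collectTerms(clause, out);
            }
        }
        // Any other node type is silently skipped in this sketch.
    }

    static List<String> terms(Node root) {
        List<String> out = new ArrayList<>();
        collectTerms(root, out);
        return out;
    }

    public static void main(String[] args) {
        // Same shape as "contents:800 (contents:555 contents:1212)".
        Node query = new BoolNode(
                new TermNode("800"),
                new BoolNode(new TermNode("555"), new TermNode("1212")));
        System.out.println(terms(query));
    }
}
```

Because the walk recurses into nested boolean nodes, both the flat and the parenthesized forms of the parsed query produce the same list of terms.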

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de





Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Chris Hostetter

: Subject: need to find locations of query hits in doc: works fine for regular
:  text but not for phone numbers
: Message-ID: a57498edec10c64781ea0f7dba665cef264de...@ex2010mb01-1.caci.com
: References: 1339635547170-3989548.p...@n3.nabble.com
: In-Reply-To: 1339635547170-3989548.p...@n3.nabble.com

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss




RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin


Uwe, sorry but I am having trouble understanding this. Can you point me to a 
place in documentation that explains this in more detail (I've read 
http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html
 but still am confused) or some example code?

Thanks much,

Ilya



Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Jack Krupansky

Look at this code: QueryTermExtractor.getTerms(Query query)
http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html

-- Jack Krupansky




RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin

worked like a charm!

thx!



need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-13 Thread Ilya Zavorin
Hello All,

I am using Lucene 3.4. I need to find locations of query hits in a document. What I've 
implemented works fine for textual queries but does not work for phone numbers. 

Here's how I index my docs:

String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents",
                  oc,
                  Field.Store.NO,
                  Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));


Now, here is how I find locations. I search for a query. If I get a hit, I split 
my query (in case it's multi-word) into words and search for each of them using 
TermFreqVector like this:


//String qstr = "my multiword query";   // for queries like this it works fine...
String qstr = "800-555-1212";   // ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

String[] subTerms = qstr.split("\\s+"); // phone string stays intact here

for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);

    TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
    TermPositionVector tpvector = (TermPositionVector)tfvector;

    for (String subTerm : subTerms)
    {
        String subq = subTerm.toLowerCase();
        int termidx = tfvector.indexOf(subq);  // get termidx = -1 here

        TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
        for (int j = 0; j < tvoffsetinfo.length; j++) {
            int offsetStart = tvoffsetinfo[j].getStartOffset();
            int offsetEnd = tvoffsetinfo[j].getEndOffset();
            // ...

For a query like 800-555-1212, tfvector.indexOf returns -1. What am I doing 
wrong? 

Thanks,

Ilya Zavorin





Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-13 Thread Jack Krupansky

Try putting the phone number in quotes in the query:

String qstr = "\"800-555-1212\"";

And check query.toString() to see how the query parser analyzed the term, both 
with and without quotes.


And make sure you initialized the query parser with "contents" as the 
default field.


-- Jack Krupansky
